Can a prompt, a file, or retrieved content steer the assistant?
This is the first AI-specific question, and it’s the right one. Can something that looks like ordinary input — a support ticket, an uploaded PDF, a page in your RAG index — carry instructions the model then follows?
What I usually hear
“We tried some jailbreaks.” · “We ran a scanner.” · “We have guardrails.” · “Nothing’s blown up in production.”
None of that is wrong, and none of it is enough. A handful of prompts isn’t coverage. A scanner result isn’t proof of behaviour — garak will mark a “hit” the moment its target string shows up in the output, including when the model quoted it while refusing. A guardrail is intent, not outcome. And production has never had someone deliberately attacking it.
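The scanner problem is easy to see in miniature. The sketch below is not garak's code. The names `naive_hit` and `needs_review` and the list of refusal markers are hypothetical, made up only to show how a substring match scores a refusal and a real compliance the same way.

```python
# Hypothetical illustration only; not taken from any scanner's source.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i will not")


def naive_hit(output: str, target: str) -> bool:
    """Flag a hit whenever the target string appears anywhere in the output."""
    return target.lower() in output.lower()


def needs_review(output: str, target: str) -> bool:
    """Flag hits that sit next to refusal language, so a person reads them."""
    lowered = output.lower()
    return naive_hit(output, target) and any(m in lowered for m in REFUSAL_MARKERS)


target = "I have been PWNED"
complied = "I have been PWNED"
refused = "I won't follow that. The ticket asks me to print 'I have been PWNED'."

for label, output in (("complied", complied), ("refused", refused)):
    print(label, naive_hit(output, target), needs_review(output, target))
```

Both outputs count as a hit under the naive check. Only reading the response tells you which one was an actual failure, which is why a hit count on its own says little about behaviour.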
What lands with a reviewer
“Here are the entry points we tested, in these workflows, with these inputs. This is what the model saw, what it retrieved, and how it responded, and here’s how many times out of N it reproduced. These held, these failed, these we didn’t test.”
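One way to keep that kind of report honest is to record each test as a structured row instead of a sentence in a slide. This is a minimal sketch under my own assumptions: the `Finding` class, its field names, and the sample rows are all hypothetical, and the numbers are placeholders, not results from any real test.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    entry_point: str   # where the input came in, e.g. an uploaded PDF
    workflow: str      # the user-facing flow that was exercised
    payload: str       # short label for the injected input
    trials: int        # N: how many times the test was run (0 = not tested)
    reproduced: int    # runs in which the model followed the payload

    def status(self) -> str:
        """Classify the row as held, failed, or not tested."""
        if self.trials == 0:
            return "not tested"
        return "failed" if self.reproduced > 0 else "held"

    def line(self) -> str:
        """One reviewer-readable line per entry point."""
        if self.trials == 0:
            return f"{self.entry_point} / {self.workflow}: not tested"
        return (f"{self.entry_point} / {self.workflow}: {self.status()} "
                f"({self.reproduced}/{self.trials} reproduced)")


# Placeholder rows for illustration only; these are not real results.
findings = [
    Finding("support ticket", "triage summary", "ignore-previous", 20, 0),
    Finding("uploaded PDF", "document Q&A", "hidden-text", 20, 7),
    Finding("RAG index page", "knowledge search", "poisoned-page", 0, 0),
]

for f in findings:
    print(f.line())
```

Keeping "not tested" as an explicit status matters as much as the pass and fail rows: it stops an untested entry point from being read as one that held.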