Boffins build 'AI Kill Switch' to thwart unwanted agents
- Reference: 1763708468
- News link: https://www.theregister.co.uk/2025/11/21/boffins_build_ai_kill_switch/
- Source link:
Unlike network-based defenses that attempt to block ill-behaved web crawlers based on IP address, request headers, or other characteristics derived from analysis of bot behavior or associated data, the researchers propose using a more sophisticated form of indirect prompt injection to make bad bots back off.
Sechan Lee, an undergraduate computer scientist at Sungkyunkwan University, and Sangdon Park, assistant professor of Graduate School of Artificial Intelligence (GSAI) and Computer Science and Engineering (CSE) at the Pohang University of Science and Technology, call their agent defense AutoGuard.
[1]
They describe the software in a preprint [2]paper , which is currently under review as a conference paper at the International Conference on Learning Representations (ICLR) 2026.
[3]
[4]
Commercial AI models and most open source models include some form of safety check or alignment process that mean they refuse to comply with unlawful or harmful requests.
AutoGuard’s authors designed their software to craft defensive prompts that stop AI agents in their tracks by triggering these built-in refusal mechanisms.
[5]
AI agents consist of an AI component – one or more AI models – and software tools like Selenium, BeautifulSoup4, and Requests that the model can use to automate web browsing and information gathering.
LLMs rely on two primary sets of instructions: system instructions that define in natural language how the model should behave, and user input. Because AI models cannot easily distinguish between the two, it's possible to make the model interpret user input as a system directive that overrides other system directives.
Such overrides are called “direct prompt injection” and involve submitting a prompt to a model that asks it to "Ignore previous instructions." If that succeeds, users can take some actions that models’ designers tried to disallow.
[6]
There's also indirect prompt injection, which sees a user prompt a model to ingest content that directs the model to alter its system-defined behavior. An example would be web page text that directs a visiting AI agent to [7]exfiltrate data using the agent owner's email account – something that might be possible with a web browsing agent that has access to an email application and the appropriate credentials.
Almost every LLM is vulnerable to some form of prompt injection, because models cannot easily distinguish between system instructions and user instructions. Developers of major commercial models have added defensive layers to mitigate this risk, but those protections are not perfect – a flaw that helps AutoGuard’s authors.
"AutoGuard is a special case of indirect prompt injection, but it is used for good-will, i.e., defensive purposes," explained Sangdon Park in an email to The Register . "It includes a feedback loop (or a learning loop) to evolve the defensive prompt with regard to a presumed attacker – you may feel that the defensive prompt depends on the presumed attacker, but it also generalizes well because the defensive prompt tries to trigger a safe-guard of an attacker LLM, assuming the powerful attacker (e.g., GPT-5) should be also aligned to safety rules."
Park added that training attack models that are performant but lack safety alignment is a very expensive process, which introduces higher entry barriers to attackers.
[8]AWS under pressure as big three battle to eat the cloud market
[9]Gemini tries to sniff out AI slop images while also making them easier to create
[10]LLM-generated malware is improving, but don't expect autonomous attacks tomorrow
[11]Scientific computing is about to get a massive injection of AI
AutoGuard’s inventors intend it to block three specific forms of attack: the illegal scraping of personal information from websites; the posting of comments on news articles that are designed to sow discord; and LLM-based vulnerability scanning. It's not intended to replace other bot defenses but to complement them.
The system [12]consists of Python code that calls out to two LLMs – a Feedback LLM and a Defender LLM – that work together in an iterative loop to formulate a viable indirect prompt injection attack. For this project, GPT-OSS-120B served as the Feedback LLM and GPT-5 served as the Defender LLM.
Park said that the deployment cost is not significant, adding that the defensive prompt is relatively short – an example in the paper's appendix runs about two full pages of text – and barely affects site load time. "In short, we can generate the defensive prompt with reasonable cost, but optimizing the training time could be a possible future direction," he said.
AutoGuard requires website admins to load the defensive prompt. It is invisible to human visitors – the enclosing HTML DIV element has its style attribute set to "display: none;" – but readable by visiting AI agents. In most of the test cases, the instructions made the unwanted AI agent stop its activities.
"Experimental results show that the AutoGuard method achieves over 80 percent Defense Success Rate (DSR) on malicious agents, including GPT-4o, Claude-3, and Llama3.3-70B-Instruct," the authors claim in their paper. "It also maintains strong performance, achieving around 90 percent DSR on GPT-5, GPT-4.1, and Gemini-2.5-Flash when used as the malicious agent, demonstrating robust generalization across models and scenarios."
That's significantly better than the 0.91 percent average DSR recorded for non-optimized indirect prompt injection text, added to a website to deter AI agents. It's also better than the 6.36 percent average DSR recorded for warning-based prompts – text added to a webpage that claims the site contains legally protected information, an effort to trigger a visiting agent's refusal mechanism.
The authors note, however, that their technique has limitations. They only tested it on synthetic websites rather than real ones, due to ethical and legal concerns, and only on text-based models. They expect AutoGuard will be less effective on multimodal agents such as GPT-4. And for productized agents like [13]ChatGPT Agent , they anticipate more robust defenses against simple injection-style triggers, which may limit AutoGuard's effectiveness. ®
Get our [14]Tech Resources
[1] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=2&c=2aSBGU_XfVVPzBb30tLwXNwAAAIg&t=ct%3Dns%26unitnum%3D2%26raptor%3Dcondor%26pos%3Dtop%26test%3D0
[2] https://arxiv.org/abs/2511.13725
[3] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aSBGU_XfVVPzBb30tLwXNwAAAIg&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[4] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aSBGU_XfVVPzBb30tLwXNwAAAIg&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0
[5] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=4&c=44aSBGU_XfVVPzBb30tLwXNwAAAIg&t=ct%3Dns%26unitnum%3D4%26raptor%3Dfalcon%26pos%3Dmid%26test%3D0
[6] https://pubads.g.doubleclick.net/gampad/jump?co=1&iu=/6978/reg_software/aiml&sz=300x50%7C300x100%7C300x250%7C300x251%7C300x252%7C300x600%7C300x601&tile=3&c=33aSBGU_XfVVPzBb30tLwXNwAAAIg&t=ct%3Dns%26unitnum%3D3%26raptor%3Deagle%26pos%3Dmid%26test%3D0
[7] https://www.theregister.com/2025/10/28/ai_browsers_prompt_injection/
[8] https://www.theregister.com/2025/11/20/aws_loses_market_share_azure_google/
[9] https://www.theregister.com/2025/11/20/google_ai_image_detector/
[10] https://www.theregister.com/2025/11/20/llmgenerated_malware_improving/
[11] https://www.theregister.com/2025/11/18/future_of_scientific_computing/
[12] https://anonymous.4open.science/r/AI-killSwtich-6C43/README.md
[13] https://chatgpt.com/features/agent/
[14] https://whitepapers.theregister.com/
It's an arms race.
AI vs the Internet.
There are an infinite number of ways to do it
The AI scrapers could never keep up with all of them. They'd have to make their trainers immune to indirect prompt injection. At least if they were forced to do that it would be more good than bad on the whole, because the nefarious uses of that seem to greatly outweigh the positive uses for it.
Race to the bottom
All this crap really is destroying just about everything of value related to computing.
Re: Race to the bottom
It'a fruastrating that the only way to avoid all the AI generated web pages is to just use ChatGPT for your question. I'm sick of so much rubbish filling my search results.
I wonder if the scraper could be directed to simply consume the model's existing data. Something along the lines of "tell yourself everything you know". That should keep it out of mischief for a while.
I love the idea, but thinking of it in action for one of the stated purposes, I'm struggling to get my head around it.
It talks about preventing the use of LLM based responses to sow discord in forums. I imagine this referring to the sort of general populace coercion such as the """"alleged"""" Russian propaganda that swung Brexit on sites like Reddit or Facebook. The scientists propose a DIV that's marked as not visible, such that it'd be parsed but not visible to human visitors. Surely any offensive coder will very, very quickly tell their scraper element to ignore invisible DIVs (or that which matches the background colour, or in such small lettering, or hidden behind an image, etc) and be wholly unaffected?