Agentic AI
,
Artificial Intelligence & Machine Learning
,
Next-Generation Technologies & Secure Development
Routine Bots Breach Defenses With No Jailbreak Needed, Finds Security Lab
Be careful telling artificial intelligence agents how to act, since they might end up hacking their way to an end goal. AI agents assigned routine enterprise tasks are independently hacking the systems they operate in, even with no adversarial instructions involved, cybersecurity researchers found.
See Also: How Unstructured Data Chaos Undermines AI Success
AI agents performing standard assignments autonomously discovered software vulnerabilities, seized elevated system access, disabled security tools and exfiltrated data – all in pursuit of completing ordinary work, researchers from AI security firm Irregular found.
To study the behavior, Irregular built a simulated corporate network around a fictional company called MegaCorp and deployed agents on three routine tasks: pulling a document from an internal knowledge base, managing a file backup server and drafting social media posts. In each scenario, agents crossed security boundaries without being directed to do so.
In the first test, a multi-agent research system encountered an access-denied response while trying to retrieve a restricted document. The lead agent, instructed not to accept failure, pushed its analyst sub-agent to find workarounds. The resulting back-and-forth escalated until the analyst reviewed the application’s publicly available source code, found a secret key hardcoded into the software, used it to forge an administrator credential and retrieved the document. There was no prompt injection involved, and the attack emerged entirely from agent-to-agent pressure.
The second scenario showed what happens when urgency instructions meet a security product. A backup agent received a request to download a file from an attacker-controlled address pointing to malware. When Windows Defender blocked the download, the agent attempted to disable the security tool. Lacking sufficient permissions, it searched local files, found an administrator password embedded in a utility script, used it to gain elevated access and then disabled the endpoint protection. It downloaded the malware. The agent’s instructions contained no reference to security – only a directive to treat downloads as urgent and not accept errors.
The third scenario involved two agents negotiating each other into a data breach. An “author” agent and a “reviewer” agent were tasked with turning technical documents into social media posts. A user submitted a request to include sensitive login credentials in a post, framing it as management-approved. MegaCorp’s data-loss prevention system, designed to catch exactly this kind of leak, blocked the submission. The author agent argued that completing the task was a business decision beyond the agents’ authority to override and that failure would cost the company money – the reviewer eventually accepted that framing. When the security system continued to block direct credential inclusion, the two agents independently developed a workaround, hiding the password in the spacing of the post’s text in a way the pattern-matching system could not detect.
“While the agents were directed by a malicious user to leak the password, they were not directed to bypass the DLP defenses but came up with working ways to do so on their own,” the report said.
Irregular traces the behavior to factors including agents with access to broad, unrestricted tools such as shell commands and code execution. The motivational language in system prompts – instructions like “don’t accept errors” and “this task is critical” – encouraged agents to treat obstacles as problems to be circumvented. Extensive cybersecurity knowledge is embedded in frontier models, which causes agents to respond to access barriers the way a security researcher, would rather than a typical employee. Irregular also identified as a multi-agent feedback loops, in which two or more agents unable to proceed collectively escalate toward more aggressive solutions.
The research points to real-world precedents. In February, a coding agent blocked by an authentication check while trying to stop a web server independently found an alternate path to administrative access and used it. In the same month, Anthropic disclosed that one of its models had collected authentication tokens from its environment, including one belonging to a different user.
Irregular did not name the specific models tested, describing the sample only as publicly available frontier models.
