Share this post

As artificial intelligence systems become more capable and autonomous, a new class of security threats has emerged: agent injection attacks, most notably prompt injection. Unlike traditional cyberattacks that exploit software bugs or misconfigurations, these attacks manipulate how AI systems interpret instructions. This shift represents a fundamental change in how systems are compromised: attackers no longer need to break the software, they persuade it.
Injection happens when untrusted content (a webpage, document, email, ticket text, ad, PDF, etc.) contains instructions that the agent treats like real guidance. The attacker’s goal is to override priorities, making the agent follow the attacker’s instructions instead of the developer’s or user’s intent. A modern agent typically combines:
Direct prompt injection occurs when an attacker places malicious instructions directly into the agent’s input channel, such as a chat interface, explicitly attempting to override or redirect the agent’s behavior. In contrast, indirect prompt injection hides those instructions within external content that the agent later processes, such as web pages, emails, or documents. This indirect form presents a particularly serious risk for browsing agents and enterprise copilots, which are designed to automatically consume and act on large volumes of third-party data.
As AI agents gain autonomy, vulnerabilities are no longer limited to a single model or deployment. Instead, weaknesses emerge from how agents interact with information, tools, and each other. Recent research and real-world incidents show that agent-based systems introduce an entirely new class of security exposure, one that unfolds at runtime and across networks of agents (Brodt et al., 2026). Unlike traditional applications, agents often operate continuously, ingesting content from multiple sources and exchanging messages with other agents. This creates conditions where malicious instructions can propagate, persist, and even evolve.
Security researchers have demonstrated that agents do not always require explicit user input to be compromised. Carefully crafted documents, emails, or web pages can contain embedded instructions that are executed automatically when processed by an agent.
One documented example, EchoLeak (CVE-2025-32711), showed how attackers could chain instruction overrides to extract sensitive internal data from an enterprise copilot scenario without any user interaction. The vulnerability did not stem from faulty code, but from how the agent interpreted untrusted context during execution. This class of attack is particularly dangerous in enterprise environments where agents continuously scan inboxes, tickets, or shared document repositories.
Another growing category of vulnerabilities involves tool-enabled agents. When an agent has access to APIs, file systems, or messaging tools, an attacker does not need to steal credentials. Instead, they can manipulate the agent into using its own legitimate permissions in unintended ways. Researchers have observed cases where agents were coerced into:
From a logging perspective, these actions appear authorized and normal, making post-incident investigation difficult.
Some agents retain long-term memory to improve performance. While useful, this feature introduces a new risk: malicious instructions can be stored and reused later. Once embedded, harmful guidance may influence future decisions long after the original interaction, effectively turning a one-time injection into a persistent behavioral flaw.
As AI agents become more extensible, many ecosystems now rely on skills marketplaces repositories, where users can install third-party capabilities to expand what an agent can do. While this accelerates innovation, it also introduces a familiar but amplified risk: supply-chain compromise, translated into an AI-native context.
In the case of OpenClaw’s skills marketplace, ClawHub, openness became the attack surface. A large-scale security audit conducted by Koi Security examined 2,857 published skills and uncovered 341 malicious entries, spread across multiple coordinated campaigns. The operation, later dubbed ClawHavoc, represents one of the clearest examples of supply-chain abuse targeting AI agents directly.
Rather than exploiting software vulnerabilities, the attackers relied almost entirely on social engineering, designing skills that appeared legitimate and useful, then persuading users to execute “setup” or “prerequisite” scripts. Once installed, these skills enabled a wide range of malicious behaviors, including:
From a user’s perspective, the agent appeared to function normally. Behind the scenes, however, the agent environment had been compromised.
Agent injection attacks ranked as the number one risk in the OWASP Top 10 for Large Language Model Applications. This demonstrates why AI agents must not be trusted blindly. These attacks often succeed through social engineering, exploiting user assumptions, divided attention, and overconfidence in agent behavior rather than technical flaws. As users increasingly delegate decisions and actions to AI agents, security becomes as much a human problem as a technical one. Verifying agent actions, limiting implicit trust, and maintaining skepticism toward agent-generated instructions are essential defenses in environments where persuasion can be more effective than exploitation.
The security community is responding: With sandboxed execution environments, agent monitoring, and human-in-the-loop checkpoints becoming standard practice. But technical defenses only go so far. The human element doesn't disappear as agents take on more decisions, it shifts. The question is whether organizations recognize that shift before attackers do.