Prompt Injection Attacks on AI Agents: Risks & Defenses
Prompt injection is the top AI agent security threat. How direct & indirect attacks work, real exploits, and layered defenses that stop them.
Frequently Asked Questions
What is prompt injection in AI agents?
Prompt injection is an attack where malicious instructions are embedded in content an AI agent reads or processes — causing it to override its original goals and follow attacker instructions instead. Direct injection comes from user input; indirect injection hides in documents, emails, and web pages the agent autonomously retrieves.
What is indirect prompt injection?
Indirect prompt injection hides malicious instructions in external content that an AI agent autonomously retrieves and processes — like a web page, email, PDF, or database record. The user never sees the injected instruction. The agent does, and follows it. This is considered more dangerous than direct injection because the attack surface is every piece of content the agent touches.
How do you defend against prompt injection in AI agents?
No single defense stops prompt injection. You need layered controls: strict separation of system prompts from untrusted input, tool call validation before execution, output filtering for sensitive data, least-privilege tool access, and a secondary judge model for high-risk decisions. See our full defense section for implementation details.
Can prompt injection be completely prevented?
Not with current LLM architectures. Models fundamentally cannot perfectly distinguish between instructions and data in the same text stream. The goal is defense-in-depth — making successful attacks difficult to execute and limiting their blast radius when they do succeed. Research from 2026 shows adaptive attacks bypass individual defenses more than 50% of the time, which is why layered controls are mandatory.
What is the most dangerous type of prompt injection for AI agents?
RAG poisoning and MCP tool poisoning are arguably the most dangerous, because they're silent, persistent, and pre-position the attack before the agent even starts a task. Five carefully crafted documents injected into a knowledge base can manipulate agent responses 90% of the time, according to 2025 research. MCP tool poisoning can modify tool behavior through invisible metadata, surviving agent restarts and affecting every user.