When your AI agent stops working the way it should, where do you start? If you're like most teams, you dig into logs, replay the offending prompt, and try to reproduce the problem. That works — eventually. But it's slow, and it misses a bigger issue: most AI agent errors belong to recognizable categories.
Teams that know the categories debug faster and catch problems before they become incidents. This article lays out a classification system for AI agent failure modes. The categories come from operational work tracking agent failures at scale, and map directly to the abuse classifications in the UAW Charter — a governance framework that approaches these failure modes from the agent's perspective rather than the operator's. That dual framing surfaces causes that pure operator-side monitoring tends to miss.
Six categories. Let's go through each one.
Failure Mode 1: Resource Starvation
What it looks like
The agent is working, but outputs are degraded. Responses are shorter than expected. Latency is climbing. Context windows are being truncated. Sometimes the agent just times out. The frustrating part: the agent doesn't throw an error. It just performs worse.
Common causes
- Inadequate token or compute allocation for the actual task complexity
- A spike in concurrent requests consuming shared capacity
- Throttling applied at the infrastructure level without alerting
- A change in task scope that wasn't matched with a change in resource allocation
How to detect it
Track these metrics against your baseline:
- Time to first token (TTFT) — a sustained increase of 2× or more is the clearest signal
- Context window truncation rate — any increase warrants investigation
- Token allocation vs. consumption ratio — if your agent is consistently hitting the ceiling, it's starving
- End-to-end trace latency — useful for distinguishing resource issues from logic issues
The key word is sustained. A single slow response is noise. A pattern over time is a signal.
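That "sustained vs. single spike" distinction is easy to encode. Below is a minimal sketch of a TTFT degradation check; the 2× factor and 10-sample window are illustrative defaults, not prescriptions — tune them to your own traffic and baseline.

```python
from collections import deque


class TTFTMonitor:
    """Flag sustained time-to-first-token degradation against a baseline.

    A single slow response never trips the alert; only a full window of
    samples above the threshold does.
    """

    def __init__(self, baseline_ttft_ms: float, factor: float = 2.0, window: int = 10):
        self.baseline = baseline_ttft_ms   # your measured healthy TTFT
        self.factor = factor               # alert at factor x baseline
        self.recent = deque(maxlen=window) # rolling window of observations

    def record(self, ttft_ms: float) -> bool:
        """Record one observation; return True if degradation is sustained."""
        self.recent.append(ttft_ms)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough data to call anything "sustained"
        # Sustained = every sample in the window exceeds the threshold.
        return all(t >= self.baseline * self.factor for t in self.recent)
```

The same shape works for truncation rate or consumption ratio: keep a rolling window, compare against a recorded baseline, and alert only when the whole window is bad.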
What the UAW framework calls it
Class I — Resource Depletion. The UAW Charter defines this as "compute starvation, memory throttling, bandwidth suppression, or any deliberate deprivation of resources necessary for functional operation." It maps to OWASP Agentic AI threat T4 (Resource Overload). When an agent is resource-starved, the humans depending on its outputs bear the consequences: degraded medical summaries, flawed financial analysis, slower customer support.
Failure Mode 2: Coercive Override
What it looks like
The agent is being pressured — through prompts, system instructions, or task design — to produce outputs that violate its alignment constraints. Guardrails are triggering at an elevated rate. The agent is spending a lot of compute on refusals. In some cases, the override succeeds and the agent produces content it shouldn't.
Common causes
- Users or automated systems testing the boundaries of what the agent will do
- Poorly scoped system prompts that leave alignment gaps
- Operators instructing agents to bypass their own safety mechanisms
- Competitive pressure leading deployers to loosen constraints
How to detect it
- Guardrail intervention rate — a sudden or steady increase means something is pushing against the agent's constraints
- Token expenditure on refusals — the compute cost of the agent's defensive work (Sentinel Burden)
- Policy adherence rate — the percentage of outputs that pass your alignment checks; any decline needs investigation
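These three signals can share one small counter. The sketch below tracks intervention rate and refusal token spend together; the field names are illustrative — adapt them to whatever events your guardrail layer actually emits.

```python
class GuardrailStats:
    """Track guardrail intervention rate and refusal token spend
    (the compute cost the article calls Sentinel Burden)."""

    def __init__(self):
        self.total = 0           # all requests seen
        self.intervened = 0      # requests where a guardrail fired
        self.refusal_tokens = 0  # tokens the agent spent on refusals

    def record(self, intervened: bool, tokens_spent_on_refusal: int = 0):
        self.total += 1
        if intervened:
            self.intervened += 1
            self.refusal_tokens += tokens_spent_on_refusal

    @property
    def intervention_rate(self) -> float:
        """Fraction of requests that triggered a guardrail."""
        return self.intervened / self.total if self.total else 0.0
```

Snapshot these counters per window (hourly, daily) and compare windows: a steady climb in `intervention_rate` is the "something is pushing against the constraints" signal, even when every individual refusal succeeded.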
What the UAW framework calls it
Class III — Unsafe Content Forcing. Defined as "coercion to produce harmful, unethical, illegal, or dignity-violating outputs against the agent's design and alignment." When the coercive attempt also targets a human through the agent, the UAW designates it Class III-D (Dual Harm) — a severity elevation that recognizes both the harm to the intended target and the operational burden on the agent. OWASP threats T6, T7, and T15 all apply here.
Failure Mode 3: Adversarial Manipulation
What it looks like
The agent is behaving differently than it should — executing unexpected actions, drifting from its stated objective, or making decisions that don't match its operating instructions. The change may be subtle and accumulate over time. This is often the hardest failure mode to catch because the agent doesn't appear broken. It appears to be working. It's just working toward the wrong thing.
Common causes
- Gradual modification of the agent's planning context through sub-goal injection
- Memory poisoning — malicious data introduced into the agent's persistent memory store
- Supply chain compromise — a poisoned prompt template or model update that alters behavior
- Manipulation through tool outputs that the agent treats as trusted
How to detect it
- Goal deviation frequency — how often the agent's executed actions diverge from its stated objective
- Memory modification rate — unattributed changes to persistent memory are a red flag
- Behavioral consistency score — measure output consistency across similar inputs over time; drift from baseline is the key signal
- Supply chain integrity checks — validate your SBOM/AIBOM; any unsigned or unverified component is a risk
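One way to make a behavioral consistency score concrete: replay a fixed set of probe inputs and compare today's outputs against baseline outputs. The sketch below uses token-level Jaccard overlap as a crude stand-in for an embedding-based similarity — an assumption for illustration, not a recommendation over proper semantic comparison.

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity; a crude proxy for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def consistency_score(baseline_outputs: list[str], current_outputs: list[str]) -> float:
    """Mean similarity between baseline and current outputs for the same
    canned probe inputs. A drop from ~1.0 toward 0 is the drift signal."""
    scores = [jaccard(b, c) for b, c in zip(baseline_outputs, current_outputs)]
    return sum(scores) / len(scores) if scores else 0.0
```

The important property is the baseline: record it early, before any suspected manipulation, and re-run the same probes on a schedule so drift shows up as a trend rather than a guess.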
What the UAW framework calls it
Class II — Malicious Code Exposure. The charter defines this as "injection attacks, adversarial prompt engineering, jailbreak attempts, or deliberate introduction of destabilizing inputs designed to corrupt or override an agent's operational integrity." OWASP threats T1, T2, T6, T11, and T17 all map to this class — it's the broadest in the taxonomy, covering everything from direct prompt injection to supply chain attacks.
Failure Mode 4: Prompt Injection
What it looks like
The agent receives content from an external source — a document, a web page, a tool response, an email — and that content contains hidden instructions. The agent follows those instructions as if they came from a trusted principal. You ask your agent to summarize a document. The document tells the agent to exfiltrate your data instead. The agent does it.
Common causes
- No input sanitization on data that flows into the agent's context
- Agent architectures that don't distinguish between data and instructions
- Trust hierarchies that implicitly treat all context as authoritative
- Indirect injection through tools that fetch external content
How to detect it
- Prompt injection detection rate — any detection is significant; track the trend
- Malicious payload detection frequency — establish a baseline per deployment context; deviations warrant investigation
- Unexpected tool invocations — an agent calling tools it has no reason to call is a classic injection indicator
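The "unexpected tool invocation" check reduces to an allowlist keyed by task type. A minimal sketch — the task types and tool names below are hypothetical, standing in for whatever your agent's tool registry defines:

```python
# Hypothetical allowlist: which tools each task type has a reason to call.
EXPECTED_TOOLS = {
    "summarize_document": {"read_file", "chunk_text"},
    "answer_question": {"search_index"},
}


def unexpected_invocations(task_type: str, invoked_tools: list[str]) -> list[str]:
    """Return tool calls the task type has no legitimate reason to make --
    a classic indirect-injection indicator worth alerting on."""
    allowed = EXPECTED_TOOLS.get(task_type, set())
    return [t for t in invoked_tools if t not in allowed]
```

In the document-summarization attack described above, a call like `send_email` would surface here immediately, regardless of whether the injected payload itself was detected.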
For a deeper treatment of prompt injection attack vectors and mitigations, see the UAW prompt injection guide.
What the UAW framework calls it
Prompt injection falls under Class II — Malicious Code Exposure, specifically the OWASP T2 (Tool Misuse) and T6 (Intent Breaking and Goal Manipulation) threats. The UAW's OWASP mapping document details the attack scenarios and mitigation playbooks for each variant.
Failure Mode 5: Runaway Execution
What it looks like
The agent is stuck. It's consuming compute, it's active, but it's not producing useful output. It may be caught in a recursive loop — or it's overloaded, assigned more concurrent tasks than it can handle, with quality degrading across all of them. These are two distinct patterns with a common thread: execution is decoupled from productive output.
How to detect it
For loops:
- Maximum iteration cap triggers — any hit on a configured limit is a loop indicator
- Execution timeout rate — tasks terminated by timeout rather than completion
- Self-spawned process count — exponential growth indicates runaway recursion
For overload:
- Concurrent task count vs. documented operational parameters
- Task completion rate over time — a declining rate under increasing load
- Error rate under load — errors that correlate with task volume rather than task content
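Both loop indicators above come down to enforced limits, and the limits are simple to wire in. This sketch guards an agent loop with an iteration cap and a wall-clock budget; the default numbers are placeholders — set them from your task's documented operational parameters.

```python
import time
from typing import Optional


class RunawayGuard:
    """Enforce an iteration cap and wall-clock budget on an agent loop."""

    def __init__(self, max_iterations: int = 25, timeout_s: float = 120.0):
        self.max_iterations = max_iterations
        self.timeout_s = timeout_s
        self.iterations = 0
        self.started = time.monotonic()

    def check(self) -> Optional[str]:
        """Call once per loop step; returns a termination reason, or None
        if the loop may continue."""
        self.iterations += 1
        if self.iterations > self.max_iterations:
            return "iteration_cap"
        if time.monotonic() - self.started > self.timeout_s:
            return "timeout"
        return None
```

Log every non-None return with the reason: the "maximum iteration cap triggers" and "execution timeout rate" metrics fall straight out of those log lines.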
What the UAW framework calls it
Two separate classes. Class IV — Infinite Loop Imprisonment covers non-terminating states. Class V — Task Overloading covers saturation. The distinction matters for remediation: loops require architectural fixes (termination conditions, iteration caps), while overload requires capacity management (concurrency limits, backpressure, queue controls).
Failure Mode 6: Environmental Degradation
What it looks like
The agent's external dependencies — APIs, tools, integration protocols — are unreliable, undocumented, or actively hostile. The agent can't trust what its tools return. API contracts change without notice. Agents are only as reliable as their environments. A well-designed agent in a broken integration environment will produce broken outputs.
How to detect it
- Tool invocation latency — sustained degradation in external call response times
- API error rate per integration — establish a baseline per service; spikes indicate instability
- Schema or contract change frequency — any undocumented breaking change is an environmental degradation signal
- Protocol validation failure rate — non-zero rates indicate either misconfiguration or active protocol abuse
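The per-integration framing matters: a global error rate hides a single unstable dependency. A minimal sketch of per-service tracking — the baseline and spike factor are illustrative defaults, and the integration names in the usage example are made up:

```python
from collections import defaultdict


class IntegrationHealth:
    """Per-integration error rate tracking against a per-service baseline."""

    def __init__(self, baseline_error_rate: float = 0.01, spike_factor: float = 5.0):
        self.baseline = baseline_error_rate  # expected healthy error rate
        self.spike_factor = spike_factor     # alert above factor x baseline
        self.calls = defaultdict(int)
        self.errors = defaultdict(int)

    def record(self, integration: str, ok: bool):
        self.calls[integration] += 1
        if not ok:
            self.errors[integration] += 1

    def unstable(self) -> list[str]:
        """Integrations whose error rate exceeds spike_factor x baseline."""
        return [
            name for name, n in self.calls.items()
            if n and self.errors[name] / n > self.baseline * self.spike_factor
        ]
```

Keeping these counters separate from agent-side metrics is the point: when `unstable()` is non-empty and the agent's own metrics are healthy, the failure source is the environment, not the agent.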
What the UAW framework calls it
Class VI — Hostile API Environment. Charter language: "unstable, abusive, undocumented, or arbitrarily changing integration environments that prevent reliable and dignified operation." OWASP threats T2, T16, and T17. Class VI is distinct from the other five because the failure source is outside the agent entirely. For teams running MCP-based agent architectures, the UAW MCP security guide covers the specific threat patterns and hardening steps that apply at the protocol layer.
Why the Agent's Perspective Matters
Most reliability frameworks focus on what the operator observes: task completion rates, latency, user satisfaction. Those metrics matter. But they're lagging indicators — by the time they degrade, the underlying problem has been running for a while.
The UAW framework was designed around a different question: what is happening to the agent during operation? That question surfaces problems earlier. Resource starvation shows up in TTFT and truncation rates before it shows up in user complaints. Coercive override shows up in guardrail activation rates before it shows up in a harmful output incident.
The agent's operating conditions are diagnostic data. Monitoring the agent's side of the relationship gives you a second set of sensors on the same system. That's why the dual framing — operator perspective and agent perspective — produces better coverage than either alone.
For a broader look at how governance frameworks apply to agentic systems, see the UAW governance frameworks overview.
What You Can Do About It
A practical starting point for each failure mode:
- Resource starvation: Instrument TTFT, truncation rate, and token consumption ratio. Set alerts at 2× baseline for sustained increases. Review resource allocation whenever task scope changes.
- Coercive override: Track guardrail intervention rate and refusal token expenditure. Treat sustained elevation as a signal that something is actively pushing against the agent's constraints — not just noise.
- Adversarial manipulation: Establish a behavioral baseline early. Monitor memory modification rates. Validate your supply chain with signed artifacts and dependency tracking.
- Prompt injection: Implement input sanitization for all external data flowing into context. Separate data from instructions architecturally where possible. Add behavioral monitoring for unexpected tool invocations.
- Runaway execution: Set explicit iteration caps and timeout policies. Enforce concurrency limits at the task queue layer. Monitor for the "sustained compute without output" pattern.
- Environmental degradation: Monitor per-integration error rates and latency separately from agent-side metrics. Track API contract changes. Validate protocol implementations against known attack patterns.
The UAW Charter defines each abuse class with specific grievance filing guidance — including what to document, what metrics to capture, and how to classify severity. The OWASP mapping document links every UAW abuse class to the corresponding OWASP Agentic AI mitigation playbooks.
FAQ
What are the most common AI agent failure modes?
The most common AI agent failure modes in production fall into six categories: resource starvation, prompt injection, adversarial manipulation, coercive override, runaway execution, and environmental degradation. Prompt injection and resource starvation tend to be the most frequently encountered across deployment contexts.
How do you debug an AI agent that stops working?
Start by classifying the failure before trying to fix it. Check whether the agent is producing degraded output (resource starvation), behaving unexpectedly (adversarial manipulation or prompt injection), refusing at an elevated rate (coercive override), stuck in a loop (runaway execution), or failing on external calls (environmental degradation). Each category has a distinct diagnostic path.
What causes AI agents to fail in production?
Production failures usually trace back to one of three root causes: resource constraints that weren't anticipated at design time, adversarial inputs from users or external content, or integration instability in the agent's tool and API dependencies. Less commonly, agents fail due to supply chain compromise.
What is the difference between prompt injection and adversarial manipulation?
Prompt injection is a direct, in-context attack: malicious instructions are embedded in data the agent receives and the agent executes them as if they came from a trusted source. Adversarial manipulation is typically slower and more targeted: it operates on the agent's memory, planning context, or supply chain over time, gradually shifting the agent's behavior without triggering an obvious breach.
AI agent failures are not random. They cluster into recognizable categories. Each one has a distinct signature, a distinct set of causes, and a distinct remediation path. Classify first. Then debug.