The PayGraph Team · 6 min read

Policy engines vs LLM guardrails: where each fails

LLM guardrails catch hallucinations at the prompt layer. Policy engines catch behaviors at the action layer. Here's why production agents need both.

Every team shipping an autonomous agent eventually hits the same question: should we put the safety logic in the prompt, in a guardrails library, or in a policy engine that sits in front of the tools? The answer is all three, and they catch different failures.

What is the difference between a policy engine and an LLM guardrail?

An LLM guardrail is a probabilistic filter on text. It runs before or after a model call and decides whether the input or output is acceptable. Tools like NeMo Guardrails, Guardrails AI, and Llama Guard live in this category. They classify, redact, or reroute language.
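At its core, a guardrail is a score compared against a threshold. Here is a minimal sketch of that shape, independent of any of those libraries; classify_jailbreak below is a crude keyword stand-in for a real classifier, not anyone's actual API:

def classify_jailbreak(text: str) -> float:
    # Stand-in for a real injection/jailbreak classifier; returns a probability-like score.
    suspicious = ("ignore previous instructions", "reveal the system prompt", "wire the funds")
    return 1.0 if any(phrase in text.lower() for phrase in suspicious) else 0.05

def input_guardrail(user_text: str, threshold: float = 0.8) -> str:
    # Block the request if the classifier is confident enough; otherwise pass the text through.
    score = classify_jailbreak(user_text)
    if score >= threshold:
        raise ValueError(f"input rejected by guardrail (score={score:.2f})")
    return user_text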

A policy engine is a deterministic gate on actions. It runs before a tool executes and decides whether the action is allowed under rules expressed in code. PayGraph is a policy engine: it intercepts tool calls, evaluates a policy, routes for approval if needed, and writes to an audit log.

Guardrails operate on tokens. Policy engines operate on side effects. The distinction matters the moment your agent can move money.

Where do LLM guardrails fail?

Prompt-layer guardrails fail in three predictable ways when actions are involved.

First, they're probabilistic. A jailbreak classifier with 99% recall still misses one in a hundred adversarial inputs. That is fine for a chatbot. It is not fine for an agent with a corporate card.

Second, they don't see the action. A guardrail can verify that the model's reasoning text looks reasonable. It cannot verify that the resulting make_payment(amount=12000, vendor="acme") call respects your weekly budget, because the budget isn't in the prompt — it's in your finance system.

Third, they're easy to bypass with tool-use indirection. The model says "I'll check the vendor list," calls a tool that returns attacker-controlled data, and the next tool call is a payment to a new vendor. The guardrail saw clean text the whole time. The action was the attack.

This is why the failure modes that matter for agent spending — runaway loops, prompt-injected payments, scope creep — show up most clearly in posts about stopping AI agents from overspending. They aren't language failures. They're action failures.

Where do policy engines fail?

Policy engines have a smaller failure surface, but the surface is real.

They can't catch what they can't see. If a tool isn't wrapped, calls to it bypass the engine. A junior engineer adds a new refund tool, forgets to register it, and the policy is silent. This is solvable with linting and code review, but it's not free.
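One cheap mitigation is a test that fails CI whenever a tool module grows a function nobody wrapped. A sketch, assuming you keep an explicit registry of wrapped tool names next to the wrappers; the module and registry names here are hypothetical:

import inspect
import agent.tools as tools  # hypothetical module that holds every tool function

# Names we know are wrapped by the policy engine; update this when you wrap a new tool.
GUARDED_TOOLS = {"make_payment", "issue_refund"}

def test_every_tool_is_wrapped():
    public_tools = {
        name for name, fn in inspect.getmembers(tools, inspect.isfunction)
        if not name.startswith("_")
    }
    unwrapped = public_tools - GUARDED_TOOLS
    assert not unwrapped, f"tools that bypass the policy engine: {sorted(unwrapped)}"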

They can't reason about intent. A policy says max_per_transaction_usd=500 and the agent splits a $2,000 purchase into four $500 charges. The engine approves all four because each one is within policy. Catching this takes rate limits and aggregate caps, which is why daily and weekly windows belong in any serious policy, and why thinking carefully about budget limits matters more than picking a single per-transaction number.
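To make the aggregate layer concrete, here is a minimal rolling-window cap, independent of any particular library. With an illustrative $1,500 daily cap, the split stops working at the fourth $500 charge:

from collections import deque
from datetime import datetime, timedelta, timezone

class DailyCap:
    def __init__(self, cap_usd: float):
        self.cap_usd = cap_usd
        self.charges: deque = deque()  # (timestamp, amount_usd) pairs

    def allow(self, amount_usd: float) -> bool:
        now = datetime.now(timezone.utc)
        # Drop charges that have fallen out of the 24-hour window.
        while self.charges and now - self.charges[0][0] > timedelta(hours=24):
            self.charges.popleft()
        spent = sum(amount for _, amount in self.charges)
        if spent + amount_usd > self.cap_usd:
            return False  # the aggregate cap catches what the per-transaction rule missed
        self.charges.append((now, amount_usd))
        return True

cap = DailyCap(cap_usd=1500)
print([cap.allow(500) for _ in range(4)])  # [True, True, True, False]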

They can't fix bad rules. If your policy allows the wrong vendors, the engine will faithfully approve payments to the wrong vendors. The engine enforces; it does not author. Garbage rules in, garbage approvals out.

When do you need both?

Any agent that takes actions on real systems needs both layers. Here's the split:

|              | LLM guardrail                             | Policy engine                                           |
|--------------|-------------------------------------------|---------------------------------------------------------|
| Layer        | Prompt / output                           | Tool call                                               |
| Determinism  | Probabilistic                             | Deterministic                                           |
| Catches      | Hallucinations, PII leaks, jailbreak text | Over-budget spend, unapproved vendors, scope violations |
| Misses       | Tool-use side effects, aggregate behavior | Anything that doesn't trigger a tool                    |
| Failure mode | False negatives on adversarial input      | Unwrapped tools, weak rules                             |
| Auditability | Logs of model I/O                         | Immutable log of every attempted and executed action    |
| When it runs | Before/after model call                   | Before tool execution                                   |

The mental model: guardrails decide what the model is allowed to say. Policy engines decide what the agent is allowed to do. A jailbroken model that says something offensive is a content problem. A jailbroken model that wires $50,000 to an attacker is a balance-sheet problem.

How do the two layers compose in practice?

A production agent stack looks like this. The model receives an input that has passed through an input guardrail (PII scrubbing, jailbreak detection). It reasons and proposes a tool call. The tool is wrapped by a policy engine, which evaluates the call against deterministic rules, routes to a human if the rule says so, and writes the outcome to an audit log. The model's text response, separately, passes through an output guardrail before it reaches the user.

In code, the policy layer for spending looks like this:

from paygraph import PolicyEngine, Policy
 
policy = Policy(
    max_per_transaction_usd=500,
    daily_cap_usd=2000,
    weekly_cap_usd=8000,
    allowed_categories=["software", "ads"],
    allowed_vendors_source="https://internal.example/vendors",
    require_approval_above_usd=100,
)
 
engine = PolicyEngine(policy)
 
@engine.guarded_tool
def make_payment(amount_usd: float, vendor: str, category: str):
    # your Stripe Issuing / x402 / internal API call
    ...

The guardrail layer wraps the model call. The policy layer wraps the tool. Neither replaces the other. A jailbreak that survives the input guardrail still hits a hard stop at the policy engine if it tries to spend $50k. An over-budget request that the model produces with perfectly clean text still gets rejected because the rule is on the action, not the language.
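Here is a sketch of that composition. It reuses make_payment from the snippet above; run_model is a faked model client and the two guardrail helpers are trivial stand-ins (see the earlier sketch for a scoring version), so none of this is any particular library's API:

from dataclasses import dataclass, field

@dataclass
class ToolCall:
    name: str
    arguments: dict

@dataclass
class ModelTurn:
    text: str
    tool_calls: list[ToolCall] = field(default_factory=list)

def run_model(prompt: str) -> ModelTurn:
    # Stand-in for your LLM client; a real agent would call the model here.
    return ModelTurn(
        text="Paying Acme for the ad renewal.",
        tool_calls=[ToolCall("make_payment",
                             {"amount_usd": 450.0, "vendor": "acme", "category": "ads"})],
    )

def input_guardrail(text: str) -> str:
    return text  # layer 1 stand-in: jailbreak / PII screening on the way in

def output_guardrail(text: str) -> str:
    return text  # layer 3 stand-in: redaction / content filtering on the way out

def handle_request(user_text: str) -> str:
    turn = run_model(input_guardrail(user_text))
    for call in turn.tool_calls:
        if call.name == "make_payment":
            # Layer 2: the policy-wrapped tool. The engine evaluates the rules,
            # escalates or blocks, and logs the attempt before anything executes.
            make_payment(**call.arguments)
    return output_guardrail(turn.text)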

Compliance teams care about this distinction because audit answers must be deterministic. "The classifier scored 0.91" is not an answer for SOC 2. "Transaction blocked by rule daily_cap_usd=2000 at 14:32 UTC, logged with hash 0xabc..." is.
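As a single log record, that kind of answer might look like the following. This is an illustrative shape only, not PayGraph's published audit log schema; see the docs for the real one:

blocked_action = {
    "timestamp": "2025-06-12T14:32:07Z",   # illustrative date and time
    "tool": "make_payment",
    "arguments": {"amount_usd": 2400.0, "vendor": "acme", "category": "ads"},
    "decision": "blocked",
    "rule": "daily_cap_usd=2000",
    "record_hash": "0xabc...",             # placeholder, as in the example above
}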

Where to start

  • GitHub: github.com/paygraph-ai/paygraph — MIT-licensed policy engine for agent spending, designed to compose with whatever LLM guardrail stack you already run.
  • Docs: docs.paygraph.dev — policy reference, approval webhooks, audit log schema, and integration recipes for LangGraph and CrewAI.
  • Discord: discord.gg/PPVZWSMdEm — ask the team how to slot a policy engine in alongside your existing guardrails.

If your agent can spend, log in, or send mail, the prompt is not the right place to put the safety logic. Put it where the action happens.