Why Your MCP Agents Fail in Production (And How to Fix It)

MCP Architecture Problems: Context Window Saturation and Tool Result Noise

Niv Schneiderman
Jan 6, 2026 • 10 min read

If you want LLM agents to do real automation, they need reliable access to external systems. Jira, GitHub, cloud control planes, SIEM, ticketing. The Model Context Protocol (MCP) helped by standardizing how tools wrap APIs so agents can call them.

The problem is not MCP as a protocol. The problem is the default architecture people build around it. Many implementations treat raw tool results as prompt content. No shaping, no projection, no deterministic compute. That saturates the context window, injects noise, and pushes exact questions into a probabilistic system. You get lower accuracy and higher cost.

Where MCP Tool Calling Breaks Down

Accuracy falls apart the moment you leave the toy cases. Take a basic question: "How many tasks does the platform group have in the db-migration project?"

A standard MCP agent goes through this sequence (sketched in code after the list):

  1. The model selects a tool that wraps a simple API like /list_issues
  2. The runtime calls /list_issues
    Body: {query: "project=db-migration AND group=platform"}
  3. The MCP server returns every issue meeting this filter
  4. The entire dataset gets injected into the model's context, even though none of the issue fields is relevant to the question
  5. The model tries to count the items
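
Here is a minimal sketch of that sequence in Python. The mcp_client and llm objects are hypothetical stand-ins for whatever SDKs you actually use; the point is that the raw tool result is serialized straight into the prompt and the model is left to do the counting.

import json

# Hypothetical clients standing in for your MCP SDK and LLM SDK.
from my_agent_runtime import llm, mcp_client  # assumption: not a real package

QUESTION = "How many tasks does the platform group have in the db-migration project?"

# Steps 1-3: the model picks the tool, the runtime calls it, and the server
# returns every matching issue with every field.
issues = mcp_client.call_tool(
    "list_issues",
    {"query": "project=db-migration AND group=platform"},
)

# Step 4: the entire payload becomes a tool result in the context window.
prompt = (
    f"{QUESTION}\n\n"
    f"Tool result from list_issues:\n{json.dumps(issues)}"  # hundreds of issues, all fields
)

# Step 5: the model is asked to count items it can only approximate.
answer = llm.complete(prompt)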

The outcome is predictable:

  • Computational errors. The count is wrong because LLMs do not perform exact computation. This is the same flaw behind the strawberry problem.
  • Reduced accuracy from context noise. The context is now bloated with irrelevant data, and as context size increases, accuracy decreases.
  • Increased cost. Large payloads sent back to the model as tool results spike token usage.

Beyond Tool Calling: Executable Contracts

To move past these limitations, the agent architecture must shift filtering and aggregation work to a secure, deterministic runtime environment. The key insight is that LLMs are far more reliable at generating code than at performing that computation themselves at inference time.

Since executing LLM-generated code is risky and error-prone, a more realistic and reliable approach is to have the model generate a workflow definition that a workflow engine can execute:

automation_type: on_demand
inputs:
  project:
    type: text
  group:
    type: text
steps:
  - action: jira.ListIssuesV2
    id: S1
    name: Get Jira Issues for Team and Sprint
    inputs:
      jql: project = "{{inputs.project}} AND project = {{inputs.group}}"
      fields: id
      maxResults: 1000
  - action: internal.SetVariables
    id: S2
    name: Set Issue Count Variable
    inputs:
      Variables:
        - Name: issue_count
          Type: Number
          Value: "{{len(steps.S1.output.issues)}}"
outputs:
  issue_count: "{{variables.issue_count}}"


The model is not writing a Python script that calls external APIs or MCP servers. It is composing pre-built actions (jira.ListIssuesV2, internal.SetVariables) with validated inputs. The workflow engine handles everything the LLM should never touch: authentication, pagination, retries, error handling, and state management.
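
To make that concrete, here is a rough sketch of what one catalog entry could look like on the engine side. The register_action decorator, the jira_session connection, and the endpoint details are illustrative assumptions, not a specific product's API; what matters is that authentication, pagination, and retries live in trusted code the model never writes or sees.

# Illustrative engine-side catalog entry; register_action and jira_session are
# assumed helpers, not a real SDK. The model only ever sees the action name and
# its input schema, never this code or its credentials.
from engine.catalog import register_action        # assumption
from engine.connections import jira_session       # assumption: authenticated HTTP session

@register_action(
    name="jira.ListIssuesV2",
    inputs={"jql": "string", "fields": "string", "maxResults": "integer"},
    outputs={"issues": "list"},
    side_effect="read",
)
def list_issues_v2(jql: str, fields: str, maxResults: int = 1000) -> dict:
    # Pagination handled in deterministic code, not by the LLM.
    issues, start = [], 0
    while True:
        page = jira_session.get(
            "/rest/api/2/search",
            params={"jql": jql, "fields": fields, "startAt": start, "maxResults": maxResults},
        ).json()
        issues.extend(page["issues"])
        start += len(page["issues"])
        if not page["issues"] or start >= page["total"]:
            break
    return {"issues": issues}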

The resulting flow becomes:

  1. Workflow generation. The model emits a structured workflow YAML using actions from the catalog.
  2. Schema validation. The runtime validates the workflow against the action schemas.
  3. Deterministic execution. The workflow engine executes each step using trusted, pre-built code. The issue count is computed in code, not by the LLM (see the sketch after this list).
  4. Distilled result. The model receives only the final output (e.g., {"issue_count": 42}).
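
Condensed into code, steps 2 through 4 could look roughly like this. It assumes a catalog that maps action names to objects with validate() and run() methods (like the entry sketched earlier), resolves {{...}} templates with a deliberately restricted eval for brevity (a real engine would ship a proper expression language), and omits retries, approvals, and error handling.

import re
from types import SimpleNamespace

_SAFE_FUNCS = {"__builtins__": {}, "len": len}     # len() is the only function this sketch allows

def _ns(obj):
    # Wrap dicts so templates can use dotted paths like steps.S1.output.issues.
    return SimpleNamespace(**{k: _ns(v) for k, v in obj.items()}) if isinstance(obj, dict) else obj

def render(value, state):
    # Resolve {{expr}} templates against inputs, prior step outputs, and variables.
    if isinstance(value, dict):
        return {k: render(v, state) for k, v in value.items()}
    if isinstance(value, list):
        return [render(v, state) for v in value]
    text = str(value)
    if "{{" not in text:
        return value
    scope = {k: _ns(v) for k, v in state.items()}
    whole = re.fullmatch(r"\{\{(.*)\}\}", text.strip())
    if whole:                                      # whole value is one expression: keep its type
        return eval(whole.group(1), _SAFE_FUNCS, scope)
    return re.sub(r"\{\{(.*?)\}\}",                # templates embedded in strings, e.g. the JQL
                  lambda m: str(eval(m.group(1), _SAFE_FUNCS, scope)), text)

def execute_workflow(workflow, input_values, catalog):
    state = {"inputs": input_values, "steps": {}, "variables": {}}
    for step in workflow["steps"]:
        action = catalog[step["action"]]           # unknown action names fail here
        inputs = render(step["inputs"], state)
        action.validate(inputs)                    # 2. schema validation
        output = action.run(inputs)                # 3. deterministic execution in trusted code
        state["steps"][step["id"]] = {"output": output}
        # Sketch convention: actions like internal.SetVariables publish variables this way.
        state["variables"].update(output.get("variables", {}))
    # 4. Distilled result: all the model ever sees, e.g. {"issue_count": 42}
    return render(workflow.get("outputs", {}), state)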

This approach works because generating structured YAML with known action names and input schemas is a much simpler task for an LLM than writing correct Python that handles authentication, state management, and error cases. The model stays in its strength zone (structured generation), while the workflow engine stays in its strength zone (reliable execution).

This shift is already being adopted. Cloudflare introduced a feature to their agent SDK called code mode, which fetches an MCP server's schema, converts it into a TypeScript API, and executes the generated code in a secure V8 isolate sandbox. Anthropic is moving the same direction with their code execution tool (currently in public beta) and their newer programmatic tool calling feature, which lets Claude write Python scripts that orchestrate entire workflows in a sandboxed environment rather than returning each tool result to the model. Both validate the same architectural shift: use secure code execution to handle deterministic computation and filtering, letting the LLM focus on reasoning and synthesis.

Code Mode (source: https://blog.cloudflare.com/code-mode/)

Code Execution Is Not the Full Answer

Letting agents write their own code introduces two hard problems.

Security and platform maturity. Running model-generated code is not free of risk. You need a real sandbox, hard resource limits, and tight monitoring. Anthropic spells this out directly in their docs. If you do not have a secure execution layer, you cannot safely run anything the model produces. This becomes operational overhead and a real attack surface.
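
For illustration only, here is the kind of hard resource limiting that has to sit underneath model-generated code even before you get to real isolation. This is a Linux-only sketch using OS-level limits, not a sandbox; a production setup still needs an actual isolation layer plus monitoring.

# Linux-only sketch of hard resource limits around untrusted, model-generated
# code. This is NOT a sandbox by itself: it caps CPU, memory, and open files,
# but a production setup still needs a real isolation layer and monitoring.
import resource
import subprocess

def run_untrusted(path: str, wall_clock_s: int = 10) -> subprocess.CompletedProcess:
    def set_limits():
        resource.setrlimit(resource.RLIMIT_CPU, (5, 5))                      # 5 s of CPU time
        resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))   # 256 MB address space
        resource.setrlimit(resource.RLIMIT_NOFILE, (16, 16))                 # at most 16 open files

    return subprocess.run(
        ["python3", "-I", path],        # -I: isolated mode, ignores user site and most env vars
        preexec_fn=set_limits,          # applied in the child before exec
        capture_output=True,
        text=True,
        timeout=wall_clock_s,           # wall-clock cap enforced by the parent
    )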

Code accuracy is still weak. LLMs get good results on synthetic coding tasks because those problems are tiny and self-contained. A recent study from Concordia University shows what happens when tasks resemble real-world software development. Models score 84 to 89 percent on toy benchmarks, then drop to 25 to 34 percent when generating class-level code from real projects. The failure rate is the same on familiar and unfamiliar codebases. Most issues come from broken attribute access, type mistakes, and wrong assumptions about the surrounding system. Once the task resembles actual software, reliability falls apart.

The implication is clear: if your agent architecture depends on the LLM generating correct Python or TypeScript to call APIs, handle pagination, manage auth tokens, and process responses, you are building on a foundation that fails 65 to 75 percent of the time on real-world tasks. That is not a production-ready system.

This is precisely why generating workflow YAML from an action catalog is more reliable than generating arbitrary code. The model does not need to know how to authenticate to Jira, handle pagination, or manage retries. It only needs to select the right action and fill in the parameters. You are shrinking the problem to something LLMs can actually do reliably.

The Production Framework: Action Catalogs and Auditable Workflows

Full code generation is where LLMs break, and the fix is to shrink the problem. Instead of letting the model produce open-ended code, you constrain it to structured building blocks that a workflow engine already knows how to execute safely.

The core solution is an action catalog: a library of pre-built, validated actions, each with a defined schema for inputs and outputs. The model assembles them and fills in parameters. The engine handles everything messy: authentication, error handling, retries, side effects, and integration behavior. It is far easier for the model to compose validated actions than to synthesize new logic from scratch.
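
What the model is given is essentially a menu, not an SDK: action names, descriptions, and input/output schemas, with the handlers and credentials kept on the engine side. A rough sketch of that projection follows, with illustrative schema shapes; jira.CreateIssue is an assumed write action added for contrast.

# What the model sees when composing a workflow: names, descriptions, and
# schemas only. The structure and the jira.CreateIssue entry are illustrative.
CATALOG_FOR_PROMPT = [
    {
        "action": "jira.ListIssuesV2",
        "description": "List Jira issues matching a JQL query.",
        "inputs": {"jql": "string", "fields": "string", "maxResults": "integer"},
        "outputs": {"issues": "list"},
        "side_effect": "read",         # safe to auto-execute
    },
    {
        "action": "internal.SetVariables",
        "description": "Compute and store workflow variables from prior step outputs.",
        "inputs": {"Variables": "list"},
        "outputs": {"variables": "object"},
        "side_effect": "read",
    },
    {
        "action": "jira.CreateIssue",  # assumed write action, shown for contrast
        "description": "Create a Jira issue.",
        "inputs": {"project": "string", "summary": "string", "description": "string"},
        "outputs": {"key": "string"},
        "side_effect": "write",        # routed through an approval gate
    },
]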

The workflow engine ensures governance by implementing:

  • Policy checks. All actions are classified by their side effects (read vs. write). Read and data-manipulation actions, such as a Python script with no external connections, can be auto-executed. Write actions like create_jira_ticket automatically trigger approval requirements (sketched after this list).
  • Human review gates. Critical write actions require a human to approve the exact operation and parameters before execution, often through a clear no-code UI.
  • Auditable execution. Every action is logged and persisted in the customer’s data store.
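
A minimal sketch of that gate, reusing the side_effect labels from the catalog sketches above; the approval and audit helpers are illustrative stand-ins for a real review UI and log store.

# Minimal policy-gate sketch. request_human_approval and audit_log are assumed
# helpers standing in for a real review UI and audit store.
from engine.approvals import request_human_approval   # assumption
from engine.audit import audit_log                     # assumption

def run_with_policy(step, action, inputs):
    if action.side_effect == "read":
        result = action.run(inputs)                    # e.g. jira.ListIssuesV2: auto-executed
    else:
        # Write actions (e.g. create_jira_ticket) wait for a human to approve the
        # exact operation and parameters before anything is committed.
        approval = request_human_approval(action=step["action"], parameters=inputs)
        if not approval.granted:
            raise PermissionError(f"{step['action']} rejected by {approval.reviewer}")
        result = action.run(inputs)

    # Every execution is logged and persisted in the customer's data store.
    audit_log.record(step=step["id"], action=step["action"], inputs=inputs, result=result)
    return result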

This framework transforms risky agent autonomy into a governed system where logic is composed by the model but executed, secured, and validated by the platform.

What This Looks Like in Practice

At BlinkOps, we built the platform around these exact constraints. Agents reason through problems, but their actions execute via deterministic workflows. Agents do not interact with raw APIs. They compose operations from a library of pre-built actions that are auditable, authenticated, and operate within defined policy guardrails. The visual workflow editor provides the mandatory review step, translating any agent-generated logic into a transparent workflow before high-risk actions are committed.

Agentic platforms that are secure by design are the only ones capable of delivering agents that enterprise security teams will actually approve. The combination of a deterministic workflow engine, human-in-the-loop controls, and an architecture where agents hold no credentials turns agent autonomy from a risky idea into something operational teams can trust in production.

Learn more about Agentic Automation

See how BlinkOps brings secure agentic automation to your SOC.
