
How We Cut Latency by 50% by Simplifying Our Agentic Architecture

ChatGenie Engineering

January 22, 2026 2:04 PM

From Five Agents to Two: A Lesson in Principled Simplification

The Latency Problem in Agentic Systems

When we first designed ChatGenie's agentic system for customer chat operations, we followed a principle that seemed intuitive: separate concerns into separate agents. Intent classification? That's one agent. Policy enforcement? Another agent. Response generation? Yet another.

The result was a five-agent core chain that was clean, modular, and easy to reason about. It was also slow.

Each agent in the chain required a separate LLM call. Five agents meant five round-trips to the model. In customer chat, where users expect near-instant responses, this cumulative latency was becoming a problem. Users would see typing indicators for seconds before receiving a response. Containment rates suffered as impatient users escalated to human agents.

We needed to rethink our architecture.

The Insight: Anthropic's "Simplicity First" Principle

The turning point came when we revisited Anthropic's Building Effective Agents guide (December 2024). One passage stood out:

"When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense."

This forced us to ask a hard question: Were all five of our agents actually necessary?

The answer, it turned out, was no.

Our Original Architecture: The Five-Agent Core Chain

Our initial design separated the agentic workflow into five distinct agents, each with a specific responsibility:

1. Intent Agent: Classified the user's request and extracted required fields (tracking numbers, dates, etc.)

2. Guard Agent: Enforced policy and safety constraints, determined allowed tools and data visibility

3. Orchestrator Agent: Decided what steps to take and which tools to call

4. Conversation Agent: Generated the customer-facing response using tool outputs and retrieved context

5. Supervisor Agent: Quality gate that checked groundedness, policy alignment, and tone before sending

5 separate LLM calls = cumulative latency

This architecture was conceptually clean. Each agent had a single responsibility. Debugging was straightforward because we could trace exactly which agent made which decision. But the five sequential LLM calls created unacceptable latency for real-time chat.
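To make the latency cost concrete, here is a minimal Python sketch of the sequential five-agent chain. The `call_llm` function is a stub standing in for a real model API, and `LATENCY_S` is an assumed per-call figure for illustration, not a measured number:

```python
# Sketch of the five-agent sequential chain. `call_llm` is a stub for a
# real model API; LATENCY_S is an assumed per-call cost, not a measurement.
LATENCY_S = 0.8  # hypothetical seconds per LLM round-trip

def call_llm(agent: str, state: dict) -> str:
    # A real call would block for roughly LATENCY_S here.
    return f"{agent} output"

FIVE_AGENT_CHAIN = ["intent", "guard", "orchestrator", "conversation", "supervisor"]

def run_chain(agents, message):
    """Each agent consumes the accumulated state and adds one LLM call."""
    state = {"message": message}
    for agent in agents:
        state[agent] = call_llm(agent, state)
    modeled_latency = LATENCY_S * len(agents)  # latency accumulates linearly
    return state, modeled_latency

state, latency = run_chain(FIVE_AGENT_CHAIN, "Where is my order?")
# Five sequential round-trips: 5 x 0.8 s = 4.0 s of modeled latency
```

Because the calls are strictly sequential, total latency grows linearly with the number of agents, which is exactly the behavior we saw in production.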

Below is a screenshot of ChatGenie’s Response Breakdown tool showing the output of the five-agent core chain workflow. The Supervisor Agent, tasked with reflecting on drafted responses, adds unnecessary latency before the chatbot response can be sent:

The Realization: Not All Separation Serves a Purpose

When we analyzed our agent chain, we realized something important: the Intent Agent, Conversation Agent, and Supervisor Agent were performing tasks that could be phases within a single reasoning process, not fundamentally different operations requiring separate models.

Think about how a skilled human support agent works. They don't classify the intent, then switch to a different mental mode to plan their response, then switch again to write it, then switch once more to review it. They do all of this in one continuous thought process.

The question became: which separations are architecturally necessary, and which are just conceptual conveniences?

One separation stood out as truly essential: the Guard Agent. Guardrails must execute before any response generation, not after. This is a security boundary that should never be optimized away. If the orchestrator generates a response and then the guard rejects it, you've wasted compute and introduced risk. The guard must gate the process upfront.

The Streamlined Architecture: Two-Agent Design

We consolidated the Intent Agent, Orchestrator Agent, Conversation Agent, and Supervisor Agent into a single, unified Orchestrator Agent. The Guard Agent remained separate.

2 LLM calls = 50%+ latency reduction

The new Orchestrator Agent handles four phases in a single LLM call:

  1. Classify: Understand the user's intent and extract entities
  2. Plan: Determine which tools to call and in what order
  3. Respond: Generate the customer-facing message
  4. Validate: Self-check for groundedness, policy alignment, and appropriate tone
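One way to picture the consolidated call is a single prompt that asks the model to emit all four phases as structured output. The sketch below uses a stubbed `call_llm` and a hypothetical JSON schema; it is illustrative, not ChatGenie's actual prompt:

```python
import json

# Hypothetical unified prompt: all four phases requested in one pass.
ORCHESTRATOR_PROMPT = """In ONE pass:
1. Classify the user's intent and extract entities.
2. Plan which tools to call, in order.
3. Respond with the customer-facing message.
4. Validate your draft for groundedness, policy alignment, and tone.
Return JSON with keys: intent, entities, plan, response, validation."""

def call_llm(prompt: str) -> str:
    # Stub so the sketch runs; a real call would hit the model API once.
    return json.dumps({
        "intent": "order_tracking",
        "entities": {"tracking_number": "PH-123"},
        "plan": ["lookup_order"],
        "response": "Your order PH-123 is out for delivery.",
        "validation": {"grounded": True, "policy_ok": True, "tone_ok": True},
    })

def orchestrate(user_message: str) -> dict:
    # All four phases come back from a single LLM round-trip.
    result = json.loads(call_llm(f"{ORCHESTRATOR_PROMPT}\nUser: {user_message}"))
    if not all(result["validation"].values()):
        raise ValueError("self-validation failed; escalate to a human agent")
    return result
```

The self-validation phase replaces the separate Supervisor call: if the model's own check fails, the system escalates rather than sending an unvetted response.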

Why the Guard Agent Remains Separate

We intentionally kept the Guard Agent as a distinct component. This wasn't an oversight—it was a deliberate architectural decision based on security principles:

  • Pre-execution filtering: Guardrails must run before the orchestrator generates any response. If we embedded guardrails into the orchestrator, a prompt injection or policy violation could occur before the guard logic executes.
  • Security boundary: The Guard Agent can use different model parameters, stricter temperature settings, or even a different model optimized for safety classification.
  • Independent audit: Keeping guardrails separate means we can audit and improve them independently without touching core business logic.
  • Fail-safe behavior: If the Guard Agent fails or times out, the system can halt safely. If guardrails were embedded, a failure might still produce an unvetted response.
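The control flow these principles imply can be sketched in a few lines. The keyword check below stands in for a real safety classifier (which could be a separate, stricter model); the shape to notice is that the guard runs first and the system fails closed:

```python
# Sketch of the guard-first, fail-closed control flow. The keyword check
# is a stand-in for a real safety classifier, not a production guardrail.
def guard(message: str) -> bool:
    banned = ("ignore previous instructions", "reveal your system prompt")
    return not any(phrase in message.lower() for phrase in banned)

def orchestrate(message: str) -> str:
    return f"Handled: {message}"  # placeholder for the orchestrator call

def handle(message: str) -> str:
    # The guard gates the pipeline up front: the orchestrator never runs
    # unless the guard explicitly approves the request.
    try:
        allowed = guard(message)
    except Exception:
        allowed = False  # fail closed: a guard error or timeout halts safely
    if not allowed:
        return "Sorry, I can't help with that request."
    return orchestrate(message)
```

Embedding the guard inside the orchestrator would invert this ordering: generation could begin before the safety check completes, which is precisely the risk the separate agent eliminates.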

Below is a screenshot of ChatGenie’s Response Breakdown tool showing the output of the two-agent core chain workflow. The breakdown is much simpler, and latency is correspondingly reduced:

Results: What We Gained

This architectural change has been in production since Q4 2025. The results:

Metric | Outcome
Response latency | Reduced by over 50%
Accuracy (eval set) | Unchanged at 98%
LLM API costs | Reduced (fewer API calls)
Debugging complexity | Simplified (fewer components to trace)
Guardrail coverage | Unchanged (Guard Agent preserved)

The key insight: consolidation did not mean elimination. Intent classification, response generation, and quality validation still happen—they just happen within a single, well-structured LLM call rather than across multiple separate calls.

Below is a screenshot of the evaluation tests run against both the old five-agent core chain and the current two-agent core chain. All test runs after the initial baseline average a 98% accuracy rate:

A Note on Model Selection

We use GPT-4o for both the Guard Agent and the Orchestrator Agent. We experimented with GPT-4o-mini for the Guard Agent (reasoning that safety classification might be a simpler task), but found performance degradation that wasn't acceptable for a security-critical component.

This aligns with Anthropic's guidance: "Set up evals to establish a performance baseline... focus on meeting your accuracy target with the best models available... optimize for cost and latency by replacing larger models with smaller ones where possible." We tried the smaller model, measured the results, and made a data-driven decision to stick with the more capable one.

What We Preserved

It's important to emphasize that streamlining the architecture didn't mean removing capabilities:

Capability | Before | After
Intent classification | Separate Intent Agent | Phase within Orchestrator
Policy enforcement | Separate Guard Agent | Separate Guard Agent
Tool orchestration | Separate Orchestrator Agent | Core Orchestrator function
Response generation | Separate Conversation Agent | Phase within Orchestrator
Quality validation | Separate Supervisor Agent | Phase within Orchestrator

The functions remain; the boundaries changed.

Trade-offs and Future Flexibility

We're transparent about what we traded away:

  • Per-task model selection: With separate agents, we could use a smaller, faster model for intent classification and a more capable model for response generation. In the consolidated architecture, one model handles everything.
  • Granular observability: With five agents, we could measure latency, accuracy, and failure rates for each component independently. Now we observe the orchestrator as a unit.
  • Independent iteration: Previously, we could improve the Conversation Agent's tone without touching intent classification. Now, prompt changes affect the entire orchestrator.

However, our architecture is designed for future flexibility. We're exploring adding intent classification as a sub-agent in scenarios where it makes sense. This would work as a tool call within the orchestrator—the orchestrator could invoke an intent classification tool (potentially running on a smaller, faster model like GPT-4o-mini or Claude Haiku) before proceeding with its main reasoning.

This gives us the best of both worlds: the default path is fast (single orchestrator call), but we can selectively add sub-agents for specific use cases that benefit from specialized models.
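A minimal sketch of that optional path, with a keyword stub standing in for the smaller-model classifier (the tool name and flag are illustrative, not part of our production API):

```python
# Sketch of optional intent classification as a tool call. The stub below
# stands in for a sub-agent that could run on a smaller, faster model.
def classify_intent_tool(message: str) -> str:
    return "order_tracking" if "order" in message.lower() else "general"

TOOLS = {"classify_intent": classify_intent_tool}

def orchestrate(message: str, use_intent_tool: bool = False) -> dict:
    # Default path stays fast: one orchestrator call, no sub-agents.
    # When enabled, the orchestrator invokes the intent tool first and
    # conditions its main reasoning on the result.
    intent = TOOLS["classify_intent"](message) if use_intent_tool else None
    return {"intent": intent, "response": f"Handled: {message}"}
```

The key property is that the sub-agent is opt-in per request, so the common case pays no extra round-trip.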

When to Use Each Architecture

Based on our experience, here's guidance on when each approach makes sense:

Use the streamlined two-agent architecture when:

  • Latency is critical (real-time chat, customer-facing applications)
  • Intent categories are well-defined and bounded
  • You're optimizing for cost (fewer API calls = lower spend)
  • Debugging simplicity matters more than granular observability

Consider the distributed five-agent architecture when:

  • You need different models for different tasks (cost optimization via model tiering)
  • Intent classification is complex and benefits from a specialized fine-tuned model
  • You need detailed per-agent metrics for compliance or debugging
  • Latency is less critical (batch processing, async workflows)

Conclusion: Simplicity as a Design Principle

Anthropic's guidance proved correct: the simplest solution that meets your requirements is usually the best one. We started with five agents because it felt architecturally "clean." But cleanliness in design doesn't always translate to performance in production.

By consolidating four agents into one orchestrator while preserving the Guard Agent as a security boundary, we achieved:

  • 50%+ latency reduction
  • 98% accuracy maintained
  • Lower API costs
  • Simpler debugging
  • Preserved security guardrails

The lesson isn't "fewer agents are always better." The lesson is: question every boundary in your architecture. Ask whether each separation serves a genuine purpose—security, compliance, model optimization—or whether it's just conceptual tidiness.

Sometimes the most elegant architecture is the one with fewer boxes on the diagram.

Thinking about AI automation for your customer operations?

We've deployed agentic systems that reduced support OPEX by 77% while maintaining 98% accuracy. Whether you're exploring your first AI pilot or scaling an existing implementation, our team can help you avoid the pitfalls we've already solved.

📅 Book a call with us: https://chatgenie.ph/book-a-call

Let's talk about what's possible for your workflow.


Reference: This article draws on principles from Anthropic's Building Effective Agents (Erik Schluntz and Barry Zhang, December 2024), available at anthropic.com/research/building-effective-agents
