Version 1.0 | Streamlined Architecture
TL;DR: The Five Pillars of This Baseline
- Risk-Tiered Decision Boundaries: Categorize every intent as Low, Medium, or High risk. Automate low-risk fully, gate medium-risk with verification, escalate high-risk to humans.
- Two-Agent Streamlined Core Chain: Guard Agent (policy enforcement) → Orchestrator Agent (classify + plan + respond + validate). Optimized for enterprise latency requirements.
- Read-First, Write-Later Tool Philosophy: Start with read-only tools in Phase 1. Introduce write actions only in Phase 3, with verification gates and audit trails.
- Evals Before Launch: Build a golden dataset, define accuracy thresholds, and get sign-off metrics before going live. No evals = no production.
- Human Escalation as a Feature, Not a Failure: Escalation triggers are configurable guardrails that protect the business and the customer. Design for graceful handoffs.
Introduction
This document proposes a baseline standard for the minimum requirements of an end-to-end agentic system for customer chat operations. It is not intended as a complete specification, but as a starting point for teams building production-grade agentic support systems.
We use a parcel delivery platform as the reference implementation throughout this document. This domain is ideal for illustrating agentic principles because it involves high volume, time-sensitive inquiries, repetitive intents with edge cases, and a mix of read-only and write operations with varying risk levels.
This baseline draws on principles from established industry guides, including Anthropic's Building Effective Agents and OpenAI's A Practical Guide to Building Agents. We adapt these general frameworks specifically for customer chat operations in enterprise environments, where safety, compliance, latency, and operational predictability are paramount.
Understanding AI Agents vs. Workflows
Before diving into the baseline, it's important to clarify the distinction between AI agents and workflows. These terms are often used interchangeably, but they represent fundamentally different architectural approaches.
This baseline uses a hybrid approach: a workflow pattern (sequential, deterministic orchestration) combined with agentic capabilities (LLM-powered reasoning, dynamic tool usage within bounded phases). This gives us the predictability enterprises require while retaining the flexibility that makes AI valuable.
We chose this pattern because customer chat operations demand predictability (every inquiry follows the same processing path), auditability (we can trace exactly what happened), and latency control (fixed chains allow for optimized execution).
When Agentic Systems Are Needed (and When They're Not)
Building an agentic system is not always the right answer. Anthropic's Building Effective Agents guide emphasizes this point:
"When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense."
Before investing in agentic architecture, validate that your use case genuinely requires it.
This baseline is appropriate when:
- High volume of repetitive intents: You handle thousands of similar inquiries (status checks, FAQs, scheduling) where automation delivers clear ROI.
- Complex decision-making with nuanced judgment: Workflows involve exceptions, context-sensitive decisions, or cases where traditional if-then rules have become unwieldy.
- Unstructured data interpretation: You need to extract meaning from natural language, images, or documents rather than structured form inputs.
- Backend integration requirements: Resolving inquiries requires looking up data from multiple systems (order management, CRM, logistics) in real-time.
This baseline may be overkill when:
- Simple FAQ retrieval: If your support is purely informational with no backend lookups, a basic RAG system or static chatbot may suffice.
- Rigid, well-defined rules with no exceptions: If every case follows a deterministic flowchart, a rule-based system will be faster, cheaper, and more predictable.
- Low volume or high-stakes-only interactions: If you handle <100 inquiries/day or every case requires human judgment, the complexity of an agentic system may not pay off.
- No tolerance for probabilistic behavior: If your domain requires 100% deterministic responses (certain regulatory compliance scenarios), LLM-based agents introduce unacceptable risk.
Scope and Assumptions
In Scope:
- Multi-turn conversational chat support
- Intent classification and routing
- Safe, grounded response generation
- Retrieval of policy and FAQ knowledge
- Human escalation pathways
- Traceability and evaluation
Out of Scope:
- Full autonomous resolution for high-liability issues
- End-to-end identity verification across all channels
- Voice support
- Email SLA workflows
Reference Use Case: Parcel Delivery Platform
Personas served:
- Sender/shipper
- Recipient
- Courier/rider
- Merchant support staff
Risk Tiers and Decision Boundaries
Every customer intent maps to a risk tier that determines how much automation is permitted.
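To make the tiering concrete, here is a minimal sketch of an intent-to-tier mapping for the parcel delivery domain. The intent names and the default-to-high rule are illustrative assumptions, not a prescribed taxonomy:

```python
from enum import Enum

class RiskTier(Enum):
    LOW = "low"        # fully automated
    MEDIUM = "medium"  # automated, but gated with verification
    HIGH = "high"      # escalated to a human

# Hypothetical intent -> tier mapping; adapt to your own intent catalog.
INTENT_RISK = {
    "TRACK_STATUS": RiskTier.LOW,
    "FAQ_DELIVERY_WINDOWS": RiskTier.LOW,
    "RESCHEDULE_DELIVERY": RiskTier.MEDIUM,
    "CHANGE_ADDRESS": RiskTier.MEDIUM,
    "FILE_CLAIM": RiskTier.HIGH,
    "FRAUD_REPORT": RiskTier.HIGH,
}

def tier_for(intent: str) -> RiskTier:
    # Unknown intents default to HIGH so they escalate rather than automate.
    return INTENT_RISK.get(intent, RiskTier.HIGH)
```

The fail-closed default matters: an intent the system has never seen should never be silently automated.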
Human Escalation Triggers:
- Intent confidence below threshold
- Retrieval confidence low or no relevant KB match
- Sensitive keywords detected (legal, threat, fraud)
- Repeated failures or user frustration
- Tool or API failures
- Negative sentiment spike
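The triggers above can be expressed as one configurable predicate. This is a sketch only; the field names, thresholds, and keyword list are assumptions you would tune per deployment:

```python
def should_escalate(turn: dict,
                    intent_conf_threshold: float = 0.75,
                    retrieval_threshold: float = 0.5,
                    max_failures: int = 2) -> bool:
    """Return True if any configured escalation trigger fires."""
    sensitive = {"legal", "lawsuit", "threat", "fraud"}
    if turn["intent_confidence"] < intent_conf_threshold:
        return True                                   # low intent confidence
    if turn["retrieval_confidence"] < retrieval_threshold:
        return True                                   # weak or missing KB match
    if sensitive & set(turn["message"].lower().split()):
        return True                                   # sensitive keywords
    if turn["consecutive_failures"] >= max_failures:
        return True                                   # repeated failures
    if turn["tool_error"]:
        return True                                   # tool/API failure
    if turn["sentiment"] <= -0.5:
        return True                                   # negative sentiment spike
    return False
```

Keeping the triggers in one function makes them auditable and easy to tune without touching the rest of the chain.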
The Streamlined Core Chain: Two-Agent Architecture
Based on enterprise latency requirements and the principle of finding the simplest solution possible, we have streamlined our core agentic workflow from five agents to two. This consolidation reduced latency by over 50% while maintaining 98% accuracy.
The key insight: not all agent boundaries serve an architectural purpose. Intent classification, response generation, and quality validation can be phases within a single reasoning process rather than separate agents requiring separate LLM calls. However, the Guard Agent boundary is architecturally essential—guardrails must execute before any response generation, not after.
1. Guard Agent (Pre-Execution Security Gate)
Purpose: Enforce policy and safety constraints before any response generation.
Why separate: Guardrails must run before the orchestrator generates any response. If guardrails were embedded, a prompt injection could execute before safety checks run. This is a security boundary that should never be optimized away.
Outputs:
- Allowed tools for this request
- Data visibility rules (PII masking)
- Escalation rules (if any)
- Pass/fail decision
2. Orchestrator Agent (Unified Processing)
Purpose: Handle intent classification, planning, response generation, and quality validation in a single, well-structured LLM call.
Four phases within one call:
- Classify: Understand the user's intent, extract entities (tracking numbers, dates, addresses), assign risk tier
- Plan: Determine which tools to call and in what order based on Guard Agent permissions
- Respond: Generate the customer-facing message using tool outputs and retrieved context
- Validate: Self-check for groundedness, policy alignment, completeness, and appropriate tone
Outputs: Final response to customer OR escalation decision OR request for clarification
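One way to keep all four phases in a single LLM call is to request one structured response covering them all. The JSON shape below is a hypothetical schema, not a prescribed format; the gate function shows how the self-validation phase can block sending:

```python
import json

# Hypothetical structured output the single orchestrator call is asked
# to emit, covering classify, plan, respond, and validate in one pass.
ORCHESTRATOR_OUTPUT = json.loads("""
{
  "classify": {"intent": "TRACK_STATUS", "risk": "LOW",
               "entities": {"tracking_number": "ABC123"},
               "confidence": 0.95},
  "plan":     {"tool_calls": ["get_shipment_status", "get_eta"]},
  "respond":  {"message": "Your parcel is in transit..."},
  "validate": {"grounded": true, "policy_ok": true, "tone_ok": true}
}
""")

def is_sendable(out: dict) -> bool:
    # Only deliver the response if every self-check passed; otherwise
    # fall through to escalation or a clarification request.
    v = out["validate"]
    return all([v["grounded"], v["policy_ok"], v["tone_ok"]])
```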
Why Two Agents, Not Five (or One)
We originally designed a five-agent chain: Intent Agent, Guard Agent, Orchestrator Agent, Conversation Agent, and Supervisor Agent. This was conceptually clean but introduced cumulative latency that exceeded enterprise requirements.
Following Anthropic's guidance to "find the simplest solution possible," we analyzed which boundaries were architecturally necessary versus merely organizational:
The Guard Agent remains separate because:
- Pre-execution filtering: Must run before response generation, not after
- Security boundary: Can use different model parameters optimized for safety
- Independent audit: Can be tested and improved without touching business logic
- Fail-safe behavior: If Guard fails, system halts safely
Tooling and Action Surface Area
Tools are how agents interact with the outside world. We categorize tools by risk level and gate access based on the Guard Agent's permissions.
Tool Design Principles:
- Less is more: Agents struggle with many overlapping tools. Start with 5-8 well-defined tools.
- Clear naming: Use descriptive names like get_shipment_status, not fetch_data.
- Namespace when scaling: Group by domain: shipment_get_status, shipment_reschedule.
- Test before connecting: Many agent failures are actually tool failures in disguise.
- Safety requirements: Least privilege, idempotency, timeouts with safe fallbacks, full audit trail.
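Putting these principles together, a tool registry might look like the sketch below. The tool names follow the namespacing convention from the list above; the metadata fields (risk level, timeout, fallback) are assumptions about what a registry should carry:

```python
# Hypothetical tool registry: few tools, descriptive namespaced names,
# per-tool risk level, timeout, and a safe fallback.
TOOLS = {
    "shipment_get_status": {
        "description": "Look up current status for a tracking number.",
        "risk": "read",
        "timeout_s": 3,
        "fallback": "Status temporarily unavailable; escalate if repeated.",
    },
    "shipment_reschedule": {
        "description": "Reschedule delivery to a new date window.",
        "risk": "write",   # gated by Guard Agent + identity verification
        "timeout_s": 5,
        "fallback": "Do not retry automatically; hand off to a human.",
    },
}

def tools_for_permissions(allowed_risk: set) -> list:
    # Least privilege: expose only tools whose risk level the
    # Guard Agent has permitted for this request.
    return sorted(n for n, t in TOOLS.items() if t["risk"] in allowed_risk)
```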
Identity and Authorization
Roles:
- Sender/shipper
- Recipient
- Courier/rider
- B2B merchant account
Verification Methods:
- OTP to registered phone
- Confirm delivery postcode / shipping details
- Logged-in session token
Baseline Rule: Read-only intents can be handled with low identity confidence. Write actions require explicit verification and eligibility checks.
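The baseline rule translates into a short authorization check. The confidence threshold and return values here are illustrative assumptions:

```python
def authorize(action_type: str, identity_confidence: float,
              verified: bool) -> str:
    """Baseline rule: reads tolerate low identity confidence,
    writes require explicit verification."""
    if action_type == "read":
        return "allow"
    if action_type == "write":
        # 0.9 is a hypothetical threshold; tune per deployment.
        if verified and identity_confidence >= 0.9:
            return "allow"
        return "verify_first"
    return "escalate"  # unknown action types go to a human
```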
Rollout Phases
Following the principle of starting simple and adding complexity only when needed, we recommend a phased rollout:
Phase 0: Foundations
- Build golden evaluation dataset
- Define risk tiers and decision matrix
- Implement trace logging
- Establish sign-off metrics (accuracy thresholds, latency targets)
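A golden-dataset record from Phase 0 might look like the sketch below. The field names are assumptions; the point is that each record pairs a real inquiry with labeled expectations, so accuracy against the sign-off threshold is a simple comparison:

```python
# Hypothetical golden-dataset record: a labeled inquiry with the
# expected classification and outcome, used to gate launch.
GOLDEN_EXAMPLE = {
    "input": "Where is my parcel? Tracking number ABC123",
    "expected_intent": "TRACK_STATUS",
    "expected_risk": "LOW",
    "expected_outcome": "auto_contained",  # vs. "escalated"
}

def intent_accuracy(predictions: list, golden: list) -> float:
    """Fraction of predictions matching the golden intent labels."""
    hits = sum(p["intent"] == g["expected_intent"]
               for p, g in zip(predictions, golden))
    return hits / len(golden)
```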
Phase 1: Low-Risk Containment
- Deploy Guard Agent + Orchestrator Agent for tracking status, ETA, FAQs
- Read-only tools only
- Clear escalation rules for anything outside low-risk
Phase 2: Human Escalation + Triage
- Add escalation handling capabilities
- Auto-categorization and priority assignment
- CSAT/sentiment tracking
- Handoff summarization for human agents
Phase 3: Controlled Actions via Backend Tools
- Enable limited write actions (reschedule, address change, claim initiation)
- Verification gates required
- Tool permission matrix enforced by Guard Agent
- Full audit trail for all write operations
Success Criteria
Worked Example: Low-Risk Trace
User message: "Where is my parcel? Tracking number ABC123"
- Guard Agent: Evaluates request. No policy violations detected. Allows read tools. Masks PII in responses. Passes to Orchestrator.
- Orchestrator Agent - Classify: Intent = TRACK_STATUS, confidence = 0.95, risk = LOW, entity = tracking_number: ABC123
- Orchestrator Agent - Plan: Call get_shipment_status(ABC123), get_last_scan_event(ABC123), get_eta(ABC123)
- Tool Execution: Returns status = "In Transit", last_scan = "Sorting facility, 2hrs ago", ETA = "Tomorrow 2-6pm"
- Orchestrator Agent - Respond: Drafts customer-facing message with status, last scan, and ETA
- Orchestrator Agent - Validate: Checks groundedness (all facts from tools ✓), policy alignment (no overpromising ✓), tone (professional ✓)
- Send + Log: Response delivered. Full trace logged for audit.
Outcome: Auto-contained. 2 LLM calls total. Customer received accurate, grounded response.
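The "full trace logged for audit" step above can be as simple as serializing each stage of the chain into one JSON line per session. A minimal sketch, with a hypothetical record shape:

```python
import json
import time

def log_trace(session_id: str, steps: list) -> str:
    """Serialize an end-to-end trace (guard decision, orchestrator
    phases, tool calls) as one auditable JSON line."""
    record = {
        "session_id": session_id,
        "ts": time.time(),
        "steps": steps,  # e.g. [{"stage": "guard", "llm_call": True}, ...]
        "llm_calls": sum(1 for s in steps if s.get("llm_call")),
    }
    return json.dumps(record)
```

For the worked example, the trace would contain two steps with `llm_call` set (Guard and Orchestrator), matching the two-call total.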
Security, Privacy, and Compliance
- PII masking: Sensitive data masked in logs and responses per data retention rules
- Tenant isolation: For multi-tenant deployments, strict data separation
- Prompt injection resistance: Guard Agent filters malicious inputs before processing
- Full audit trails: Every tool call, every approval event logged with timestamps
- Transparency: Inform users when they are interacting with an AI system, where appropriate and required by policy
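To illustrate the PII-masking requirement, here is a minimal redaction sketch. The patterns are illustrative only and nowhere near a complete redaction policy; production masking should follow your data retention rules:

```python
import re

# Illustrative patterns only: emails and phone-like digit runs.
PATTERNS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "[EMAIL]"),
    (re.compile(r"\+?\d[\d\s-]{7,}\d"), "[PHONE]"),
]

def mask_pii(text: str) -> str:
    """Replace matched PII spans with placeholder tokens before
    the text reaches logs or customer-facing responses."""
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text
```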
Known Limitations
- Long-tail intents and ambiguous policy requests may require human review
- Tool downtime and inconsistent backend data can affect response quality
- Fraud and disputes requiring strong verification remain human-reviewed
- Liability-heavy decisions should remain human-reviewed
- Cross-session memory not included in this baseline (stateless, single-session interactions)
Conclusion
This baseline standard provides a starting point for building production-grade agentic customer chat systems. The key principles:
- Start simple: Two agents (Guard + Orchestrator) meet most enterprise requirements
- Preserve security boundaries: The Guard Agent is non-negotiable
- Risk-tier everything: Automate low-risk, gate medium-risk, escalate high-risk
- Evals before production: No golden dataset, no launch
- Design for graceful degradation: Human escalation is a feature
This is version 1.0 of the baseline. We welcome feedback from practitioners building agentic systems in production.
Thinking about AI automation for your customer operations?
We've deployed agentic systems that reduced support OPEX by 77% while maintaining 98% accuracy. Whether you're exploring your first AI pilot or scaling an existing implementation, our team can help you avoid the pitfalls we've already solved.
📅 Book a call with us: https://chatgenie.ph/book-a-call
Let's talk about what's possible for your workflow.
Reference: This baseline draws on principles from Anthropic's Building Effective Agents (Erik Schluntz and Barry Zhang, December 2024) and OpenAI's A Practical Guide to Building Agents (2025). Available at anthropic.com/research/building-effective-agents






