The challenge
Support team of 12 fielding 4,000+ WhatsApp messages a week. ~60% of those messages were repeatable: "what's my balance," "I have no service," "send my last bill." But none of them fit a chatbot: customers expected natural conversation in Spanish, with regional slang and frequent mid-message context switches.
Our solution
A Claude-based agent with four tools: `lookup_account`, `get_billing_status`, `report_outage`, and `escalate_to_human`. The system prompt enforced strict scope: refuse anything off-topic, and hand off to a human within two turns if the agent was unsure. Evals: 80 hand-curated test cases across four buckets, run on every prompt change.
The results
After 30 days in production: 40% of inbound tickets resolved end-to-end by the agent. Median time-to-first-response dropped from 14 minutes to 6 seconds. Customer satisfaction (CSAT) on AI-resolved tickets matched human-resolved tickets within ±2 points. Support team morale is up: they now spend their time on the hard cases.
What we shipped vs. what we didn’t
Shipped: Account lookup, balance check, outage reporting, escalation. Strict refusal of anything outside scope (with logging, so the product team can see what customers were trying to do that we didn’t support yet).
Didn’t ship in v1: Anything that mutates account state — no payments, no plan changes, no service activations through the agent. Read-only first, mutations later, with explicit confirmation. We were not going to be the team that got AI to turn off someone’s power.
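To make the read-only rule concrete, here is a minimal sketch of how a v1 tool surface like this could be declared and checked. Only the four tool names come from this write-up; the `ToolSpec` shape, the `mutatesAccountState` flag, the descriptions, and the `refuseAndLog` helper are hypothetical, and this is not the production code.

```typescript
// Tool names are from the write-up; every other name here is an assumption.
type ToolName =
  | "lookup_account"
  | "get_billing_status"
  | "report_outage"
  | "escalate_to_human";

interface ToolSpec {
  name: ToolName;
  description: string; // illustrative wording, not the real tool descriptions
  // True if the tool would change account state (payments, plan changes, activations).
  mutatesAccountState: boolean;
}

const V1_TOOLS: ToolSpec[] = [
  { name: "lookup_account", description: "Find the customer's account.", mutatesAccountState: false },
  { name: "get_billing_status", description: "Read balance and last bill.", mutatesAccountState: false },
  { name: "report_outage", description: "File an outage report.", mutatesAccountState: false },
  { name: "escalate_to_human", description: "Hand off to a support agent.", mutatesAccountState: false },
];

// Returns the names of any tools that would break the read-only rule.
function findMutatingTools(tools: ToolSpec[]): ToolName[] {
  return tools.filter((t) => t.mutatesAccountState).map((t) => t.name);
}

interface RefusalLogEntry {
  requestedAction: string;
  loggedAt: string; // ISO 8601 timestamp
}

const refusalLog: RefusalLogEntry[] = [];

// Refuse an out-of-scope request, but record it so the product team can
// see what customers asked for that is not supported yet.
function refuseAndLog(requestedAction: string, now: Date = new Date()): "out_of_scope" {
  refusalLog.push({ requestedAction, loggedAt: now.toISOString() });
  return "out_of_scope";
}
```

A startup check could call `findMutatingTools(V1_TOOLS)` and refuse to boot if the list is non-empty, so a mutating tool cannot slip into the agent without someone deliberately changing the rule.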
The eval discipline
80 cases across four buckets (correct-answer, must-refuse, must-escalate, must-stay-in-voice). Cases lived in a flat JSON file in the same repo as the prompt and ran via `npm run evals` on every PR; a GitHub Action blocked the merge if any bucket regressed by more than 10%. This was the unsexy work that made the difference between a demo and a production system.
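The per-bucket gate described above can be sketched as follows. The bucket names are from this write-up; the `CaseResult` and `PassRates` types, the function names, and the reading of "10%" as an absolute drop of 10 percentage points in pass rate against a stored baseline are assumptions for illustration, not the team's actual script.

```typescript
type Bucket =
  | "correct-answer"
  | "must-refuse"
  | "must-escalate"
  | "must-stay-in-voice";

const BUCKETS: Bucket[] = [
  "correct-answer",
  "must-refuse",
  "must-escalate",
  "must-stay-in-voice",
];

// One graded eval case (hypothetical shape).
interface CaseResult {
  bucket: Bucket;
  passed: boolean;
}

type PassRates = Record<Bucket, number>;

// Fraction of passing cases per bucket; an empty bucket counts as 0.
function passRates(results: CaseResult[]): PassRates {
  const rates = {} as PassRates;
  for (const bucket of BUCKETS) {
    const inBucket = results.filter((r) => r.bucket === bucket);
    const passed = inBucket.filter((r) => r.passed).length;
    rates[bucket] = inBucket.length === 0 ? 0 : passed / inBucket.length;
  }
  return rates;
}

// Buckets whose pass rate dropped by more than maxDrop versus the baseline.
// Assumption: maxDrop is an absolute difference (0.10 = 10 percentage points).
function regressedBuckets(
  baseline: PassRates,
  current: PassRates,
  maxDrop: number = 0.1,
): Bucket[] {
  return BUCKETS.filter((b) => baseline[b] - current[b] > maxDrop);
}
```

A CI step could compute `passRates` for the PR branch, compare it with the rates stored for the main branch, and exit non-zero when `regressedBuckets` returns anything. Gating per bucket rather than on the overall pass rate means a drop in, say, must-refuse cannot be hidden by gains elsewhere.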