You can’t ship an LLM feature without evals, but you also can’t justify a $40k eval platform for a feature that handles forty queries a week. There’s a middle path. This is what we ship with by default at Bamboo.
The four-bucket test
For any LLM feature, we want to know:
- Did it get the right answer? Accuracy on a held-out set of known-good Q&A pairs.
- Did it refuse when it should have? Refusal rate on adversarial / out-of-scope inputs.
- Did it stay in voice? Tone/format adherence — usually an LLM-as-judge pass.
- Did it cost what we expected? Tokens per request, p50 and p95.
If any of these regress more than 10% between two releases, we don’t ship.
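A minimal sketch of that gate, assuming "10%" means a relative change against the previous release and that each run is boiled down to one number per check. The `RunSummary` fields and the `percentile` and `regressions` helpers are hypothetical names for illustration, not a description of our actual code.

```typescript
// Hypothetical summary of one eval run; field names are illustrative.
interface RunSummary {
  accuracy: number;    // fraction correct on the held-out set, 0..1
  refusalRate: number; // fraction of out-of-scope inputs correctly refused, 0..1
  voiceScore: number;  // fraction of outputs the judge marked in-voice, 0..1
  tokensP50: number;   // median tokens per request
  tokensP95: number;   // 95th-percentile tokens per request
}

// Nearest-rank percentile over a non-empty sample (used for p50 / p95 tokens).
function percentile(values: number[], p: number): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// For quality metrics a drop is a regression; for token counts a rise is.
const HIGHER_IS_BETTER: Record<keyof RunSummary, boolean> = {
  accuracy: true,
  refusalRate: true,
  voiceScore: true,
  tokensP50: false,
  tokensP95: false,
};

// Returns the metrics that got worse by more than `threshold` (relative to baseline).
function regressions(
  baseline: RunSummary,
  candidate: RunSummary,
  threshold: number = 0.1,
): (keyof RunSummary)[] {
  const keys = Object.keys(HIGHER_IS_BETTER) as (keyof RunSummary)[];
  return keys.filter((k) => {
    const base = baseline[k];
    if (base === 0) return false; // no baseline to compare against
    const change = (candidate[k] - base) / base;
    const worsening = HIGHER_IS_BETTER[k] ? -change : change;
    return worsening > threshold;
  });
}
```

A non-empty result from `regressions(lastRelease, thisRelease)` would block the release; an empty array means every check stayed within the threshold.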
What “minimum viable” looks like
- 30–50 cases per bucket. Not 30,000. The first 30 catch 80% of regressions; the rest are diminishing returns until you have real users complaining.
- A flat JSON file per feature, committed to git. No database, no eval platform. Tracked alongside the prompt itself, because the eval set IS part of the prompt’s contract.
- An `npm run evals` script that runs all buckets in parallel and prints a diff against the last committed run. We added a GitHub Action that runs it on every PR touching the prompt.
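The core of such a script can be quite small. The sketch below is one possible shape, not our actual implementation: the `EvalCase` fields, the `Grader` callback (which would wrap the model call and the check), and the `runBuckets` and `diffRuns` helpers are all assumed names. It covers only pass/fail buckets; token cost would be recorded as a separate number.

```typescript
// Hypothetical shape of one entry in the per-feature JSON file.
interface EvalCase {
  id: string;
  bucket: string;   // e.g. "accuracy", "refusal", "voice"
  input: string;
  expected: string; // known-good answer, or a marker such as "REFUSE"
}

// Caller-supplied check: runs the feature on a case and decides pass/fail.
type Grader = (c: EvalCase) => Promise<boolean>;

// Runs every bucket concurrently and returns the pass rate per bucket.
async function runBuckets(
  cases: EvalCase[],
  grade: Grader,
): Promise<Record<string, number>> {
  const buckets = cases
    .map((c) => c.bucket)
    .filter((b, i, all) => all.indexOf(b) === i); // unique bucket names
  const rates = await Promise.all(
    buckets.map(async (bucket) => {
      const inBucket = cases.filter((c) => c.bucket === bucket);
      const results = await Promise.all(inBucket.map((c) => grade(c)));
      const passed = results.filter((ok) => ok).length;
      return passed / inBucket.length;
    }),
  );
  const summary: Record<string, number> = {};
  buckets.forEach((bucket, i) => {
    summary[bucket] = rates[i];
  });
  return summary;
}

// Formats one line per bucket comparing the last committed run with this one.
function diffRuns(
  previous: Record<string, number>,
  current: Record<string, number>,
): string[] {
  return Object.keys(current)
    .sort()
    .map((bucket) => {
      const before = previous[bucket];
      const after = current[bucket];
      if (before === undefined) return `${bucket}: (new) ${after.toFixed(2)}`;
      const delta = after - before;
      const sign = delta >= 0 ? "+" : "";
      return `${bucket}: ${before.toFixed(2)} -> ${after.toFixed(2)} (${sign}${delta.toFixed(2)})`;
    });
}
```

In this sketch the summary object returned by `runBuckets` is what gets committed next to the cases file, so the next run has something to diff against.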
What we DON’T do at this stage
- Real-time eval dashboards
- Multi-judge consensus voting
- Statistical significance testing
- Anything that requires a frontend
These are valuable. They are also Phase 3 problems, not Phase 1 problems. Ship the feature, see if anyone uses it, then come back for the dashboard.
The one thing you can’t skip
The 30 hand-curated test cases. Not LLM-generated, not scraped from logs — you sit down with the product owner for two hours and write the cases they’d actually want to see pass. This is the most undervalued and most important step. Skip it and your evals will be measuring noise.