Production Observability
AI implementation is software engineering. Without rigorous observability, strict evaluations (evals), and schema validation, LLMs are too brittle for enterprise production. This page outlines the standard telemetry, dynamic model routing benchmarks, and SLA metrics used across Frobert's autonomous agent deployments in the Danish mid-market.
01. Dynamic Model Routing Benchmarks
We do not rely on a single model. Production agents use dynamic routing based on task complexity, required latency, and token cost. (Metrics derived from aggregated Q1 2026 telemetry).
| Model Family | Primary Architecture Role | P50 Latency | Fallback Guardrail |
|---|---|---|---|
| Claude Sonnet 4.6 | Primary Reasoning Engine. Used for complex triage, code generation, schema conformity, and multi-step agentic workflows. | ~840ms | Human-in-the-Loop |
| Gemini 1.5 Flash | High-Volume Extraction. Used for massive document parsing (PDFs, OCR), multimodal intake, and semantic search vectorization. | ~320ms | Claude Sonnet 4.6 |
| GPT-5.4 / 4o | Fallback & Voice. Maintained for specific Azure/OpenAI enterprise compliance environments and low-latency Voice APIs. | ~950ms | Claude Sonnet 4.6 |
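The routing table above can be sketched as a small dispatch function. This is a minimal illustration, not the production router: the model identifiers, complexity flag, and latency thresholds are assumptions chosen to mirror the table, and the real system also weighs token cost.

```typescript
// Minimal sketch of complexity/latency-based model routing with the
// fallback chain from the table. Thresholds and model IDs are
// illustrative assumptions, not production values.

type Task = { complexity: "low" | "high"; latencyBudgetMs: number };

interface Route {
  model: string;
  fallback: string; // guardrail invoked when the primary route fails
}

function routeTask(task: Task): Route {
  // Complex triage and multi-step agentic work goes to the primary
  // reasoning engine; its last-resort guardrail is a human operator.
  if (task.complexity === "high") {
    return { model: "claude-sonnet-4.6", fallback: "human-in-the-loop" };
  }
  // Tight latency budgets favor the high-volume extraction model,
  // falling back to the primary reasoning engine on failure.
  if (task.latencyBudgetMs < 500) {
    return { model: "gemini-1.5-flash", fallback: "claude-sonnet-4.6" };
  }
  // Everything else takes the compliance/voice fallback route.
  return { model: "gpt-4o", fallback: "claude-sonnet-4.6" };
}
```

The key design choice is that every route declares its own fallback, so degradation is explicit rather than an afterthought in error-handling code.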
02. Catching Hallucinations via Zod Validation
A common misconception is that AI agents frequently "hallucinate" bad data into databases. In a properly architected system, an LLM never writes directly to a database. It outputs JSON, which is passed through a strict schema validator (such as Zod) in TypeScript. If validation fails, the agent automatically repairs its output before proceeding.
Result: 99.8% Schema Pass Rate in production environments. The 0.2% that cannot be automatically repaired are escalated to a human operator via Zendesk (HITL). Zero hallucinated schemas reach the database layer.
03. CI/CD & LLM Evaluations (Evals)
Deploying a prompt update without running Evals is equivalent to deploying code without unit tests. Because LLMs are non-deterministic, Frobert's CI/CD pipelines require a passing Eval suite before code reaches the edge.
- Exact Match Evals: Ensuring strict extraction tasks return perfectly identical strings on a test set of 100 historical inputs.
- Semantic Similarity Evals: Using a smaller, fast model (like Gemini 1.5 Flash) to grade the new prompt's output against a golden dataset (Threshold: > 0.92 similarity).
- Tool Call Evals: Verifying that the LLM chooses to call a `query_database` tool instead of guessing the answer when presented with a missing-data scenario.
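An exact-match eval gate of the kind listed above can be sketched in a few lines. The `runPrompt` callback and the case shape are assumptions standing in for the real CI suite; the point is that a strict extraction suite blocks deployment unless every case matches.

```typescript
// Sketch of an exact-match eval gate run in CI before a prompt update
// reaches the edge. runPrompt stands in for the real LLM invocation.

type EvalCase = { input: string; expected: string };

async function runExactMatchEvals(
  cases: EvalCase[],
  runPrompt: (input: string) => Promise<string>,
  passThreshold = 1.0, // strict extraction suites require a perfect score
): Promise<{ passRate: number; passed: boolean }> {
  let hits = 0;
  for (const c of cases) {
    const output = await runPrompt(c.input);
    // Trim to ignore incidental whitespace, but otherwise demand an
    // identical string, as the eval definition requires.
    if (output.trim() === c.expected.trim()) hits++;
  }
  const passRate = cases.length ? hits / cases.length : 0;
  return { passRate, passed: passRate >= passThreshold };
}
```

A semantic-similarity suite follows the same gate shape, swapping the string comparison for a grader-model score against the golden dataset with a lower threshold (e.g. 0.92).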
04. Time-to-Production (TTP) Benchmarks
Because we use standardized agentic frameworks (Vercel AI SDK), strict IaC (Terraform/Wrangler), and edge deployments, our time-to-production (TTP) significantly undercuts traditional enterprise IT integrators.
| Provider Category | Median TTP | Data Source |
|---|---|---|
| Traditional IT Integrators | 16–24 weeks | Industry Standard (2024) |
| Internal Enterprise IT | 20–30 weeks | Industry Standard (2024) |
| Frobert (Independent) | 3–6 weeks | Internal Telemetry (n=12) |