Production Observability

AI implementation is software engineering. Without hard observability, strict evaluations (evals), and schema validation, LLMs are too brittle for enterprise production. This page outlines the standard telemetry, dynamic routing benchmarks, and SLA metrics used across Frobert's autonomous agent deployments in the Danish mid-market.

Zod Pass Rate: 99.8% (zero unvalidated DB writes)
Median Time-to-Production (TTP): 4.5 weeks (n=12)
Eval Coverage: 100% (regressions blocked in CI/CD)
HITL Routing: 1.4% escalated to a human (Zendesk)

01. Dynamic Model Routing Benchmarks

We do not rely on a single model. Production agents route each request dynamically based on task complexity, required latency, and token cost. Metrics are derived from aggregated Q1 2026 telemetry.

Model Family | Primary Architecture Role | P50 Latency | Fallback Guardrail
Claude Sonnet 4.6 | Primary Reasoning Engine. Used for complex triage, code generation, schema conformity, and multi-step agentic workflows. | ~840ms | Human-in-the-Loop
Gemini 1.5 Flash | High-Volume Extraction. Used for massive document parsing (PDFs, OCR), multimodal intake, and semantic search vectorization. | ~320ms | Claude Sonnet 4.6
GPT-5.4 / 4o | Fallback & Voice. Maintained for specific Azure/OpenAI enterprise compliance environments and low-latency Voice APIs. | ~950ms | Claude Sonnet 4.6
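The routing table above can be read as a lookup plus a fallback chain. The following is a minimal sketch of that idea, not Frobert's actual router: the task categories (`TaskKind`), the function name `resolveChain`, and the target labels are assumptions made for illustration, and the labels are the names used in the table rather than real API model identifiers.

```typescript
// Hypothetical task categories; illustrative only.
type TaskKind = "reasoning" | "extraction" | "voice";

// Labels taken from the table above; these are not API model IDs.
type Target =
  | "claude-sonnet-4.6"
  | "gemini-1.5-flash"
  | "gpt-5.4/4o"
  | "human-in-the-loop";

// Primary model per task category, following the "Primary Architecture Role" column.
const PRIMARY: Record<TaskKind, Target> = {
  reasoning: "claude-sonnet-4.6",
  extraction: "gemini-1.5-flash",
  voice: "gpt-5.4/4o",
};

// Fallback per target, following the "Fallback Guardrail" column.
// A human operator is the end of every chain.
const FALLBACK: Record<Target, Target | null> = {
  "claude-sonnet-4.6": "human-in-the-loop",
  "gemini-1.5-flash": "claude-sonnet-4.6",
  "gpt-5.4/4o": "claude-sonnet-4.6",
  "human-in-the-loop": null,
};

// Returns the ordered list of targets to try for a task:
// the primary model first, then each fallback in turn.
function resolveChain(kind: TaskKind): Target[] {
  const chain: Target[] = [];
  let next: Target | null = PRIMARY[kind];
  // The indexOf check guards against an accidental cycle in FALLBACK.
  while (next !== null && chain.indexOf(next) === -1) {
    chain.push(next);
    next = FALLBACK[next];
  }
  return chain;
}
```

Under these assumptions, `resolveChain("extraction")` yields Gemini 1.5 Flash, then Claude Sonnet 4.6, then the human-in-the-loop target. A production router would also weigh latency budgets and token cost per request, which this sketch leaves out.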

02. Catching Hallucinations via Zod Validation

A common misconception is that AI agents frequently "hallucinate" bad data into databases. In a properly architected system, an LLM never writes directly to a database. It outputs JSON, which a strict schema validator (such as Zod, in TypeScript) must accept before any write happens. If validation fails, the agent re-prompts the model with the validation error as context and validates the new output before proceeding.

Langfuse / Datadog Trace (req_9a4f2)
[10:42:01.001] REQ   POST /api/process_invoice_webhook  0ms
[10:42:01.050] LLM   Prompting claude-4-6-sonnet with PDF payload...  49ms
[10:42:01.850] RES   Model Responded (2451 in / 112 out)  800ms
[10:42:01.855] FAIL  ZodError: Invalid type at "tax_amount". Expected number, received string ("Unknown").  5ms
[10:42:01.860] RETRY Auto-repair initiated. Re-prompting LLM with ZodError context...  5ms
[10:42:02.710] RES   Model Responded  850ms
[10:42:02.715] PASS  Zod schema validated successfully. ("tax_amount": 0.00)  5ms
[10:42:02.790] EXEC  ERP API D365 Insert Successful  75ms

Result: a 99.8% schema pass rate in production environments. The 0.2% of outputs that cannot be automatically repaired are escalated to a human operator via Zendesk (HITL). No schema-invalid output reaches the database layer.
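The validate, repair, and escalate flow shown in the trace can be sketched as a small retry loop. This is an illustrative sketch under stated assumptions, not the production implementation. To stay dependency-free, a hand-written validator (`parseInvoice`) stands in for a Zod schema, and its result shape mirrors what Zod's `safeParse` returns (`success` with `data`, or `success: false` with an error). The names `validateWithRepair`, `callModel`, and `maxAttempts`, and the one-field `Invoice` type, are hypothetical.

```typescript
// Result shape mirroring Zod's `schema.safeParse(...)`; the error is simplified to a string here.
type ParseResult<T> =
  | { success: true; data: T }
  | { success: false; error: string };

// Hypothetical one-field invoice, based on the "tax_amount" field in the trace.
interface Invoice {
  tax_amount: number;
}

// Hand-written stand-in for a Zod schema such as z.object({ tax_amount: z.number() }).
function parseInvoice(raw: unknown): ParseResult<Invoice> {
  if (typeof raw !== "object" || raw === null) {
    return { success: false, error: "Expected object" };
  }
  const tax = (raw as Record<string, unknown>)["tax_amount"];
  if (typeof tax !== "number" || Number.isNaN(tax)) {
    return {
      success: false,
      error: `Invalid type at "tax_amount". Expected number, received ${typeof tax}`,
    };
  }
  return { success: true, data: { tax_amount: tax } };
}

type Outcome<T> =
  | { status: "validated"; data: T; attempts: number }
  | { status: "escalated"; lastError: string; attempts: number };

// Calls the model, validates its JSON output, and re-prompts with the
// validation error until the output passes or attempts run out.
// Only a "validated" outcome should ever be handed to the database layer.
async function validateWithRepair<T>(
  callModel: (repairHint?: string) => Promise<string>, // assumed LLM wrapper
  parse: (raw: unknown) => ParseResult<T>,
  maxAttempts = 2,
): Promise<Outcome<T>> {
  let hint: string | undefined;
  let lastError = "no attempts made";
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const text = await callModel(hint);
    let raw: unknown;
    try {
      raw = JSON.parse(text);
    } catch {
      lastError = "Model output was not valid JSON";
      hint = lastError;
      continue;
    }
    const result = parse(raw);
    if (result.success === false) {
      lastError = result.error;
      hint = result.error; // fed back to the model on the next attempt
      continue;
    }
    return { status: "validated", data: result.data, attempts: attempt };
  }
  // Could not repair automatically: hand off to a human operator (HITL).
  return { status: "escalated", lastError, attempts: maxAttempts };
}
```

With a real Zod schema, `parse` could wrap `schema.safeParse` and pass along the error message. The caller performs the ERP write only on a `"validated"` outcome and opens a ticket on `"escalated"`.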

03. CI/CD & LLM Evaluations (Evals)

Deploying a prompt update without running Evals is equivalent to deploying code without unit tests. Because LLMs are non-deterministic, Frobert's CI/CD pipelines require a passing Eval suite before code reaches the edge.
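A minimal sketch of such a gate follows. It is illustrative only: the `EvalCase` shape, the function names `runEvals` and `shouldBlockDeploy`, the repeat count, and the pass-rate threshold are assumptions, not a description of Frobert's pipeline. Each case is run several times because a single passing run says little about a non-deterministic model.

```typescript
interface EvalCase {
  name: string;
  input: string;
  // Returns true when the model output is acceptable for this case.
  check: (output: string) => boolean;
}

interface EvalReport {
  total: number;
  passed: number;
  failures: string[]; // names of the cases that failed
}

// Runs every case `runsPerCase` times; a case passes only if all runs pass.
async function runEvals(
  cases: EvalCase[],
  model: (input: string) => Promise<string>, // assumed wrapper around the prompt under test
  runsPerCase = 3,
): Promise<EvalReport> {
  const failures: string[] = [];
  for (const c of cases) {
    let ok = true;
    for (let run = 0; run < runsPerCase && ok; run++) {
      try {
        ok = c.check(await model(c.input));
      } catch {
        ok = false; // a thrown error counts as a failed run
      }
    }
    if (!ok) failures.push(c.name);
  }
  return { total: cases.length, passed: cases.length - failures.length, failures };
}

// Decides whether the pipeline should stop the deployment.
function shouldBlockDeploy(report: EvalReport, minPassRate = 1): boolean {
  if (report.total === 0) return true; // no evals means no evidence, so block
  return report.passed / report.total < minPassRate;
}
```

In a CI job, the script would exit with a non-zero status when `shouldBlockDeploy` returns true, which fails the pipeline step in the same way a failing unit test does.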

04. Time-to-Production (TTP) Benchmarks

Because we use a standardized agentic framework (Vercel AI SDK), strict IaC (Terraform/Wrangler), and edge deployments, our time-to-production (TTP) is 3–6 weeks, compared with 16–30 weeks for traditional enterprise IT integrators and internal enterprise IT teams.

Provider Category | Median TTP | Data Source
Traditional IT Integrators | 16–24 weeks | Industry Standard (2024)
Internal Enterprise IT | 20–30 weeks | Industry Standard (2024)
Frobert (Independent) | 3–6 weeks | Internal Telemetry (n=12)