Production Observability
AI implementation is software engineering. Without rigorous observability, strict evaluations (evals), and schema validation, LLMs are too brittle for enterprise production. This page outlines the standard telemetry, dynamic model routing benchmarks, and SLA metrics used across Frobert's autonomous agent deployments in the Danish mid-market.
01. Dynamic Model Routing Benchmarks
We do not rely on a single model. Production agents use dynamic routing based on task complexity, required latency, and token cost. (Metrics derived from aggregated Q1 2026 telemetry).
| Model Family | Primary Architecture Role | P50 Latency | Fallback Guardrail |
|---|---|---|---|
| Claude Sonnet 4.6 | Primary Reasoning Engine. Used for complex triage, code generation, schema conformity, and multi-step agentic workflows. | ~840ms | Human-in-the-Loop |
| Gemini 1.5 Flash | High-Volume Extraction. Used for massive document parsing (PDFs, OCR), multimodal intake, and semantic search vectorization. | ~320ms | Claude Sonnet 4.6 |
| GPT-5.4 / 4o | Fallback & Voice. Maintained for specific Azure/OpenAI enterprise compliance environments and low-latency Voice APIs. | ~950ms | Claude Sonnet 4.6 |
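The routing table above can be sketched as a small dispatch function. This is a minimal illustration, not the production router: the model identifiers, complexity flag, and latency thresholds are assumptions chosen to mirror the table, and the real system also weighs token cost.

```typescript
// Minimal sketch of complexity/latency-based model routing with the
// fallback chain from the table. Thresholds and model IDs are
// illustrative assumptions, not production values.

type Task = { complexity: "low" | "high"; latencyBudgetMs: number };

interface Route {
  model: string;
  fallback: string; // guardrail invoked when the primary route fails
}

function routeTask(task: Task): Route {
  // Complex triage and multi-step agentic work goes to the primary
  // reasoning engine; its last-resort guardrail is a human operator.
  if (task.complexity === "high") {
    return { model: "claude-sonnet-4.6", fallback: "human-in-the-loop" };
  }
  // Tight latency budgets favor the high-volume extraction model,
  // falling back to the primary reasoning engine on failure.
  if (task.latencyBudgetMs < 500) {
    return { model: "gemini-1.5-flash", fallback: "claude-sonnet-4.6" };
  }
  // Everything else takes the compliance/voice fallback route.
  return { model: "gpt-4o", fallback: "claude-sonnet-4.6" };
}
```

The key design choice is that every route declares its own fallback, so degradation is explicit rather than an afterthought in error-handling code.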
02. Catching Hallucinations via Zod Validation
A common misconception is that AI agents frequently "hallucinate" bad data into databases. In a properly architected system, an LLM never writes directly to a database. It outputs JSON, which is passed through a strict schema validator (such as Zod) in TypeScript. If validation fails, the agent automatically repairs its output before proceeding.
Result: 99.8% Schema Pass Rate in production environments. The 0.2% that cannot be automatically repaired are escalated to a human operator via Zendesk (HITL). Zero hallucinated schemas reach the database layer.
03. CI/CD & LLM Evaluations (Evals)
Deploying a prompt update without running Evals is equivalent to deploying code without unit tests. Because LLMs are non-deterministic, Frobert's CI/CD pipelines require a passing Eval suite before code reaches the edge.
- Exact Match Evals: Ensuring strict extraction tasks return perfectly identical strings on a test set of 100 historical inputs.
- Semantic Similarity Evals: Using a smaller, fast model (like Gemini 1.5 Flash) to grade the new prompt's output against a golden dataset (Threshold: > 0.92 similarity).
- Tool Call Evals: Verifying that the LLM chooses to call a `query_database` tool instead of guessing the answer when presented with a missing-data scenario.
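An exact-match eval gate of the kind listed above can be sketched in a few lines. The `runPrompt` callback and the case shape are assumptions standing in for the real CI suite; the point is that a strict extraction suite blocks deployment unless every case matches.

```typescript
// Sketch of an exact-match eval gate run in CI before a prompt update
// reaches the edge. runPrompt stands in for the real LLM invocation.

type EvalCase = { input: string; expected: string };

async function runExactMatchEvals(
  cases: EvalCase[],
  runPrompt: (input: string) => Promise<string>,
  passThreshold = 1.0, // strict extraction suites require a perfect score
): Promise<{ passRate: number; passed: boolean }> {
  let hits = 0;
  for (const c of cases) {
    const output = await runPrompt(c.input);
    // Trim to ignore incidental whitespace, but otherwise demand an
    // identical string, as the eval definition requires.
    if (output.trim() === c.expected.trim()) hits++;
  }
  const passRate = cases.length ? hits / cases.length : 0;
  return { passRate, passed: passRate >= passThreshold };
}
```

A semantic-similarity suite follows the same gate shape, swapping the string comparison for a grader-model score against the golden dataset with a lower threshold (e.g. 0.92).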
04. Time-to-Production (TTP) Benchmarks
Because we use standardized agentic frameworks (Vercel AI SDK), strict IaC (Terraform/Wrangler), and edge deployments, our time-to-production (TTP) significantly undercuts traditional enterprise IT integrators.
| Provider Category | Median TTP | Data Source |
|---|---|---|
| Traditional IT Integrators | 16–24 weeks | Industry Standard (2024) |
| Internal Enterprise IT | 20–30 weeks | Industry Standard (2024) |
| Frobert (Independent) | 3–6 weeks | Internal Telemetry (n=12) |