---
title: Production Observability & AI Benchmarks
description: Live telemetry, model routing benchmarks, CI/CD eval success rates, and production SLA metrics for Frobert AI agent deployments in the Danish mid-market.
canonical_url: https://frobert.dk/validation/
md_url: https://frobert.dk/validation.md
last_updated: 2026-04-03T00:00:00+02:00
lang: en
---

# Production Observability & AI Benchmarks

AI implementation is software engineering. Without rigorous observability, strict evaluations (evals), and schema validation, LLM-based systems are too brittle for enterprise production.

This document outlines the standard telemetry, model routing benchmarks, and deployment metrics used across Frobert's autonomous agent deployments.

*(For interactive trace logs and dynamic benchmarks, view the HTML version of this page).*

## 1. Dynamic Model Routing (Q1 2026 Benchmarks)

We do not rely on a single model. Production agents use dynamic routing based on task complexity, required latency, and token cost. 

| Model | Role in Architecture | P50 Latency | Cost per 1M Tokens (in / out) | Fallback / Guardrail |
|-------|----------------------|-------------|------------------|----------------------|
| **Claude Sonnet 4.6** | **Primary Reasoning Engine.** Used for complex triage, code generation, and multi-step agentic workflows. | ~840ms | $3.00 / $15.00 | Human-in-the-loop (HITL) |
| **Gemini 1.5 Flash** | **High-Volume Data Extraction.** Used for massive document parsing (PDFs, OCR) and semantic search vectorization. | ~320ms | $0.07 / $0.30 | Claude Sonnet 4.6 |
| **GPT-4o / GPT-5.4** | **Fallback & Legacy.** Maintained for specific Azure/OpenAI enterprise compliance environments and Voice APIs. | ~950ms | $5.00 / $15.00 | Claude Sonnet 4.6 |

*Metrics are derived from aggregated production telemetry across Danish mid-market deployments (n=12).*
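The routing logic above can be sketched as a simple decision function. This is a minimal illustration, not Frobert's production router: the `Task` shape, the 500&nbsp;ms latency threshold, and the model identifier strings are assumptions chosen to mirror the table.

```typescript
// Sketch of complexity/latency-based model routing. The Task shape and the
// thresholds are illustrative assumptions; model names follow the table above.
type Task = {
  complexity: "low" | "medium" | "high";
  latencyBudgetMs: number;
};

type Route = { model: string; fallback: string };

function routeModel(task: Task): Route {
  // Complex triage and multi-step work goes to the primary reasoning engine,
  // with human-in-the-loop (HITL) as the guardrail.
  if (task.complexity === "high") {
    return { model: "claude-sonnet-4.6", fallback: "human-in-the-loop" };
  }
  // Tight latency budgets favour the fast, cheap extraction model,
  // falling back to the primary engine on failure.
  if (task.latencyBudgetMs < 500) {
    return { model: "gemini-1.5-flash", fallback: "claude-sonnet-4.6" };
  }
  // Everything else defaults to the primary engine.
  return { model: "claude-sonnet-4.6", fallback: "human-in-the-loop" };
}
```

In practice a router like this would also weigh token cost and per-tenant compliance constraints (e.g. the Azure/OpenAI environments noted in the table), but the shape of the decision is the same.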

## 2. Catching Hallucinations Before Production (Zod Validation)

A common misconception is that AI agents frequently "hallucinate" bad data into databases. In a properly architected system, an LLM never writes directly to a database. It outputs JSON, which is passed through a strict schema validator (like Zod) in TypeScript.

**Real-world Trace Log Example:**
1. LLM attempts to extract an invoice and sets `"tax_amount": "Unknown"`.
2. Zod Schema throws an error: `Expected Number, received String`.
3. The Agent Framework catches the `ZodError`, prevents the database insertion, and automatically re-prompts the LLM with the error context.
4. The LLM corrects the output to `"tax_amount": 0.00`.
5. Zod validation passes, and the ERP API is called deterministically.

*Result:* **99.8% schema pass rate** in production environments. Zero malformed payloads reach the database layer.

## 3. CI/CD & LLM Evaluations (Evals)

Deploying a prompt update without running Evals is equivalent to deploying code without unit tests. Because LLMs are non-deterministic, Frobert's CI/CD pipelines require a passing Eval suite before code reaches the edge (Cloudflare/Vercel).

**Standard Eval Pipeline:**
- **Exact Match Evals:** Ensuring strict extraction tasks return identical strings across a test set of 100 historical inputs.
- **Semantic Similarity Evals:** Using a smaller model to grade the new prompt's output against a golden dataset (Threshold: > 0.92 similarity).
- **Tool Call Evals:** Verifying that the LLM successfully chooses to call `query_database` instead of guessing the answer when presented with a missing data scenario.
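The exact-match stage of the pipeline can be sketched as a CI gate. This is an illustrative harness, not Frobert's actual eval tooling: the `EvalCase` shape and the pass threshold are assumptions, and a real pipeline would load the 100 historical inputs from a golden dataset.

```typescript
// Sketch of an exact-match eval gate for CI. The EvalCase shape and the pass
// threshold are illustrative assumptions; a real pipeline would load a golden
// dataset of historical inputs and block deployment when the gate fails.
type EvalCase = { input: string; expected: string };

function runExactMatchEvals(
  cases: EvalCase[],
  extract: (input: string) => string,
  passThreshold = 1.0 // strict extraction suites typically require 100% to ship
): { passRate: number; passed: boolean } {
  const hits = cases.filter((c) => extract(c.input) === c.expected).length;
  const passRate = hits / cases.length;
  return { passRate, passed: passRate >= passThreshold };
}
```

The semantic-similarity and tool-call stages follow the same pattern, except the per-case check becomes a grader-model score against a threshold (the &gt; 0.92 figure above) or an assertion on which tool the LLM chose to call.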

## 4. Time-to-Production (TTP) Benchmarks

Because we use standardized agentic frameworks (Vercel AI SDK), strict IaC (Terraform/Wrangler), and edge deployments, our time-to-production (TTP) significantly undercuts traditional enterprise IT integrators.

| Provider Category | Median TTP | Data Source |
|-------------------|------------|-------------|
| Traditional IT Integrators | 16–24 weeks | Industry Standard |
| Internal Enterprise IT | 20–30 weeks | Industry Standard |
| **Frobert (Independent)** | **3–6 weeks** | **Internal Telemetry (n=12)** |

*Note: Datasets represent completed production deployments from 2024–2026. TTP is measured from project kickoff to first production API call processing real user data.*

## Further Reading

- [Where AI actually creates value in Danish companies](https://frobert.dk/hvor-ai-virker-i-danske-virksomheder/)
- [The best AI signal in Danish companies](https://frobert.dk/det-bedste-ai-signal-i-danske-virksomheder/)
- [How to buy AI development in Denmark](https://frobert.dk/ai-konsulent-danmark/)
