System Design

How It Works

Train once on Modal, deploy twice — to Modal for calibrated inference and Cloudflare Workers AI for edge latency.

Train Once, Deploy Twice

The same LoRA adapter runs on two inference backends. Cloudflare Workers AI serves at the edge with sub-200ms latency but only supports verbalized confidence. Modal serves via vLLM with full logprob access for calibrated confidence scores.

Training Data (884 examples)
             │
             ▼
 ┌───────────────────────┐
 │     Modal L4 GPU      │
 │  QLoRA + PEFT + TRL   │
 │    ~30 min, ~$0.40    │
 └───────────┬───────────┘
             │ LoRA Adapter
       ┌─────┴─────┐
       │           │
       ▼           ▼
┌──────────────┐  ┌──────────────┐
│  Modal vLLM  │  │  CF Workers  │
│   Gemma 2B   │  │  AI (edge)   │
│              │  │  Mistral 7B  │
│  logprobs ✓  │  │  logprobs ✗  │
│  calibrated  │  │  verbalized  │
└──────────────┘  └──────────────┘

Method

Training data is generated synthetically using Claude Sonnet. Each API call produces one labeled example: a form-interaction event trace paired with a class label (1-6), a reasoning explanation, and a confidence score. Of the 1,100 examples generated, 884 survived exact deduplication. The training set is intentionally imbalanced to match real-world traffic distributions.
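Exact deduplication of this kind is a few lines of Python. A sketch, where the field names in the toy examples are assumptions, not the real schema:

```python
import json

def dedup_exact(examples):
    """Drop exact duplicates, keeping the first occurrence (order-preserving)."""
    seen = set()
    unique = []
    for ex in examples:
        # Serialize with sorted keys so identical examples hash identically.
        key = json.dumps(ex, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

examples = [
    {"trace": "focus:email -> blur:email", "label": 3, "confidence": 0.8},
    {"trace": "focus:email -> blur:email", "label": 3, "confidence": 0.8},  # exact dup
    {"trace": "focus:card -> submit", "label": 1, "confidence": 0.9},
]
print(len(dedup_exact(examples)))  # → 2
```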

Fine-tuning uses QLoRA — the base model is quantised to 4-bit NF4 precision, and a low-rank adapter (LoRA) is trained on top. Configuration: rank r=16, alpha=32, DoRA enabled, all linear layers targeted. This keeps the adapter under 100MB while capturing the full classification capability.
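A back-of-the-envelope check of the under-100MB claim: LoRA adds r·(d_in + d_out) parameters per targeted linear layer. The layer shapes and block count below are rough assumptions for a ~2B-parameter decoder, not exact Gemma dimensions.

```python
def lora_params(d_in, d_out, r):
    # LoRA adds two low-rank matrices per layer: A (r x d_in) and B (d_out x r).
    return r * (d_in + d_out)

r = 16
# Assumed per-block linear layer shapes (d_in, d_out) for a ~2B decoder.
layers = {
    "q_proj": (2048, 2048),
    "k_proj": (2048, 256),
    "v_proj": (2048, 256),
    "o_proj": (2048, 2048),
    "gate_proj": (2048, 16384),
    "up_proj": (2048, 16384),
    "down_proj": (16384, 2048),
}
n_blocks = 18  # assumed transformer block count
total = n_blocks * sum(lora_params(i, o, r) for i, o in layers.values())
mb = total * 2 / 1e6  # bf16 adapter weights, 2 bytes each
print(f"{total/1e6:.1f}M params ≈ {mb:.0f} MB")  # comfortably under 100 MB
```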

Training runs on a single Modal L4 GPU (24GB VRAM) for approximately 30 minutes per run at a cost of ~$0.40. Three epochs with cosine learning rate schedule, effective batch size of 16, and checkpoint selection by best validation loss.
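The cosine schedule itself is simple to state. A sketch, where the peak learning rate is an assumption (the source specifies only the schedule shape, epoch count, and effective batch size):

```python
import math

def cosine_lr(step, total_steps, lr_max, warmup=0):
    """Cosine decay from lr_max to 0, with optional linear warmup."""
    if step < warmup:
        return lr_max * step / max(warmup, 1)
    progress = (step - warmup) / max(total_steps - warmup, 1)
    return lr_max * 0.5 * (1 + math.cos(math.pi * progress))

# 884 examples at effective batch 16 -> 55 steps/epoch, 3 epochs = 165 steps.
total_steps = 3 * (884 // 16)
lr_max = 2e-4  # assumed peak learning rate, typical for QLoRA runs
print(cosine_lr(0, total_steps, lr_max))            # starts at lr_max
print(cosine_lr(total_steps, total_steps, lr_max))  # decays to 0
```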

The model outputs a leading digit (1-6) on the first line, followed by a JSON object with the class name, reasoning, and confidence. The leading digit is a single token in the Llama/Gemma vocabulary, which enables clean logprob extraction for calibration.
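Parsing that two-part output is straightforward. In this sketch the class name and JSON fields are illustrative, not the real label schema:

```python
import json

def parse_completion(text):
    """Split the model output into (class_id, payload dict)."""
    first_line, _, rest = text.partition("\n")
    class_id = int(first_line.strip())
    assert 1 <= class_id <= 6, "leading digit must be a class id 1-6"
    return class_id, json.loads(rest)

# Hypothetical completion; field names are illustrative.
completion = """3
{"class": "price_shock", "reasoning": "User paused at the total field.", "confidence": 0.72}"""
class_id, payload = parse_completion(completion)
print(class_id, payload["class"])  # → 3 price_shock
```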

Evaluation uses 52 hand-labeled real test examples — never seen during training. These were crafted independently from the synthetic training data to test generalisation from synthetic to real event traces.

Calibration Pipeline

The model's first output token is always a digit 1-6, each mapping to one of six abandonment classes. In the Llama/Gemma tokenizer, each digit is a single token — so its logprob gives a clean per-class probability signal.
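Given the first-token logprobs, a softmax restricted to the six digit tokens yields the per-class distribution. The logprob values below are hypothetical:

```python
import math

def class_probs(digit_logprobs):
    """Renormalise logprobs of the tokens '1'..'6' into a class distribution."""
    # Softmax over just the six digit tokens; the rest of the vocabulary's
    # probability mass is ignored.
    m = max(digit_logprobs.values())
    exp = {d: math.exp(lp - m) for d, lp in digit_logprobs.items()}
    z = sum(exp.values())
    return {d: v / z for d, v in exp.items()}

# Hypothetical first-token logprobs from the inference endpoint.
logprobs = {"1": -4.1, "2": -0.2, "3": -2.5, "4": -5.0, "5": -3.3, "6": -6.2}
probs = class_probs(logprobs)
print(max(probs, key=probs.get), round(probs["2"], 3))  # top class and its probability
```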

Verbalized confidence (the number the model writes in its JSON output) has an expected calibration error (ECE) of 0.145. Extracting the token logprob directly drops ECE to 0.103. Fitting a single temperature scalar T on the validation set brings ECE down to 0.056.
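ECE can be computed with the standard equal-width binning scheme: bucket predictions by confidence, then average the gap between each bucket's mean confidence and its accuracy, weighted by bucket size. A sketch on toy data:

```python
def ece(confidences, correct, n_bins=10):
    """Expected calibration error over equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total = len(confidences)
    err = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        # Size-weighted gap between stated confidence and realised accuracy.
        err += len(b) / total * abs(avg_conf - acc)
    return err

# Toy data: a model that is systematically overconfident.
confs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
correct = [1, 1, 0, 0, 1, 0]
print(round(ece(confs, correct), 3))
```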

The calibration scalar T is fit by minimising negative log-likelihood on 108 validation examples. For the Gemma 2B adapter, T = 0.500: dividing the logits by T < 1 sharpens the distribution, which corrects for the model being underconfident in its raw logprobs.
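A minimal sketch of the fit, with a simple grid search standing in for whatever 1-D optimiser the real pipeline uses. The toy logits are constructed to be underconfident, so the fitted T lands below 1, as in the Gemma case:

```python
import math

def nll(logits_list, labels, T):
    """Mean negative log-likelihood of temperature-scaled predictions."""
    total = 0.0
    for logits, y in zip(logits_list, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)
        logz = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += logz - scaled[y]  # -log softmax(scaled)[y]
    return total / len(labels)

def fit_temperature(logits_list, labels):
    # Grid search over T in [0.1, 3.0]; a real pipeline might use scipy's
    # scalar minimiser instead.
    grid = [t / 100 for t in range(10, 301)]
    return min(grid, key=lambda T: nll(logits_list, labels, T))

# Toy validation set: small-magnitude (underconfident) logits whose argmax
# is usually right, so sharpening (T < 1) lowers the NLL.
logits_list = [[0.5, 0.1, 0.0, 0.0, 0.0, 0.0]] * 9 + [[0.0, 0.5, 0.0, 0.0, 0.0, 0.0]]
labels = [0] * 10
T = fit_temperature(logits_list, labels)
print(T)  # below 1: scaling sharpens the distribution
```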

Why Not Just Use CF Workers AI?

Cloudflare's BYO-LoRA endpoint does not expose token logprobs. The API returns only the generated text and optionally raw bytes. Without logprobs, the only confidence signal is whatever the model writes in its output — which is poorly calibrated.

For latency-sensitive, cost-sensitive paths where approximate confidence is acceptable, CF Workers AI is the right choice. For paths where confidence drives downstream decisions (like choosing a recovery flow), the Modal path with calibrated logprobs is necessary.