Train Once, Deploy Twice
The same LoRA adapter runs on two inference backends. Cloudflare Workers AI serves at the edge with sub-200ms latency but only supports verbalized confidence. Modal serves via vLLM with full logprob access for calibrated confidence scores.
Method
Training data is generated synthetically using Claude Sonnet. Each API call produces one labeled example: a form-interaction event trace paired with a class label (1-6), a reasoning explanation, and a confidence score. Of 1,100 generated examples, 884 survived exact deduplication. The training set is intentionally imbalanced to match real-world traffic distributions.
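The exact-deduplication step can be sketched as follows. This is a minimal sketch, not the actual pipeline code; the field names `trace` and `label` are illustrative.

```python
import json

def dedupe_exact(examples):
    """Drop byte-for-byte duplicate examples, keeping the first occurrence."""
    seen = set()
    unique = []
    for ex in examples:
        key = json.dumps(ex, sort_keys=True)  # canonical form for hashing
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique

# Hypothetical examples: the second is an exact duplicate and is dropped
examples = [
    {"trace": "focus->blur->exit", "label": 3},
    {"trace": "focus->blur->exit", "label": 3},
    {"trace": "submit_error->retry->exit", "label": 5},
]
print(len(dedupe_exact(examples)))  # 2
```

Exact deduplication only catches byte-identical duplicates; near-duplicates with trivially different wording would survive this filter.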
Fine-tuning uses QLoRA — the base model is quantised to 4-bit NF4 precision, and a low-rank adapter (LoRA) is trained on top. Configuration: rank r=16, alpha=32, DoRA enabled, all linear layers targeted. This keeps the adapter under 100MB while capturing the full classification capability.
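Assuming the Hugging Face transformers + peft stack (which the post does not name explicitly), the quantisation and adapter settings described above might be expressed as:

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantisation of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Low-rank adapter trained on top of the quantised base
lora_config = LoraConfig(
    r=16,                         # rank
    lora_alpha=32,                # scaling factor
    use_dora=True,                # weight-decomposed LoRA (DoRA)
    target_modules="all-linear",  # adapt every linear layer
    task_type="CAUSAL_LM",
)
```

This is a configuration sketch, not the training script; trainer setup, dataset formatting, and checkpointing are omitted.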
Training runs on a single Modal L4 GPU (24GB VRAM) for approximately 30 minutes per run at a cost of ~$0.40. Three epochs with cosine learning rate schedule, effective batch size of 16, and checkpoint selection by best validation loss.
The model outputs a leading digit (1-6) on the first line, followed by a JSON object with the class name, reasoning, and confidence. The leading digit is a single token in the Llama/Gemma vocabulary, which enables clean logprob extraction for calibration.
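A parser for this two-part output format might look like the following. The class name `price_shock` and the field names are illustrative, chosen to match the description above; real outputs depend on the label schema.

```python
import json

def parse_prediction(text):
    """Split model output into (leading class digit, JSON payload)."""
    first_line, _, rest = text.partition("\n")
    digit = int(first_line.strip())
    if not 1 <= digit <= 6:
        raise ValueError("leading digit must be a class id 1-6")
    payload = json.loads(rest)
    return digit, payload

raw = '3\n{"class": "price_shock", "reasoning": "user left at totals step", "confidence": 0.82}'
digit, payload = parse_prediction(raw)
print(digit, payload["class"])  # 3 price_shock
```

Keeping the digit on its own line means the JSON can be malformed without losing the class prediction, and the digit's first-token position is what makes the logprob extraction clean.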
Evaluation uses 52 hand-labeled real test examples — never seen during training. These were crafted independently of the synthetic training data to test generalisation from synthetic to real event traces.
Calibration Pipeline
The model's first output token is always a digit 1-6, each mapping to one of six abandonment classes. In the Llama/Gemma tokenizer, each digit is a single token — so its logprob gives a clean per-class probability signal.
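Given the logprobs of the six digit tokens at the first output position, a per-class distribution can be recovered by renormalising over just those six tokens. A sketch with hypothetical logprob values:

```python
import math

def class_probs(digit_logprobs):
    """Renormalise logprobs of tokens '1'..'6' into a distribution over the six classes."""
    # The six digit tokens don't exhaust the vocabulary, so their raw
    # probabilities sum to slightly less than 1; renormalising fixes that.
    exps = [math.exp(lp) for lp in digit_logprobs]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical first-position logprobs for tokens "1".."6"
logprobs = [-4.2, -3.1, -0.3, -2.8, -5.0, -3.9]
probs = class_probs(logprobs)
print(probs.index(max(probs)) + 1)  # 3 (the argmax class)
```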
Verbalized confidence (the number the model writes in its JSON output) has an expected calibration error (ECE) of 0.145. Extracting the first-token logprob directly drops ECE to 0.103. Fitting a single temperature scalar T on the validation set brings it down to 0.056.
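ECE bins predictions by confidence and takes the weighted average of the gap between each bin's mean confidence and its accuracy. A minimal sketch of the standard equal-width-bin formulation:

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |accuracy - confidence| across equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

# Toy data: five predictions at 0.8 confidence, four of them correct,
# so the bin is perfectly calibrated and ECE is 0
print(expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0]))  # 0.0
```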
The calibration scalar T is fit by minimising negative log-likelihood on 108 validation examples. For the Gemma 2B adapter, T = 0.500: the model is underconfident in its raw logprobs, and dividing logits by T < 1 sharpens the distribution.
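The single-parameter fit can be sketched with a coarse grid search over T, minimising NLL on held-out (logits, label) pairs. This is a sketch, not the actual fitting code; in practice one would use a proper optimiser (e.g. scipy or LBFGS), and the logits here are made up.

```python
import math

def nll(logits_batch, labels, T):
    """Mean negative log-likelihood of true labels under temperature-scaled softmax."""
    total = 0.0
    for logits, y in zip(logits_batch, labels):
        scaled = [z / T for z in logits]
        m = max(scaled)  # subtract max for numerical stability
        log_z = m + math.log(sum(math.exp(z - m) for z in scaled))
        total += -(scaled[y] - log_z)
    return total / len(labels)

def fit_temperature(logits_batch, labels):
    """Pick the T in [0.10, 3.00] that minimises validation NLL."""
    grid = [t / 100 for t in range(10, 301)]
    return min(grid, key=lambda T: nll(logits_batch, labels, T))

# Toy validation set: 6-way logits, true class indices 0-based
logits_batch = [
    [0.5, 0.2, 2.0, 0.1, 0.0, 0.3],
    [1.8, 0.4, 0.2, 0.0, 0.1, 0.2],
    [0.3, 0.1, 0.2, 1.5, 0.4, 0.0],
]
labels = [2, 0, 3]
T = fit_temperature(logits_batch, labels)
# All three argmax predictions are correct, so the fit drives T to the
# lower bound of the grid: sharpening always helps on this toy data.
```

On real validation data with some wrong predictions, the NLL has an interior minimum and the fitted T balances sharpening against overconfidence.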
Why Not Just Use CF Workers AI?
Cloudflare's BYO-LoRA endpoint does not expose token logprobs. The API returns only the generated text and optionally raw bytes. Without logprobs, the only confidence signal is whatever the model writes in its output — which is poorly calibrated.
For latency-sensitive, cost-sensitive paths where approximate confidence is acceptable, CF Workers AI is the right choice. For paths where confidence drives downstream decisions (like choosing a recovery flow), the Modal path with calibrated logprobs is necessary.