# The Fine-Tuning Delta
Every zero-shot model scored below 0.11 on Macro-F1. Fine-tuning on 884 synthetic examples transformed performance, and a 2B model took the top score, beating fine-tuned 3B and 7B models.
## Full Results

All systems were evaluated on the same 52 hand-labeled real test examples. ECE is measured after temperature scaling applied to logprob-derived confidence.
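ECE here refers to the standard equal-width-bin estimator: bin predictions by confidence, then take the weighted average gap between each bin's mean confidence and its accuracy. A minimal pure-Python sketch (the bin count of 10 is an assumption, not necessarily what was used for the numbers below):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins.

    confidences: per-example confidence in [0, 1]
    correct: per-example 0/1 (or bool) indicating a correct prediction
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A model that says 0.9 but is right only half the time in that bin contributes a 0.4 gap weighted by the bin's share of examples.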
| System | Params | Macro-F1 | 95% CI | ECE |
|---|---|---|---|---|
| Zero-shot Gemma 2B | 2B | 0.063 | [0.040, 0.089] | 0.755 |
| Zero-shot Mistral 7B | 7B | 0.095 | [0.063, 0.128] | 0.645 |
| Zero-shot Llama 3B | 3B | 0.108 | [0.029, 0.186] | 0.632 |
| Llama 1B LoRA | 1B | 0.196 | [0.117, 0.274] | 0.154 |
| Gemma 2B CF LoRA (r=8) | 2B | 0.249 | [0.151, 0.336] | 0.129 |
| Mistral 7B CF LoRA | 7B | 0.760 | [0.648, 0.852] | 0.075 |
| Llama 3.2 3B LoRA | 3B | 0.856 | [0.764, 0.930] | 0.094 |
| Gemma 2B Full LoRA | 2B | 0.916 | [0.813, 0.981] | 0.056 |
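With only 52 test examples the CIs are necessarily wide. A percentile-bootstrap sketch of how such intervals can be computed (the resample count and seed are illustrative; the post does not state how its CIs were produced):

```python
import random

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels present."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

def bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for macro-F1: resample examples with replacement."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(macro_f1([y_true[i] for i in idx],
                               [y_pred[i] for i in idx]))
    scores.sort()
    lo = scores[int(alpha / 2 * n_boot)]
    hi = scores[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Resampling the 52 examples (rather than the labels) preserves the coupling between each prediction and its ground truth.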
## Per-Class F1 Across Models
Where each model excels and struggles. All three top adapters nail bot detection (F1 at or near 1.0), but committed_leave is universally the hardest class: its event traces overlap with those of comparison_shopping (both involve browsing without filling fields).
## Calibration: ECE Comparison
Verbalized confidence (the model writing "0.85" in its output) is consistently overconfident. Extracting logprobs directly improves calibration, and temperature scaling pushes ECE below 0.06 for the Gemma 2B adapter: the model's probability distribution is more trustworthy than its words.
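Temperature scaling fits a single scalar T that divides the logits before the softmax, chosen to minimize negative log-likelihood on a held-out set. A minimal pure-Python sketch using grid search (the grid bounds are an assumption; a production version would typically use an optimizer):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; T > 1 flattens, T < 1 sharpens."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]  # subtract max for stability
    total = sum(exps)
    return [e / total for e in exps]

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the T in `grid` minimizing mean NLL on held-out (logits, label) pairs."""
    grid = grid or [0.5 + 0.05 * i for i in range(71)]  # 0.5 .. 4.0
    def nll(T):
        total = 0.0
        for logits, y in zip(logit_rows, labels):
            total -= math.log(softmax(logits, T)[y] + 1e-12)
        return total / len(labels)
    return min(grid, key=nll)
```

An overconfident model (high logit gaps, mediocre accuracy) gets T > 1, pulling its probabilities toward uniform; a model that is always right gets pushed to the sharp end of the grid.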
## Reliability Diagram
The diagonal is perfect calibration. Before calibration, the model clusters predictions in the 0.8-1.0 range regardless of actual accuracy. After temperature scaling, predictions spread across the confidence range and track the diagonal more closely.
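The points plotted on a reliability diagram are per-bin (mean confidence, accuracy) pairs; a small sketch of how they can be computed (10 bins is again an assumption):

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (mean confidence, accuracy, count) per non-empty bin,
    in bin order: the x/y points of a reliability diagram."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, ok))
    points = []
    for b in bins:
        if b:
            points.append((sum(c for c, _ in b) / len(b),
                           sum(ok for _, ok in b) / len(b),
                           len(b)))
    return points
```

Plotting mean confidence on x against accuracy on y, with the y = x diagonal as reference, reproduces the diagram described above; the count lets you size or fade sparse bins.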
## Confusion Matrix
Gemma 2B Full LoRA on 52 real test examples. The main failure mode: comparison_shopping misclassified as committed_leave (3 cases). Both classes involve browsing without form engagement — the distinguishing signal is subtle.
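A labeled confusion matrix is a few lines of counting; a sketch (rows are true classes, columns are predictions, label order is whatever you pass in):

```python
from collections import Counter

def confusion_matrix(y_true, y_pred, labels):
    """counts[i][j] = number of examples with true class labels[i]
    predicted as labels[j]."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]
```

The failure mode above shows up as an off-diagonal count in the comparison_shopping row under the committed_leave column.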
## Training Loss Curves
All models converge by epoch 3 with no overfitting. The Gemma 2B CF adapter (r=8, constrained for Cloudflare) starts with much higher loss and never catches up: for this task, adapter rank matters more than model size.
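The rank-constrained and unconstrained adapters differ in a handful of `LoraConfig` fields. A sketch using Hugging Face `peft`; every value except r=8 for the Cloudflare-constrained adapter is illustrative, since the post does not list its hyperparameters:

```python
from peft import LoraConfig

# Cloudflare-constrained adapter: low rank to fit the platform's size limit.
# Only r=8 comes from the post; alpha, dropout, and target modules are assumed.
cf_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Unconstrained adapter: higher rank and more target modules.
# All values here are assumptions for illustration.
full_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```

Both configs would be passed to `peft.get_peft_model(base_model, config)`; the rank gap (8 vs. 64) is the kind of difference that plausibly explains the loss-curve gap described above.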