// WORK
AVA
MEASUREDA 2B model fine-tuned to beat Llama 3.2 3B on ARC — with a 42 MB adapter on a 4 GB GPU.
// Problem
Small models are assumed to need cloud clusters to be worth anything. AVA asks how much capability a 2B model can reach when the entire loop — training, evaluation, inference — must fit on a single 4 GB laptop GPU. No cloud, no cluster, no budget.
// Constraints
RTX A2000 laptop with 4 GB VRAM; training peaked at 1.81 GB. Maximum 384 training tokens, so long reasoning chains were never seen in training. Windows 11 toolchain, which meant making Triton, Flash-Linear-Attention, and BitsAndBytes coexist — every workaround is published. Every number is reproducible end-to-end on the same laptop.
// Architecture
A 42 MB QLoRA adapter on Qwen 3.5 2B, trained in about 100 minutes, released on HuggingFace with GGUF builds that run via Ollama. Custom Triton kernel work, verifier-RL, external memory. Evaluated on a 17-benchmark, 16,872-task harness with 95% Wilson confidence intervals. The weak spots are stated in the repo, not hidden: math, tool routing, narrative commonsense — all targeted by v3.
// WAR STORIES
- The hardest bug wasn't in the code. Training drew power faster than the charger could supply it, so runs had to stop for the laptop to recharge — and early versions had no checkpoint resume, so every pause threw away all progress. Building checkpointing turned training from an endurance contest into something manageable.
- Every intervention shipped at once — corpus mix, QLoRA config, base model choice. On hardware this slow, ablating one variable at a time wasn't affordable, and the laptop was also my daily machine for university. One shot, hoping for a drastic change; got one.
- What v2 should have had: real tool calling — the corpus had ~55 tool examples against 20K math ones, and the model invokes tools in 0.6% of agentic runs — plus YaRN to push context past the 384-token training window. Both are v3 targets.
// RESULTS
| Metric | Value |
|---|---|
| ARC-Challenge | 82.0% |
| ARC-Easy | 92.0% |
| Llama 3.2 3B-Instruct baseline | 78.6% |
| MMLU 5-shot | 59.2% |
| GSM8K (greedy / k=5) | 35.3% / 44.0% |
| Adapter size | 42 MB |
| Training peak VRAM | 1.81 GB |
| Eval harness | 17 benchmarks / 16,872 tasks |
// STACK
- Python
- PyTorch
- PEFT/QLoRA
- Triton
- GGUF