// WORK

AVA

MEASURED

A 2B model fine-tuned to beat Llama 3.2 3B on ARC — with a 42 MB adapter on a 4 GB GPU.

// Problem

Small models are assumed to need cloud clusters to be worth anything. AVA asks how much capability a 2B model can reach when the entire loop — training, evaluation, inference — must fit on a single 4 GB laptop GPU. No cloud, no cluster, no budget.

// Constraints

RTX A2000 laptop with 4 GB VRAM; training peaked at 1.81 GB. Maximum 384 training tokens, so long reasoning chains were never seen in training. Windows 11 toolchain, which meant making Triton, Flash-Linear-Attention, and BitsAndBytes coexist — every workaround is published. Every number is reproducible end-to-end on the same laptop.

// Architecture

A 42 MB QLoRA adapter on Qwen 3.5 2B, trained in about 100 minutes, released on HuggingFace with GGUF builds that run via Ollama. Custom Triton kernel work, verifier-RL, external memory. Evaluated on a 17-benchmark, 16,872-task harness with 95% Wilson confidence intervals. The weak spots are stated in the repo, not hidden: math, tool routing, narrative commonsense — all targeted by v3.

// WAR STORIES

The hardest bug wasn't in the code. Training drew power faster than the charger could supply it, so runs had to stop for the laptop to recharge — and early versions had no checkpoint resume, so every pause threw away all progress. Building checkpointing turned training from an endurance contest into something manageable.
Every intervention shipped at once — corpus mix, QLoRA config, base model choice. On hardware this slow, ablating one variable at a time wasn't affordable, and the laptop was also my daily machine for university. One shot, hoping for a drastic change; got one.
What v2 should have had: real tool calling — the corpus had ~55 tool examples against 20K math ones, and the model invokes tools in 0.6% of agentic runs — plus YaRN to push context past the 384-token training window. Both are v3 targets.

// RESULTS

Metric	Value
ARC-Challenge	82.0%
ARC-Easy	92.0%
Llama 3.2 3B-Instruct baseline	78.6%
MMLU 5-shot	59.2%
GSM8K (greedy / k=5)	35.3% / 44.0%
Adapter size	42 MB
Training peak VRAM	1.81 GB
Eval harness	17 benchmarks / 16,872 tasks

// STACK

Python
PyTorch
PEFT/QLoRA
Triton
GGUF