THE BOTTLENECK
Large language models generate one token at a time, and each token requires a full forward pass through every layer of the network. On a 70-billion-parameter model, that pass is memory-bandwidth-bound: the GPU spends most of its time loading weights from HBM, not computing. Generation latency therefore grows with output length far more than with prompt length, because the prompt is processed in parallel while output tokens are produced one by one.
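A back-of-the-envelope sketch of that bandwidth floor. The 16-bit weights (2 bytes per parameter) and the 2 TB/s memory bandwidth are illustrative assumptions, not figures from this article, and the function name is hypothetical.

```python
def min_ms_per_token(params_billion: float,
                     bytes_per_param: float,
                     bandwidth_tb_s: float) -> float:
    """Lower bound on per-token latency if every weight is read once per token."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_tb_s * 1e12)
    return seconds * 1e3


# 70B parameters, 2 bytes each (assumed), 2 TB/s of memory bandwidth (assumed)
floor_ms = min_ms_per_token(70, 2, 2.0)
print(f"at least {floor_ms:.0f} ms per token")
```

Under these assumptions the weights alone take about 70 ms to stream per token, no matter how fast the arithmetic units are.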
THE TRICK
Speculative decoding runs a small, fast draft model that proposes several tokens ahead. The large model then verifies all of them in a single forward pass — the same pass that would otherwise produce just one token. Accepted drafts are kept; the first rejection is corrected and decoding continues from there.
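The draft-then-verify loop can be sketched with toy models. This is a minimal illustration assuming greedy decoding; `toy_target`, `toy_draft`, and the helper names are hypothetical stand-ins, and the per-prefix calls to the target simulate what a real system would get from one batched forward pass. The extra token emitted when every draft is accepted is a detail taken from the published method, not from the paragraph above.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     tokens: List[int], k: int) -> List[int]:
    """One draft-then-verify round (greedy). Returns the newly emitted tokens."""
    # 1. The draft model proposes k tokens, one after another.
    proposal: List[int] = []
    for _ in range(k):
        proposal.append(draft(tokens + proposal))

    # 2. The target checks each position (one batched pass in a real system).
    accepted: List[int] = []
    for guess in proposal:
        expected = target(tokens + accepted)
        if guess != expected:
            accepted.append(expected)  # correct the first rejection, then stop
            return accepted
        accepted.append(guess)

    # 3. Every draft was accepted: the same pass yields one more token.
    accepted.append(target(tokens + accepted))
    return accepted


def generate(target: NextToken, draft: NextToken,
             prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n_new:
        tokens.extend(speculative_step(target, draft, tokens, k))
    return tokens[: len(prompt) + n_new]


def greedy(target: NextToken, prompt: List[int], n_new: int) -> List[int]:
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(target(tokens))
    return tokens


# Toy models: the draft agrees with the target except at every third position.
def toy_target(seq: List[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def toy_draft(seq: List[int]) -> int:
    return toy_target(seq) if len(seq) % 3 else 0


spec_out = generate(toy_target, toy_draft, [1], 10)
plain_out = greedy(toy_target, [1], 10)
print(spec_out == plain_out)
```

Each round emits between 1 and k + 1 tokens, and the result matches what the target would have produced decoding alone.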
WHY IT'S LOSSLESS
The verification step uses rejection sampling — a statistical technique that guarantees the final output distribution exactly matches what the large model would have produced on its own. Speculative decoding is not an approximation or a quality tradeoff; the published math (Leviathan et al., Google 2022) proves output equivalence.
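The equivalence can be checked by hand for a single token. The sketch below assumes the accept/reject rule from the cited paper: accept a drafted token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized positive part of p - q. The function name and the three-token distributions are made up for illustration.

```python
from typing import List


def speculative_output_dist(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the token emitted by one accept/reject step.

    p: target-model probabilities, q: draft-model probabilities (same vocabulary).
    """
    # Probability of accepting each token, given that the draft proposed it.
    accept = [min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    p_reject = 1.0 - sum(qi * ai for qi, ai in zip(q, accept))

    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    z = sum(residual)
    resample = [r / z if z > 0 else 0.0 for r in residual]

    # Emitted token = accepted draft token, or the resampled token.
    return [qi * ai + p_reject * ri for qi, ai, ri in zip(q, accept, resample)]


target_p = [0.6, 0.3, 0.1]  # illustrative target distribution
draft_q = [0.3, 0.5, 0.2]   # illustrative draft distribution
out = speculative_output_dist(target_p, draft_q)
print(all(abs(o - t) < 1e-9 for o, t in zip(out, target_p)))
```

The draft distribution only affects how often tokens are accepted, not which distribution the emitted tokens follow.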
THE ACCEPTANCE RATE
Speedup depends on how often the draft model's guess matches what the target model would say. On easy tokens (common words, predictable syntax) acceptance rates exceed 80%. On hard tokens (proper nouns, novel reasoning) they fall sharply. End-to-end speedups on real-world workloads average 2–4x, with code generation benefiting most because of its repetitive structure.
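The link between acceptance rate and speedup can be made concrete with the expected-value formula from the cited paper. It assumes each drafted token is accepted independently with the same probability `alpha`, a simplification; the draft length `k`, the relative draft cost, and the function names here are illustrative assumptions.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model pass.

    alpha: per-token acceptance rate, k: number of drafted tokens per round.
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


def expected_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Expected wall-clock speedup over plain decoding.

    draft_cost: time of one draft pass divided by time of one target pass.
    """
    return expected_tokens_per_pass(alpha, k) / (k * draft_cost + 1)


# 80% acceptance, 4 drafted tokens, draft model 20x cheaper (all assumed)
tokens = expected_tokens_per_pass(0.8, 4)
speedup = expected_speedup(0.8, 4, 0.05)
print(f"{tokens:.2f} tokens per pass, {speedup:.2f}x speedup")
```

With these numbers the model predicts roughly 3.4 tokens per target pass and a speedup near 2.8x. Dropping `alpha` to 0.5 pulls the expected tokens per pass below 2, which is why hard tokens erode the gain.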
WHY EDGE MATTERS
Apache 2.0 means the weights can be embedded in commercial products without licensing fees or usage telemetry. It is the same license that covers most of Android's open-source platform code. Open weights plus 3x decoding makes on-device inference economically viable: no per-token API cost, no round trip to a datacenter, no prompt leaving the user's machine.
QUICK CHECK
Speculative decoding speeds up inference without changing what the model outputs.