tech 3d ago

Google Gemma Gains Speculative Decoding

Mountain View: Google’s Gemma 4 hits 3x faster output.

A 74-million-parameter drafter proposes tokens that Gemma 4 verifies in parallel, cutting latency on consumer hardware, Google said.

Gemma 4 ships under Apache 2.0; drafter weights are open for any developer to deploy.

Sources: Ars Technica

Background

THE BOTTLENECK

Large language models generate one token at a time, and each token requires a full forward pass through every layer of the network. On a 70-billion-parameter model, that pass is memory-bandwidth-bound: the GPU spends most of its time loading weights from HBM, not computing. Generation time therefore grows with the number of output tokens, while the prompt can be processed in parallel.
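A back-of-envelope sketch of that bound, assuming every weight is read from memory once per generated token. The function name and the 2,000 GB/s bandwidth figure are illustrative assumptions, not numbers from Google or from any specific GPU.

```python
def min_seconds_per_token(params_billion: float,
                          bytes_per_param: float,
                          bandwidth_gb_s: float) -> float:
    """Lower bound on decode time per token when weight loading dominates."""
    # Billions of parameters times bytes per parameter gives gigabytes of weights.
    weight_gb = params_billion * bytes_per_param
    return weight_gb / bandwidth_gb_s


# 70B parameters stored as 16-bit values (2 bytes each) is 140 GB of weights.
# The bandwidth below is an assumed value chosen only for illustration.
per_token = min_seconds_per_token(70, 2, 2000)
print(f"{per_token * 1000:.0f} ms per token, "
      f"at most {1 / per_token:.1f} tokens per second")
```

Under these assumptions the ceiling is set by memory traffic alone, which is why getting several tokens out of one weight load is attractive.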

THE TRICK

Speculative decoding runs a small, fast draft model that proposes several tokens ahead. The large model then verifies all of them in a single forward pass, the same pass that would otherwise produce just one token. Accepted drafts are kept; at the first rejection, the large model's own token replaces the draft and decoding continues from there.
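The draft-and-verify loop can be sketched with toy stand-ins for the two models. This is the simplified greedy variant, where a draft is accepted only if it equals the target's top choice. The names `speculative_step`, `generate`, `toy_target` and `toy_draft` are hypothetical, and the per-position target calls only mimic what a real system computes in one batched pass.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to the next token id.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """One draft-and-verify round (greedy variant); returns tokens to append."""
    # 1. The draft model proposes k tokens, one after another.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each position. A real implementation scores all
    #    positions in a single forward pass; this loop only mimics the result.
    out: List[int] = []
    for tok in proposed:
        expected = target(prefix + out)
        if tok != expected:
            out.append(expected)  # first mismatch: keep the target's token, stop
            return out
        out.append(tok)

    # 3. Every draft was accepted, so the pass also yields one extra token.
    out.append(target(prefix + out))
    return out


def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, n_new: int) -> List[int]:
    """Repeat rounds until n_new tokens have been produced."""
    tokens = list(prefix)
    while len(tokens) - len(prefix) < n_new:
        tokens += speculative_step(tokens, draft, target, k)
    return tokens[: len(prefix) + n_new]


# Toy models: the target counts upward modulo 10; the draft agrees
# except after a 4, where it guesses wrong.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10


print(generate([0], toy_draft, toy_target, k=3, n_new=12))
```

Even when the toy draft guesses wrong, the result is the same sequence the target would have produced alone; the draft only changes how many target passes are needed.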

WHY IT'S LOSSLESS

The verification step uses rejection sampling — a statistical technique that guarantees the final output distribution exactly matches what the large model would have produced on its own. Speculative decoding is not an approximation or a quality tradeoff; the published math (Leviathan et al., Google 2022) proves output equivalence.
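A small numerical check of that claim for a single token position, using the accept-or-resample rule from the cited paper: a drafted token x is accepted with probability min(1, p(x)/q(x)), and on rejection a replacement is drawn from the normalized positive part of p minus q. The three-token distributions and the function name are made up for illustration.

```python
from typing import List


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    """Exact distribution of the emitted token under accept-or-resample.

    p is the target model's distribution, q is the draft model's.
    """
    # Probability that x is drafted AND accepted: q(x) * min(1, p(x)/q(x)),
    # which simplifies to min(p(x), q(x)).
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]
    reject_mass = 1.0 - sum(accepted)

    # On rejection, resample from the normalized positive part of (p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:  # p and q are identical, so nothing is ever rejected
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


# Illustrative distributions over a three-token vocabulary (assumed values).
p = [0.5, 0.3, 0.2]  # target
q = [0.2, 0.7, 0.1]  # draft, deliberately a poor match
print(output_distribution(p, q))
```

The printed distribution matches p up to floating-point rounding, even though the draft distribution is quite different. A worse draft lowers the acceptance rate, not the quality of the output.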

THE ACCEPTANCE RATE

Speedup depends on how often the draft model correctly guesses what the target model would say. On easy tokens (common words, predictable syntax), acceptance rates exceed 80%. On hard tokens (proper nouns, novel reasoning), they fall sharply. Real-world workloads average a 2–4x speedup, with code generation benefiting most because of its repetitive structure.
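To see how acceptance rate translates into tokens per verification pass, here is a sketch of the expected-value formula from the cited paper. It makes the simplifying assumption that each drafted token is accepted independently with the same probability; the function name is hypothetical.

```python
def expected_tokens_per_pass(alpha: float, k: int) -> float:
    """Expected tokens emitted per target forward pass.

    alpha: probability that a drafted token is accepted (assumed constant
           and independent across positions, which real text is not).
    k:     number of tokens the draft model proposes per round.
    """
    if alpha == 1.0:
        return float(k + 1)  # every draft accepted, plus the bonus token
    # Geometric series 1 + alpha + alpha**2 + ... + alpha**k.
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.5, 0.8, 0.9):
    print(alpha, round(expected_tokens_per_pass(alpha, k=4), 2))
```

With four drafted tokens, an 80% acceptance rate gives roughly 3.4 tokens per pass under this model, while 50% gives just under 2. This counts tokens per target pass only; wall-clock speedup is lower because running the drafter also takes time.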

WHY EDGE MATTERS

Apache 2.0 means the weights can be embedded in commercial products without licensing fees or usage telemetry. It is the same license that covers most of the Android Open Source Project. Open weights plus 3x decoding make on-device inference economically viable: no per-token API cost, no round trip to a datacenter, no prompt leaving the user's machine.

QUICK CHECK

Speculative decoding speeds up inference without changing what the model outputs.