Google TurboQuant Compresses KV Cache 6x With Zero Accuracy Loss | Research
Google Research published TurboQuant at ICLR 2026 — a KV cache compression algorithm combining QJL and PolarQuant that achieves 6x memory reduction and up to 8…
Published on MyPrivateClaw
Apr 4, 2026, 4:30 AM UTC
Coverage date
Mar 24, 2026
Last updated
Apr 4, 2026, 5:23 AM UTC
News summary
TurboQuant targets the KV cache — the memory a transformer uses during inference to avoid recomputing attention for already processed tokens. On memory constrained hardware like the Raspberry Pi 5 or an 8GB GPU, the KV cache is often the binding constraint that limits context window size and model scale. The algorithm combines two techniques: QJL (a 1 bit error correction step using Johnson Lindenstrauss transforms that adds zero memory overhead) and PolarQuant (which converts KV vectors into polar coordinates, eliminating the memory overhead that standard quantization carries). Together they achieve 6x KV cache compression with zero accuracy loss on LongBench, RULER, Needle In A Haystack, and ZeroSCROLLS benchmarks across Gemma and Mistral model families. The practical implication: it makes Q4/Q3.5 KV cache quantization safe to use, freeing VRAM for larger models or longer context wind…