MyPrivateClaw

Google TurboQuant Compresses KV Cache 6x With Zero Accuracy Loss | Research

Google Research published TurboQuant at ICLR 2026 — a KV cache compression algorithm combining QJL and PolarQuant that achieves 6x memory reduction and up to 8…

Published on MyPrivateClaw

Apr 4, 2026, 4:30 AM UTC

Coverage date

Mar 24, 2026

Last updated

Apr 4, 2026, 5:23 AM UTC

News summary

TurboQuant targets the KV cache — the memory a transformer uses during inference to avoid recomputing attention for already processed tokens. On memory constrained hardware like the Raspberry Pi 5 or an 8GB GPU, the KV cache is often the binding constraint that limits context window size and model scale. The algorithm combines two techniques: QJL (a 1 bit error correction step using Johnson Lindenstrauss transforms that adds zero memory overhead) and PolarQuant (which converts KV vectors into polar coordinates, eliminating the memory overhead that standard quantization carries). Together they achieve 6x KV cache compression with zero accuracy loss on LongBench, RULER, Needle In A Haystack, and ZeroSCROLLS benchmarks across Gemma and Mistral model families. The practical implication: it makes Q4/Q3.5 KV cache quantization safe to use, freeing VRAM for larger models or longer context wind…