Experimental · Google Research, ICLR 2026

ppmlx.TQ

KV Cache Compression for Apple Silicon

Your Mac has 16 GB. A 9B model takes 10 GB. At 64k context, KV cache adds another 1 GB — and you're swapping. ppmlx.TQ compresses it to 256 MB with near-zero quality loss.

The problem

Tool-heavy agents like Claude Code send 38 tools (~14k tokens) per request. On a 9B model with 64k context, that's 1 GB of KV cache in unified memory. ppmlx.TQ implements Google's TurboQuant natively in MLX to compress it 4–5x.

Quality vs compression

Attention score error relative to the fp16 baseline (lower is better), measured at d=128, 32 heads, 8k tokens; setup follows the TurboQuant paper (referenced below).

ppmlx.TQ 3-bit + QJL: 40% lower error than mx.quantize 4-bit, at 11% less memory.

Full comparison data

mx.quantize uses adaptive per-group affine quantization with scale + bias per group of 64 values, adding +0.5 effective bits of overhead. This is why 4-bit shows as 4.5 effective bits and 2-bit as 2.5. ppmlx.TQ uses fixed analytical centroids from the paper — no per-group metadata, so nominal bits = effective bits. QJL adds 1 bit for error correction.
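The effective-bit and memory figures in the table below follow directly from this bookkeeping. A quick sanity check (a sketch; `kv_bytes` and `effective_bits` are illustrative helpers matching the benchmark setup of d=128, 32 heads, K and V both cached, not project APIs):

```python
def kv_bytes(tokens, heads=32, d=128, bits=16.0):
    """Bytes for a KV cache at the given effective bit width."""
    values = tokens * heads * d * 2        # x2: keys and values
    return values * bits / 8

def effective_bits(nominal, group=64, meta_bits=32):
    """mx.quantize stores one fp16 scale + one fp16 bias per group of
    64 values: 32 extra bits / 64 values = +0.5 effective bits."""
    return nominal + meta_bits / group

print(kv_bytes(8192) / 2**20)    # 128.0 -> the fp16 8k-ctx cell (MB)
print(kv_bytes(65536) / 2**30)   # 1.0   -> the fp16 64k-ctx cell (GB)
print(effective_bits(4))         # 4.5
print(effective_bits(2))         # 2.5
```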

Method             Eff. bits   8k ctx   64k ctx   Error   Note
fp16                    16.0   128 MB      1 GB   0       baseline
mx.quantize 8b           8.5    68 MB    544 MB   0.005   scale+bias per group
mx.quantize 4b           4.5    36 MB    288 MB   0.091   scale+bias per group
ppmlx.TQ 3b+QJL          4.0    32 MB    256 MB   0.055   40% lower error than mx 4b, 11% less memory
ppmlx.TQ 4b              4.0    32 MB    256 MB   0.078   better than mx 4b at less memory
ppmlx.TQ 3b              3.0    24 MB    192 MB   0.152   fixed Lloyd-Max centroids
ppmlx.TQ 2b+QJL          3.0    24 MB    192 MB   0.108   near mx 4b quality at 33% less memory
ppmlx.TQ 2b              2.0    16 MB    128 MB   0.310   21% lower error than mx 2b
mx.quantize 2b           2.5    20 MB    160 MB   0.394   scale+bias per group
PoC measurements (no QJL yet)

Two gaps vs production: (1) No QJL yet — ~30% error reduction expected per paper. (2) Naive argmin quantizer — production uses O(1) decision boundaries.
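The second gap is easy to illustrate: with sorted fixed centroids, the nearest-centroid index can be found by binary search over the midpoints between adjacent centroids, rather than by computing the distance to every centroid. A NumPy sketch (the centroid values are the standard 8-level Lloyd-Max table for a unit Gaussian, not necessarily the PoC's exact table):

```python
import numpy as np

# 3-bit (8-level) Lloyd-Max centroids for a unit Gaussian (rounded).
centroids = np.array([-2.152, -1.344, -0.756, -0.245,
                       0.245,  0.756,  1.344,  2.152])
boundaries = (centroids[:-1] + centroids[1:]) / 2   # decision midpoints

x = np.random.default_rng(0).normal(size=100_000)

# Naive: O(k) per element -- distance to every centroid.
naive = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)

# Boundary-based: binary search over the k-1 midpoints.
fast = np.searchsorted(boundaries, x)

assert np.array_equal(naive, fast)   # identical codes, far less work
```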

Quality — relative attention error

Method             Eff. bits   64 tok   1k tok   4k tok   8k tok
mx.quantize 8b           8.5    0.005    0.005    0.005    0.005
mx.quantize 4b           4.5    0.089    0.090    0.091    0.091
ppmlx.TQ 4b              4.0    0.077    0.077    0.078    0.078
ppmlx.TQ 3b              3.0    0.150    0.151    0.153    0.152
ppmlx.TQ 2b              2.0    0.314    0.308    0.311    0.310
mx.quantize 2b           2.5    0.392    0.390    0.395    0.394

How it works

Three stages from TurboQuant (Google Research, ICLR 2026), implemented natively in MLX with no dependencies beyond MLX itself.

1. Random Rotation

A random orthogonal matrix, obtained via QR decomposition, normalizes the KV vector distribution so coordinates are approximately Gaussian. One-time cost per cache instance (~1 ms).
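A minimal sketch of this stage, with NumPy standing in for the corresponding MLX ops (mx.linalg.qr, matmul):

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)

# QR of a Gaussian matrix yields a random orthogonal matrix Q;
# fixing column signs by diag(R) makes it uniformly (Haar) distributed.
Q, R = np.linalg.qr(rng.normal(size=(d, d)))
Q *= np.sign(np.diag(R))

k = rng.normal(size=(8, d))      # a few key vectors
k_rot = k @ Q                    # rotate before quantization

# Rotation is orthogonal: norms and inner products are preserved,
# so it can be undone exactly at dequantization time.
assert np.allclose(np.linalg.norm(k_rot, axis=1),
                   np.linalg.norm(k, axis=1))
```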

2. Lloyd-Max Quantization

Each coordinate quantized with precomputed MSE-optimal centroids. No per-group metadata, no training.
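A round-trip sketch in NumPy, using the standard 2-bit Lloyd-Max table for a unit Gaussian (illustrative; not necessarily the exact centroids ppmlx.TQ ships):

```python
import numpy as np

# After the random rotation, coordinates are approximately Gaussian, so
# one fixed centroid table serves every vector -- no per-group scale/bias.
# 2-bit (4-level) Lloyd-Max centroids for a unit Gaussian:
centroids = np.array([-1.510, -0.4528, 0.4528, 1.510])

def quantize(x):
    # Nearest-centroid index per coordinate (naive argmin version).
    return np.abs(x[..., None] - centroids).argmin(axis=-1).astype(np.uint8)

def dequantize(codes):
    return centroids[codes]

x = np.random.default_rng(1).normal(size=(64, 128))
codes = quantize(x)                # 2 bits of payload per coordinate
x_hat = dequantize(codes)
mse = np.mean((x - x_hat) ** 2)
print(mse)                         # close to the theoretical 0.1175
```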

3. QJL Error Correction

A 1-bit Johnson-Lindenstrauss projection produces unbiased inner-product estimates. Critical for attention accuracy.
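The idea can be sketched in NumPy (constants follow the QJL construction; `m` is oversized here purely to shrink demo variance, whereas in the cache it is sized so QJL adds about 1 bit per stored value):

```python
import numpy as np

# Keep only sign(S k) plus ||k||, yet recover an unbiased estimate of
# <q, k>: for Gaussian S, E[(Sq)_i * sign((Sk)_i)] equals
# sqrt(2/pi) * <q, k> / ||k||, which gives the estimator below.
rng = np.random.default_rng(0)
d, m = 128, 8192

q = rng.normal(size=d)
k = rng.normal(size=d)
S = rng.normal(size=(m, d))

k_bits = np.sign(S @ k)            # 1 bit per projection...
k_norm = np.linalg.norm(k)         # ...plus one scalar norm

est = np.sqrt(np.pi / 2) / m * k_norm * np.dot(S @ q, k_bits)
print(est, q @ k)                  # estimate tracks the true inner product
```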

Roadmap

Done: Algorithm PoC

PolarQuant rotation + Lloyd-Max quantization verified on synthetic data. Attention error within paper bounds.

Done: MLX ops verified

mx.linalg.qr, mx.quantize, matmul on Metal GPU — all ops available in MLX.

Done: Qwen3 compatibility

Pure full attention (no sliding window). Compatible with prefix caching and custom KV cache.

In progress: TurboQuantKVCache

Native _BaseCache subclass with rotation, quantization, and dequantization on fetch.

Next: QJL error correction

1-bit Johnson-Lindenstrauss projection for unbiased attention score estimates.

Next: E2E benchmarks

Real model inference on Qwen3 9B. TTFT, tok/s, memory, and quality metrics.

Reference

TurboQuant: Online KV Cache Quantization via Clustering
Google Research · ICLR 2026

Demonstrates near-lossless 2-3 bit KV cache compression using PolarQuant (random rotation + Lloyd-Max quantization) with QJL error correction. Maintains attention accuracy with O(1/d) variance guarantees.

arxiv.org/abs/2504.19874

ppmlx.TQ is an independent, MLX-native implementation. Not affiliated with Google.