ppmlx.TQ
KV Cache Compression for Apple Silicon
Your Mac has 16 GB. A 9B model takes 10 GB. At 64k context, the KV cache adds another 1 GB and you're swapping. ppmlx.TQ compresses it to 256 MB with near-zero quality loss.
The problem
Tool-heavy agents like Claude Code send 38 tools (~14k tokens) per request. On a 9B model with 64k context, that's 1 GB of KV cache in unified memory. ppmlx.TQ implements Google's TurboQuant natively in MLX to compress it 4–5x.
Quality vs compression
Attention score error relative to the fp16 baseline (lower is better), measured at d=128, 32 heads, 8k tokens, with methodology from the TurboQuant paper.
ppmlx.TQ 3-bit + QJL: 40% lower error than mx.quantize 4-bit, at 11% less memory.
Full comparison data
mx.quantize uses per-group affine quantization (a scale and bias per group of 64 values), adding +0.5 effective bits of overhead; this is why 4-bit shows as 4.5 effective bits and 2-bit as 2.5. ppmlx.TQ uses fixed analytical centroids from the paper with no per-group metadata, so nominal bits = effective bits. QJL adds 1 bit for error correction.
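The effective-bits and cache-size arithmetic above can be reproduced in a few lines. This is a sketch with hypothetical helper names, assuming fp16 (16-bit) scale and bias per group and the table's 128 MB fp16 baseline at 8k tokens:

```python
# Cache-size arithmetic behind the comparison table. Assumes an fp16
# scale and an fp16 bias per group, and the table's fp16 baseline of
# 128 MB at 8k tokens (so 1 GB at 64k).
FP16_MB_AT_8K = 128

def effective_bits(nominal_bits, group_size=64, metadata_bits=32):
    # mx.quantize stores a scale and a bias (16 bits each) per group of 64.
    return nominal_bits + metadata_bits / group_size

def cache_mb(eff_bits, ctx_tokens):
    return FP16_MB_AT_8K * (ctx_tokens / 8192) * eff_bits / 16

print(effective_bits(4))      # -> 4.5 (why mx 4-bit shows as 4.5)
print(cache_mb(4.5, 8192))    # -> 36.0 MB at 8k context
print(cache_mb(4.0, 65536))   # -> 256.0 MB: ppmlx.TQ 3b+QJL at 64k
```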
| Method | Eff. bits | 8k ctx | 64k ctx | Error | Note |
|---|---|---|---|---|---|
| fp16 | 16.0 | 128 MB | 1 GB | 0 | baseline |
| mx.quantize 8b | 8.5 | 68 MB | 544 MB | 0.005 | scale+bias per group |
| mx.quantize 4b | 4.5 | 36 MB | 288 MB | 0.091 | scale+bias per group |
| ppmlx.TQ 3b+QJL | 4.0 | 32 MB | 256 MB | 0.055 | 40% lower error than mx 4b, 11% less memory |
| ppmlx.TQ 4b | 4.0 | 32 MB | 256 MB | 0.078 | better than mx 4b at less memory |
| ppmlx.TQ 3b | 3.0 | 24 MB | 192 MB | 0.152 | fixed Lloyd-Max centroids |
| ppmlx.TQ 2b+QJL | 3.0 | 24 MB | 192 MB | 0.108 | near mx 4b quality at 33% less memory |
| ppmlx.TQ 2b | 2.0 | 16 MB | 128 MB | 0.310 | 21% lower error than mx 2b |
| mx.quantize 2b | 2.5 | 20 MB | 160 MB | 0.394 | scale+bias per group |
PoC measurements (no QJL yet)
Two gaps vs production: (1) no QJL yet; the paper reports roughly 30% error reduction from it. (2) A naive argmin quantizer, where production uses O(1) decision-boundary lookup.
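The quantizer gap can be illustrated in NumPy. Function names here are hypothetical; the 2-bit centroids are the standard tabulated Lloyd-Max values for a unit Gaussian, which is what fixed analytical centroids means in practice:

```python
import numpy as np

# 2-bit Lloyd-Max centroids for a unit Gaussian (standard tabulated values).
CENTROIDS = np.array([-1.510, -0.4528, 0.4528, 1.510])
# Decision boundaries sit at centroid midpoints, so nearest-centroid
# assignment reduces to an interval lookup instead of an argmin scan.
BOUNDARIES = (CENTROIDS[:-1] + CENTROIDS[1:]) / 2

def quantize_argmin(x):
    # PoC style: compare every value against every centroid.
    return np.abs(x[:, None] - CENTROIDS).argmin(axis=1)

def quantize_boundaries(x):
    # Production style: boundary lookup, independent of centroid count
    # (searchsorted here; closed-form for small fixed tables).
    return np.searchsorted(BOUNDARIES, x)

x = np.random.default_rng(0).standard_normal(10_000)
codes = quantize_boundaries(x)  # identical codes, no per-centroid scan
```

Both paths produce the same codes because the midpoint boundaries exactly partition the line into nearest-centroid intervals.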
Quality — relative attention error
| Method | Eff. bits | 64 tok | 1k tok | 4k tok | 8k tok |
|---|---|---|---|---|---|
| mx.quantize 8b | 8.5 | 0.005 | 0.005 | 0.005 | 0.005 |
| mx.quantize 4b | 4.5 | 0.089 | 0.090 | 0.091 | 0.091 |
| ppmlx.TQ 4b | 4.0 | 0.077 | 0.077 | 0.078 | 0.078 |
| ppmlx.TQ 3b | 3.0 | 0.150 | 0.151 | 0.153 | 0.152 |
| ppmlx.TQ 2b | 2.0 | 0.314 | 0.308 | 0.311 | 0.310 |
| mx.quantize 2b | 2.5 | 0.392 | 0.390 | 0.395 | 0.394 |
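The tables don't spell out the exact error definition. One plausible reading, and it is an assumption, is the relative L2 error between attention weight rows computed from exact versus dequantized keys, which can be sketched as:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def relative_attention_error(q, K, K_hat):
    # Relative L2 error between attention weights from exact keys K and
    # dequantized keys K_hat. One plausible reading of the table's
    # metric; the precise definition is an assumption here.
    d = q.shape[-1]
    a = softmax(q @ K.T / np.sqrt(d))
    a_hat = softmax(q @ K_hat.T / np.sqrt(d))
    return np.linalg.norm(a - a_hat) / np.linalg.norm(a)

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 128))
K = rng.standard_normal((1024, 128))
noise = 0.1 * rng.standard_normal(K.shape)  # stand-in for quantization error
err = relative_attention_error(q, K, K + noise)
```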
How it works
Three stages from TurboQuant (Google Research, ICLR 2026), implemented natively in MLX with zero external dependencies.
1. Random rotation: an orthogonal matrix from QR decomposition spreads energy evenly across coordinates, normalizing the KV vector distribution. One-time cost per cache instance (~1ms).
2. Lloyd-Max quantization: each coordinate is quantized against precomputed MSE-optimal centroids. No per-group metadata, no training.
3. QJL error correction: a 1-bit Johnson-Lindenstrauss projection yields unbiased inner product estimates. Critical for attention accuracy.
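The three stages above can be sketched end-to-end in NumPy. This is a sketch, not the library's API: ppmlx.TQ implements the stages in MLX, the 2-bit centroids are the standard Lloyd-Max values for a unit Gaussian, and the projection width m is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 128, 512  # head dim; QJL projection width (illustrative)

# Stage 1 - random rotation: orthogonal matrix via QR of a Gaussian
# matrix. Rotation preserves inner products, so scores are unchanged.
R, _ = np.linalg.qr(rng.standard_normal((d, d)))

# Stage 2 - Lloyd-Max: fixed MSE-optimal centroids for a unit Gaussian
# (2-bit values from standard tables; no per-group scale or bias).
C = np.array([-1.510, -0.4528, 0.4528, 1.510])

# Stage 3 - QJL: keep only the sign of a Gaussian projection (1 bit each).
S = rng.standard_normal((m, d))

def compress_key(k):
    k_rot = R @ k
    codes = np.abs(k_rot[:, None] - C).argmin(axis=1)  # 2-bit codes
    signs = np.sign(S @ k_rot)                         # QJL sketch
    return codes, signs, np.linalg.norm(k_rot)

def qjl_score(q, signs, k_norm):
    # Unbiased inner-product estimate:
    # E[sign(<s,k>) * <s,q>] = sqrt(2/pi) * <q, k/||k||> for s ~ N(0, I).
    return k_norm * np.sqrt(np.pi / 2) / m * (signs @ (S @ (R @ q)))

q = rng.standard_normal(d)
k = rng.standard_normal(d)
codes, signs, k_norm = compress_key(k)
k_hat = C[codes]                      # dequantized (rotated) key
exact = q @ k                         # = (R q) @ (R k), rotation-invariant
approx = qjl_score(q, signs, k_norm)  # sign-sketch estimate
```

The estimator's variance shrinks as 1/m, which is where the paper's variance guarantee comes from; the 2-bit codes carry the bulk of the signal and the sign sketch corrects the residual score error.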
Roadmap
- Algorithm PoC: PolarQuant rotation + Lloyd-Max quantization verified on synthetic data. Attention error within paper bounds.
- MLX ops verified: mx.linalg.qr, mx.quantize, and matmul all available in MLX and verified on the Metal GPU.
- Qwen3 compatibility: pure full attention (no sliding window). Compatible with prefix caching and a custom KV cache.
- TurboQuantKVCache: native _BaseCache subclass with rotation, quantization, and dequantization on fetch.
- QJL error correction: 1-bit Johnson-Lindenstrauss projection for unbiased attention score estimates.
- E2E benchmarks: real model inference on Qwen3 9B. TTFT, tok/s, memory, and quality metrics.
Reference
TurboQuant: Online KV Cache Quantization via Clustering
Google Research · ICLR 2026. Demonstrates near-lossless 2-3 bit KV cache compression using PolarQuant (random rotation + Lloyd-Max quantization) with QJL error correction. Maintains attention accuracy with O(1/d) variance guarantees.
arxiv.org/abs/2504.19874
ppmlx.TQ is an independent, MLX-native implementation. Not affiliated with Google.