Experimental · Google Research, ICLR 2026

ppmlx.TQ

KV Cache Compression for Apple Silicon

Your Mac has 16 GB. A 9B model takes 10 GB. At 64k context, KV cache adds another 1 GB — and you're swapping. ppmlx.TQ compresses it to 256 MB with near-zero quality loss.

The problem

Tool-heavy agents like Claude Code send 38 tools (~14k tokens) per request. On a 9B model with 64k context, that's 1 GB of KV cache in unified memory. ppmlx.TQ implements Google's TurboQuant natively in MLX to compress it 4–5x.

Quality vs compression

Attention score error relative to the fp16 baseline (lower is better), measured at d=128, 32 heads, 8k tokens; setup follows the TurboQuant paper (referenced below).

ppmlx.TQ 3-bit + QJL: 40% lower error than mx.quantize 4-bit, at 11% less memory.

Full comparison data

mx.quantize uses adaptive per-group affine quantization with scale + bias per group of 64 values, adding +0.5 effective bits of overhead. This is why 4-bit shows as 4.5 effective bits and 2-bit as 2.5. ppmlx.TQ uses fixed analytical centroids from the paper — no per-group metadata, so nominal bits = effective bits. QJL adds 1 bit for error correction.
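The effective-bit and memory figures in the table below follow directly from this bookkeeping. A quick sanity check (a sketch; `kv_bytes` and `effective_bits` are illustrative helpers matching the benchmark setup of d=128, 32 heads, K and V both cached, not project APIs):

```python
def kv_bytes(tokens, heads=32, d=128, bits=16.0):
    """Bytes for a KV cache at the given effective bit width."""
    values = tokens * heads * d * 2        # x2: keys and values
    return values * bits / 8

def effective_bits(nominal, group=64, meta_bits=32):
    """mx.quantize stores one fp16 scale + one fp16 bias per group of
    64 values: 32 extra bits / 64 values = +0.5 effective bits."""
    return nominal + meta_bits / group

print(kv_bytes(8192) / 2**20)    # 128.0 -> the fp16 8k-ctx cell (MB)
print(kv_bytes(65536) / 2**30)   # 1.0   -> the fp16 64k-ctx cell (GB)
print(effective_bits(4))         # 4.5
print(effective_bits(2))         # 2.5
```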

Method             Eff. bits   8k ctx   64k ctx   Error   Note
fp16                    16.0   128 MB      1 GB   0       baseline
mx.quantize 8b           8.5    68 MB    544 MB   0.005   scale+bias per group
mx.quantize 4b           4.5    36 MB    288 MB   0.091   scale+bias per group
ppmlx.TQ 3b+QJL          4.0    32 MB    256 MB   0.055   40% lower error than mx 4b, 11% less memory
ppmlx.TQ 4b              4.0    32 MB    256 MB   0.078   better than mx 4b at less memory
ppmlx.TQ 3b              3.0    24 MB    192 MB   0.152   fixed Lloyd-Max centroids
ppmlx.TQ 2b+QJL          3.0    24 MB    192 MB   0.108   near mx 4b quality at 33% less memory
ppmlx.TQ 2b              2.0    16 MB    128 MB   0.310   21% lower error than mx 2b
mx.quantize 2b           2.5    20 MB    160 MB   0.394   scale+bias per group
PoC measurements (no QJL yet)

Two gaps vs production: (1) No QJL yet — ~30% error reduction expected per paper. (2) Naive argmin quantizer — production uses O(1) decision boundaries.
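The second gap is easy to illustrate: with sorted fixed centroids, the nearest-centroid index can be found by binary search over the midpoints between adjacent centroids, rather than by computing the distance to every centroid. A NumPy sketch (the centroid values are the standard 8-level Lloyd-Max table for a unit Gaussian, not necessarily the PoC's exact table):

```python
import numpy as np

# 3-bit (8-level) Lloyd-Max centroids for a unit Gaussian (rounded).
centroids = np.array([-2.152, -1.344, -0.756, -0.245,
                       0.245,  0.756,  1.344,  2.152])
boundaries = (centroids[:-1] + centroids[1:]) / 2   # decision midpoints

x = np.random.default_rng(0).normal(size=100_000)

# Naive: O(k) per element -- distance to every centroid.
naive = np.abs(x[:, None] - centroids[None, :]).argmin(axis=1)

# Boundary-based: binary search over the k-1 midpoints.
fast = np.searchsorted(boundaries, x)

assert np.array_equal(naive, fast)   # identical codes, far less work
```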

Quality — relative attention error

Method             Eff. bits   64 tok   1k tok   4k tok   8k tok
mx.quantize 8b           8.5    0.005    0.005    0.005    0.005
mx.quantize 4b           4.5    0.089    0.090    0.091    0.091
ppmlx.TQ 4b              4.0    0.077    0.077    0.078    0.078
ppmlx.TQ 3b              3.0    0.150    0.151    0.153    0.152
ppmlx.TQ 2b              2.0    0.314    0.308    0.311    0.310
mx.quantize 2b           2.5    0.392    0.390    0.395    0.394

How it works

Three stages from TurboQuant (Google Research, ICLR 2026), implemented natively in MLX with no dependencies beyond MLX itself.

1. Random Rotation

A random orthogonal matrix, obtained via QR decomposition, normalizes the KV vector distribution so coordinates are approximately Gaussian. One-time cost per cache instance (~1 ms).
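A minimal sketch of this stage, with NumPy standing in for the corresponding MLX ops (mx.linalg.qr, matmul):

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)

# QR of a Gaussian matrix yields a random orthogonal matrix Q;
# fixing column signs by diag(R) makes it uniformly (Haar) distributed.
Q, R = np.linalg.qr(rng.normal(size=(d, d)))
Q *= np.sign(np.diag(R))

k = rng.normal(size=(8, d))      # a few key vectors
k_rot = k @ Q                    # rotate before quantization

# Rotation is orthogonal: norms and inner products are preserved,
# so it can be undone exactly at dequantization time.
assert np.allclose(np.linalg.norm(k_rot, axis=1),
                   np.linalg.norm(k, axis=1))
```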

2. Lloyd-Max Quantization

Each coordinate quantized with precomputed MSE-optimal centroids. No per-group metadata, no training.
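A round-trip sketch in NumPy, using the standard 2-bit Lloyd-Max table for a unit Gaussian (illustrative; not necessarily the exact centroids ppmlx.TQ ships):

```python
import numpy as np

# After the random rotation, coordinates are approximately Gaussian, so
# one fixed centroid table serves every vector -- no per-group scale/bias.
# 2-bit (4-level) Lloyd-Max centroids for a unit Gaussian:
centroids = np.array([-1.510, -0.4528, 0.4528, 1.510])

def quantize(x):
    # Nearest-centroid index per coordinate (naive argmin version).
    return np.abs(x[..., None] - centroids).argmin(axis=-1).astype(np.uint8)

def dequantize(codes):
    return centroids[codes]

x = np.random.default_rng(1).normal(size=(64, 128))
codes = quantize(x)                # 2 bits of payload per coordinate
x_hat = dequantize(codes)
mse = np.mean((x - x_hat) ** 2)
print(mse)                         # close to the theoretical 0.1175
```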

3. QJL Error Correction

A 1-bit Johnson-Lindenstrauss projection produces unbiased inner-product estimates. Critical for attention accuracy.
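The idea can be sketched in NumPy (constants follow the QJL construction; `m` is oversized here purely to shrink demo variance, whereas in the cache it is sized so QJL adds about 1 bit per stored value):

```python
import numpy as np

# Keep only sign(S k) plus ||k||, yet recover an unbiased estimate of
# <q, k>: for Gaussian S, E[(Sq)_i * sign((Sk)_i)] equals
# sqrt(2/pi) * <q, k> / ||k||, which gives the estimator below.
rng = np.random.default_rng(0)
d, m = 128, 8192

q = rng.normal(size=d)
k = rng.normal(size=d)
S = rng.normal(size=(m, d))

k_bits = np.sign(S @ k)            # 1 bit per projection...
k_norm = np.linalg.norm(k)         # ...plus one scalar norm

est = np.sqrt(np.pi / 2) / m * k_norm * np.dot(S @ q, k_bits)
print(est, q @ k)                  # estimate tracks the true inner product
```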

Roadmap

Done: Algorithm PoC

PolarQuant rotation + Lloyd-Max quantization verified on synthetic data. Attention error within paper bounds.

Done: MLX ops verified

mx.linalg.qr, mx.quantize, matmul on Metal GPU — all ops available in MLX.

Done: Qwen3 compatibility

Pure full attention (no sliding window). Compatible with prefix caching and custom KV cache.

In progress: TurboQuantKVCache

Native _BaseCache subclass with rotation, quantization, and dequantization on fetch.

Next: QJL error correction

1-bit Johnson-Lindenstrauss projection for unbiased attention score estimates.

Next: E2E benchmarks

Real model inference on Qwen3 9B. TTFT, tok/s, memory, and quality metrics.

Reference

TurboQuant: Online KV Cache Quantization via Clustering
Google Research · ICLR 2026

Demonstrates near-lossless 2-3 bit KV cache compression using PolarQuant (random rotation + Lloyd-Max quantization) with QJL error correction. Maintains attention accuracy with O(1/d) variance guarantees.

arxiv.org/abs/2504.19874

ppmlx.TQ is an independent, MLX-native implementation. Not affiliated with Google.