Quantization Levels

Quantization reduces model precision to decrease file size and increase inference speed, with a trade-off in output quality.

Overview

Level      Type      Bits/Weight   Est. Quality   Size (7B)   Speed
EXTREME    Q2_K      2.63          ~60%           ~2.5 GB     Fastest
TINY       Q3_K_S    3.50          ~68%           ~3.2 GB     Very fast
SMALL      Q3_K_M    3.91          ~72%           ~3.6 GB     Fast
COMPACT    Q4_K_M    4.85          ~80%           ~4.4 GB     Fast
BALANCED   Q5_K_M    5.69          ~87%           ~5.1 GB     Moderate
HIGH       Q6_K      6.56          ~93%           ~5.9 GB     Moderate
ULTRA      Q8_0      8.50          ~98%           ~7.6 GB     Slower
Sizes shown are approximate for a 7B parameter model. Actual sizes depend on model architecture.
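The sizes follow directly from the bits-per-weight column: roughly parameters × bits per weight ÷ 8 bytes, plus a small overhead for metadata and quantization scales. The helper below is an illustrative sketch of that estimate; the function name and the 5% overhead factor are assumptions for illustration, not values from any spec.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float, overhead: float = 0.05) -> float:
    """Approximate on-disk size of a quantized model in decimal gigabytes."""
    raw_bytes = n_params * bits_per_weight / 8          # packed weight data
    return raw_bytes * (1 + overhead) / 1e9             # plus metadata/scale overhead

# Example: the COMPACT (Q4_K_M) row for a 7B model.
print(f"~{estimate_size_gb(7e9, 4.85):.1f} GB")         # roughly matches the ~4.4 GB above
```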

How Quantization Works

Full-precision models store each weight as a 16-bit or 32-bit floating point number. Quantization maps these values to lower bit representations:
  • Q8_0: 8-bit integer quantization with minimal quality loss
  • Q6_K: 6-bit with K-quant optimization
  • Q4_K_M: 4-bit mixed precision (important layers get higher precision)
  • Q2_K: 2-bit aggressive compression
The “K” in K-quant types means the quantization uses importance-based allocation — more important tensors get higher precision.
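To make the idea concrete, here is an illustrative block-quantization round trip in the spirit of Q8_0: each block of weights shares one scale factor and the weights themselves are stored as signed 8-bit integers. This is a simplified sketch of the concept, not the actual GGUF storage layout; real K-quant types pack bits into super-blocks and mix precisions per tensor.

```python
import numpy as np

def quantize_q8_blocks(weights: np.ndarray, block_size: int = 32):
    """Quantize a flat float array into int8 blocks with one scale per block.
    Illustrative only; the real Q8_0 format packs these values differently."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block: the largest absolute value maps to the int8 limit (127).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # guard against all-zero blocks
    quantized = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return quantized, scales.astype(np.float16)

def dequantize_q8_blocks(quantized: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from int8 values and per-block scales."""
    return (quantized.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

# Round-trip demo: the reconstruction error is the quality loss the table above estimates.
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, s = quantize_q8_blocks(w)
print("mean abs error:", float(np.abs(w - dequantize_q8_blocks(q, s)).mean()))
```

Shrinking the integer range from 8 bits to 4 or 2 bits grows this reconstruction error, which is why quality drops as you move up the table toward Q2_K.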

Choosing a Level

Mobile / Edge

Q3_K_M or Q4_K_M
For phones, Raspberry Pi, or systems with less than 8 GB of RAM. Noticeable quality loss, but usable.

Desktop / Laptop

Q5_K_M (recommended)
Best balance for most users. Good quality with significant size reduction.

Server / Production

Q8_0
Near-original quality. Use when output quality is critical and storage/RAM is not a concern.
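If you prefer to start from a RAM budget rather than a hardware class, a hypothetical helper like the one below maps free memory to the highest-quality level that fits, using the bits-per-weight figures from the overview table and the ~1.2x rule of thumb from the RAM Requirements section below. The function and its thresholds are illustrative, not official cut-offs.

```python
def suggest_level(free_ram_gb: float, n_params_b: float = 7.0) -> str:
    """Return the highest-quality level whose estimated RAM footprint fits the budget."""
    # Bits per weight from the overview table, ordered from highest to lowest quality.
    levels = [("Q8_0", 8.50), ("Q6_K", 6.56), ("Q5_K_M", 5.69),
              ("Q4_K_M", 4.85), ("Q3_K_M", 3.91), ("Q2_K", 2.63)]
    for name, bpw in levels:
        file_gb = n_params_b * 1e9 * bpw / 8 / 1e9      # approximate file size
        if file_gb * 1.2 <= free_ram_gb:                # ~1.2x RAM rule of thumb
            return name
    return "even Q2_K will not fit; consider a smaller model"

print(suggest_level(8.0))   # e.g. a machine with 8 GB free -> Q6_K for a 7B model
```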

Requantization Warning

Quantizing an already-quantized model degrades quality further. Always quantize from the highest available precision (F16 or F32 source).
If your model is already Q4_K_M, requantizing it to Q8_0 will not restore quality: the precision lost in the original quantization is gone for good, so the only effect is a larger file.

RAM Requirements

As a rule of thumb, you need roughly 1.2x the file size in RAM to load a GGUF model for inference:
Quantization   7B Model   13B Model   70B Model
Q4_K_M         ~5.3 GB    ~9.6 GB     ~42 GB
Q5_K_M         ~6.1 GB    ~11 GB      ~49 GB
Q8_0           ~9.1 GB    ~17 GB      ~74 GB
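A quick way to apply the rule of thumb is sketched below. The helper is illustrative; the 1.2 factor is only an approximation that covers runtime buffers such as the KV cache, and real usage grows with context length.

```python
def estimate_ram_gb(file_size_gb: float, overhead_factor: float = 1.2) -> float:
    """Rough minimum RAM needed to load and run a GGUF file of the given size."""
    return file_size_gb * overhead_factor

# Example: a Q4_K_M 7B file of ~4.4 GB -> ~5.3 GB of RAM, matching the table above.
print(f"~{estimate_ram_gb(4.4):.1f} GB")
```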