Quantization Levels
Quantization reduces model precision to decrease file size and increase inference speed, with a trade-off in output quality.

Overview
| Level | Type | Bits/Weight | Est. Quality | Size (7B) | Speed |
|---|---|---|---|---|---|
| EXTREME | Q2_K | 2.63 | ~60% | ~2.5 GB | Fastest |
| TINY | Q3_K_S | 3.50 | ~68% | ~3.2 GB | Very fast |
| SMALL | Q3_K_M | 3.91 | ~72% | ~3.6 GB | Fast |
| COMPACT | Q4_K_M | 4.85 | ~80% | ~4.4 GB | Fast |
| BALANCED | Q5_K_M | 5.69 | ~87% | ~5.1 GB | Moderate |
| HIGH | Q6_K | 6.56 | ~93% | ~5.9 GB | Moderate |
| ULTRA | Q8_0 | 8.50 | ~98% | ~7.6 GB | Slower |
Sizes shown are approximate for a 7B parameter model. Actual sizes depend on model architecture.
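The size column follows almost directly from the bits-per-weight column: multiply the parameter count by bits per weight and divide by 8. A rough sketch of that arithmetic (the helper name is ours, and real files come out slightly larger because of metadata and non-quantized tensors):

```python
# Rough GGUF size estimate: parameters * bits-per-weight / 8 bytes.
# Illustrative only; real files also carry metadata, tokenizer data, and
# some tensors kept at higher precision.
def estimate_size_gb(n_params: float, bits_per_weight: float) -> float:
    return n_params * bits_per_weight / 8 / 1e9

print(f"{estimate_size_gb(7e9, 4.85):.1f} GB")  # ~4.2 GB vs ~4.4 GB listed for Q4_K_M
print(f"{estimate_size_gb(7e9, 8.50):.1f} GB")  # ~7.4 GB vs ~7.6 GB listed for Q8_0
```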
How Quantization Works
Full-precision models store each weight as a 16-bit or 32-bit floating-point number. Quantization maps these values to lower-bit representations:

- Q8_0: 8-bit integer quantization with minimal quality loss
- Q6_K: 6-bit with K-quant optimization
- Q4_K_M: 4-bit mixed precision (important layers get higher precision)
- Q2_K: 2-bit aggressive compression
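To make this concrete, below is a minimal sketch of symmetric 8-bit block quantization in the spirit of Q8_0: weights are split into fixed-size blocks, and each block stores 8-bit integers plus one scale. This illustrates the idea only and is not the actual GGUF on-disk layout. It also explains the 8.50 bits per weight in the table: a block of 32 weights stored as 32 8-bit values plus one 16-bit scale works out to (32 * 8 + 16) / 32 = 8.5 bits per weight.

```python
import numpy as np

def quantize_q8_blocks(weights: np.ndarray, block_size: int = 32):
    """Q8_0-style sketch: per-block int8 values plus one float scale each."""
    blocks = weights.reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)       # avoid divide-by-zero on all-zero blocks
    q = np.round(blocks / scales).astype(np.int8)     # map each weight into [-127, 127]
    return q, scales

def dequantize_q8_blocks(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_q8_blocks(w)
err = np.abs(w - dequantize_q8_blocks(q, s)).max()
print(f"max round-trip error: {err:.5f}")  # small, hence "minimal quality loss" for Q8_0
```

Lower-bit levels follow the same pattern with fewer bits per value and, for the K-quants, extra tricks such as per-block minimums and mixed precision across layers.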
Choosing a Level
Mobile / Edge
Q3_K_M or Q4_K_M
For phones, Raspberry Pi, or systems with less than 8 GB RAM. Noticeable quality loss but usable.
Desktop / Laptop
Q5_K_M (recommended)
Best balance for most users. Good quality with significant size reduction.
Server / Production
Q8_0
Near-original quality. Use when output quality is critical and storage/RAM is not a concern.
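If you want to automate the choice, a sketch like the following picks the highest-quality level that should fit a given RAM budget for a 7B-class model, using the approximate sizes from the table above and the 1.2x RAM rule of thumb covered in the RAM Requirements section below. The helper and its thresholds are illustrative, not part of any tool:

```python
# Approximate 7B file sizes (GB) from the overview table, ordered lowest to
# highest quality. Hypothetical helper, not part of llama.cpp or any tool.
SIZES_7B_GB = [
    ("Q2_K", 2.5), ("Q3_K_S", 3.2), ("Q3_K_M", 3.6), ("Q4_K_M", 4.4),
    ("Q5_K_M", 5.1), ("Q6_K", 5.9), ("Q8_0", 7.6),
]

def pick_level(ram_budget_gb: float, overhead: float = 1.2) -> str | None:
    """Return the highest-quality level whose estimated RAM need fits the budget."""
    fitting = [name for name, size_gb in SIZES_7B_GB if size_gb * overhead <= ram_budget_gb]
    return fitting[-1] if fitting else None

print(pick_level(8.0))   # Q6_K  (5.9 GB * 1.2 ≈ 7.1 GB fits; Q8_0 would need ~9.1 GB)
print(pick_level(16.0))  # Q8_0
```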
Requantization Warning
If your model is already Q4_K_M, quantizing it to Q8_0 will not improve quality; the precision lost in the original quantization cannot be recovered, so the file just gets bigger with no benefit.

RAM Requirements
As a rule of thumb, you need roughly 1.2x the file size in RAM to load a GGUF model for inference:

| Quantization | 7B Model | 13B Model | 70B Model |
|---|---|---|---|
| Q4_K_M | ~5.3 GB | ~9.6 GB | ~42 GB |
| Q5_K_M | ~6.1 GB | ~11 GB | ~49 GB |
| Q8_0 | ~9.1 GB | ~17 GB | ~74 GB |
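As a quick check of the table, the rule of thumb is just file size times 1.2. This is a rough sketch only; actual usage also grows with context length (KV cache) and runtime overhead:

```python
# 1.2x rule of thumb applied to the 7B sizes from the overview table.
# Real usage also depends on context length, KV cache, and runtime overhead.
for level, size_gb in [("Q4_K_M", 4.4), ("Q5_K_M", 5.1), ("Q8_0", 7.6)]:
    print(f"{level}: ~{size_gb * 1.2:.1f} GB RAM")  # 5.3, 6.1, 9.1 GB, matching the table
```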