Quantization Levels

Quantization reduces model precision to decrease file size and increase inference speed, with a trade-off in output quality.

Overview

Level      Type      Bits/Weight   Est. Quality   Size (7B)   Speed
EXTREME    Q2_K      2.63          ~60%           ~2.5 GB     Fastest
TINY       Q3_K_S    3.50          ~68%           ~3.2 GB     Very fast
SMALL      Q3_K_M    3.91          ~72%           ~3.6 GB     Fast
COMPACT    Q4_K_M    4.85          ~80%           ~4.4 GB     Fast
BALANCED   Q5_K_M    5.69          ~87%           ~5.1 GB     Moderate
HIGH       Q6_K      6.56          ~93%           ~5.9 GB     Moderate
ULTRA      Q8_0      8.50          ~98%           ~7.6 GB     Slower
Sizes shown are approximate for a 7B parameter model. Actual sizes depend on model architecture.
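The sizes follow directly from the bits-per-weight column: roughly parameters × bits per weight ÷ 8 bytes, plus a small overhead for metadata and quantization scales. The helper below is an illustrative sketch of that estimate; the function name and the 5% overhead factor are assumptions for illustration, not values from any spec.

```python
def estimate_size_gb(n_params: float, bits_per_weight: float, overhead: float = 0.05) -> float:
    """Approximate on-disk size of a quantized model in decimal gigabytes."""
    raw_bytes = n_params * bits_per_weight / 8          # packed weight data
    return raw_bytes * (1 + overhead) / 1e9             # plus metadata/scale overhead

# Example: the COMPACT (Q4_K_M) row for a 7B model.
print(f"~{estimate_size_gb(7e9, 4.85):.1f} GB")         # roughly matches the ~4.4 GB above
```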

How Quantization Works

Full-precision models store each weight as a 16-bit or 32-bit floating point number. Quantization maps these values to lower bit representations:
  • Q8_0: 8-bit integer quantization with minimal quality loss
  • Q6_K: 6-bit with K-quant optimization
  • Q4_K_M: 4-bit mixed precision (important layers get higher precision)
  • Q2_K: 2-bit aggressive compression
The “K” in K-quant types means the quantization uses importance-based allocation — more important tensors get higher precision.
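To make the idea concrete, here is an illustrative block-quantization round trip in the spirit of Q8_0: each block of weights shares one scale factor and the weights themselves are stored as signed 8-bit integers. This is a simplified sketch of the concept, not the actual GGUF storage layout; real K-quant types pack bits into super-blocks and mix precisions per tensor.

```python
import numpy as np

def quantize_q8_blocks(weights: np.ndarray, block_size: int = 32):
    """Quantize a flat float array into int8 blocks with one scale per block.
    Illustrative only; the real Q8_0 format packs these values differently."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block: the largest absolute value maps to the int8 limit (127).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # guard against all-zero blocks
    quantized = np.clip(np.round(blocks / scales), -127, 127).astype(np.int8)
    return quantized, scales.astype(np.float16)

def dequantize_q8_blocks(quantized: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Reconstruct approximate float weights from int8 values and per-block scales."""
    return (quantized.astype(np.float32) * scales.astype(np.float32)).reshape(-1)

# Round-trip demo: the reconstruction error is the quality loss the table above estimates.
rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
q, s = quantize_q8_blocks(w)
print("mean abs error:", float(np.abs(w - dequantize_q8_blocks(q, s)).mean()))
```

Shrinking the integer range from 8 bits to 4 or 2 bits grows this reconstruction error, which is why quality drops as you move up the table toward Q2_K.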

Choosing a Level

Mobile / Edge

Q3_K_M or Q4_K_M
For phones, Raspberry Pi, or systems with less than 8 GB of RAM. Noticeable quality loss, but usable.

Desktop / Laptop

Q5_K_M (recommended)
Best balance for most users. Good quality with significant size reduction.

Server / Production

Q8_0
Near-original quality. Use when output quality is critical and storage/RAM is not a concern.
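If you prefer to start from a RAM budget rather than a hardware class, a hypothetical helper like the one below maps free memory to the highest-quality level that fits, using the bits-per-weight figures from the overview table and the ~1.2x rule of thumb from the RAM Requirements section below. The function and its thresholds are illustrative, not official cut-offs.

```python
def suggest_level(free_ram_gb: float, n_params_b: float = 7.0) -> str:
    """Return the highest-quality level whose estimated RAM footprint fits the budget."""
    # Bits per weight from the overview table, ordered from highest to lowest quality.
    levels = [("Q8_0", 8.50), ("Q6_K", 6.56), ("Q5_K_M", 5.69),
              ("Q4_K_M", 4.85), ("Q3_K_M", 3.91), ("Q2_K", 2.63)]
    for name, bpw in levels:
        file_gb = n_params_b * 1e9 * bpw / 8 / 1e9      # approximate file size
        if file_gb * 1.2 <= free_ram_gb:                # ~1.2x RAM rule of thumb
            return name
    return "even Q2_K will not fit; consider a smaller model"

print(suggest_level(8.0))   # e.g. a machine with 8 GB free -> Q6_K for a 7B model
```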

Requantization Warning

Quantizing an already-quantized model degrades quality further. Always quantize from the highest available precision (F16 or F32 source).
If your model is already Q4_K_M, requantizing it to Q8_0 will not restore quality: the precision lost in the original quantization is gone for good, so the only effect is a larger file.

RAM Requirements

As a rule of thumb, you need roughly 1.2x the file size in RAM to load a GGUF model for inference:
Quantization   7B Model   13B Model   70B Model
Q4_K_M         ~5.3 GB    ~9.6 GB     ~42 GB
Q5_K_M         ~6.1 GB    ~11 GB      ~49 GB
Q8_0           ~9.1 GB    ~17 GB      ~74 GB
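A quick way to apply the rule of thumb is sketched below. The helper is illustrative; the 1.2 factor is only an approximation that covers runtime buffers such as the KV cache, and real usage grows with context length.

```python
def estimate_ram_gb(file_size_gb: float, overhead_factor: float = 1.2) -> float:
    """Rough minimum RAM needed to load and run a GGUF file of the given size."""
    return file_size_gb * overhead_factor

# Example: a Q4_K_M 7B file of ~4.4 GB -> ~5.3 GB of RAM, matching the table above.
print(f"~{estimate_ram_gb(4.4):.1f} GB")
```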