Compress (03)

Quantize GGUF models to smaller sizes using llama-quantize. Preview estimated size, quality, and speed before running.
Requires a loaded GGUF model and llama.cpp tools installed (see Settings).

Quantization Levels

| Level | Type | Bits/Weight | Quality | Use Case |
|-------|------|-------------|---------|----------|
| EXTREME | Q2_K | 2.63 | ~60% | Smallest possible, significant quality loss |
| TINY | Q3_K_S | 3.50 | ~68% | Very small, noticeable quality loss |
| SMALL | Q3_K_M | 3.91 | ~72% | Small with better quality |
| COMPACT | Q4_K_M | 4.85 | ~80% | Good balance for most use cases |
| BALANCED | Q5_K_M | 5.69 | ~87% | Recommended general purpose |
| HIGH | Q6_K | 6.56 | ~93% | Near-original quality |
| ULTRA | Q8_0 | 8.50 | ~98% | Minimal quality loss |
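
The Bits/Weight column maps directly to file size: a quantized model is roughly parameter count × bits per weight ÷ 8 bytes. A minimal Python sketch using the table's figures (the 7B parameter count is an assumed example; real GGUF files also carry metadata and mixed-precision tensors, so actual sizes differ slightly):

```python
# Rough size estimate from the Bits/Weight column above.
# Illustrative only: actual GGUF sizes include metadata and
# per-tensor precision differences.

BITS_PER_WEIGHT = {
    "Q2_K": 2.63, "Q3_K_S": 3.50, "Q3_K_M": 3.91, "Q4_K_M": 4.85,
    "Q5_K_M": 5.69, "Q6_K": 6.56, "Q8_0": 8.50,
}

def estimated_size_gb(n_params: float, quant_type: str) -> float:
    """Approximate quantized file size in gigabytes."""
    return n_params * BITS_PER_WEIGHT[quant_type] / 8 / 1e9

# Assumed example: a 7-billion-parameter model at Q4_K_M
print(f"{estimated_size_gb(7e9, 'Q4_K_M'):.2f} GB")  # ~4.24 GB
```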

Presets

MOBILE
Q3_K_M — Edge devices, phones, low-RAM systems

BALANCED
Q5_K_M — General purpose desktops and laptops

QUALITY
Q8_0 — Production servers, quality-critical applications
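
In code terms, each preset card is just a shortcut to one quantization type. A sketch of the mapping (the dictionary below is illustrative, not the module's actual data structure):

```python
# Preset cards map one-to-one onto quantization types (see above).
PRESETS = {
    "MOBILE": "Q3_K_M",    # edge devices, phones, low-RAM systems
    "BALANCED": "Q5_K_M",  # general-purpose desktops and laptops
    "QUALITY": "Q8_0",     # production servers, quality-critical apps
}
```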

Size & Quality Preview

Before quantizing, a preview shows:
  • Before/After file sizes with reduction percentage (approximated in the sketch below)
  • Component breakdown: attention, MLP, embeddings, output head, norms
  • Quality estimate bar
  • Speed improvement relative to original
Requantizing an already-quantized model produces worse results than quantizing from a high-precision source (F16/F32).
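
The before/after figures in the preview follow the same arithmetic as the levels table. A self-contained sketch assuming an F16 source (16 bits/weight) and a hypothetical 7B-parameter model; the actual preview reads tensor sizes from the GGUF file rather than estimating:

```python
# Sketch of the Before/After preview numbers for an F16 -> Q5_K_M run.
N_PARAMS = 7e9       # assumed 7B-parameter model
SOURCE_BPW = 16.0    # F16 source, 16 bits per weight
TARGET_BPW = 5.69    # Q5_K_M, from the levels table

before_gb = N_PARAMS * SOURCE_BPW / 8 / 1e9
after_gb = N_PARAMS * TARGET_BPW / 8 / 1e9
reduction = (1 - after_gb / before_gb) * 100

print(f"before {before_gb:.2f} GB -> after {after_gb:.2f} GB "
      f"({reduction:.0f}% smaller)")
# before 14.00 GB -> after 4.98 GB (64% smaller)
```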

Workflow

1. Load a GGUF model: use the Load module to import a GGUF file.
2. Select target level: click a quantization level button or a preset card.
3. Review preview: check the estimated size, quality, and speed.
4. Quantize: click QUANTIZE MODEL, choose the output path, and monitor progress (a command-line equivalent is sketched after the note below).
The original model is never modified — output is always a new GGUF file.
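
The module drives llama.cpp's llama-quantize for you (see the intro above); to reproduce step 4 manually, a minimal sketch of the equivalent call (file names are hypothetical; assumes llama-quantize is on your PATH):

```python
# Equivalent of the QUANTIZE MODEL button: run llama.cpp's
# llama-quantize on the loaded file. The source file is not modified;
# the quantized model is written to a new path.
import subprocess

src = "model-f16.gguf"      # hypothetical loaded model
dst = "model-Q5_K_M.gguf"   # hypothetical output path
qtype = "Q5_K_M"            # target level from the table above

# llama-quantize usage: llama-quantize <input> <output> <type> [nthreads]
subprocess.run(["llama-quantize", src, dst, qtype], check=True)
```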