GPU Setup
ForgeAI uses GPU acceleration for inference (Test module), fine-tuning (Training module), and model conversion (Convert module).
Auto-Detection
Go to Settings (07) to see your detected hardware:
| Field | Description |
|---|---|
| NVIDIA | GPU name, VRAM, CUDA version |
| VULKAN | Cross-platform GPU API support |
| METAL | Apple Silicon support (macOS) |
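ForgeAI runs this detection itself at startup. If you want to cross-check the NVIDIA side outside the app, a minimal probe using standard `nvidia-smi` query flags might look like this (this is an independent sanity check, not how ForgeAI detects hardware):

```python
import subprocess

# Query name, total VRAM, and driver version for each NVIDIA GPU.
# The call fails if no NVIDIA driver is installed, which is itself
# a useful signal when Settings shows no GPU.
try:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("No NVIDIA GPU/driver detected")
```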
llama.cpp Variants
For GGUF inference and quantization, install the appropriate llama.cpp variant:
CUDA (NVIDIA)
Fastest option for NVIDIA GPUs.
Requirements:
- NVIDIA GPU (GTX 1060+ / RTX series)
- NVIDIA drivers 515+
In Settings → llama.cpp Tools → select CUDA → DOWNLOAD & INSTALL
Vulkan (Cross-platform)
Works with NVIDIA, AMD, and Intel GPUs.
Requirements:
- Vulkan-compatible GPU
- Vulkan drivers installed
In Settings → llama.cpp Tools → select VULKAN → DOWNLOAD & INSTALL
CPU
Universal fallback, no GPU needed. Works on any system. Slower than GPU variants but always available.
In Settings → llama.cpp Tools → select CPU → DOWNLOAD & INSTALL
Python Environments (Training & Convert)
ForgeAI manages two separate Python virtual environments, each with GPU-aware PyTorch:
Training Environment
Used by the Training module for LoRA, QLoRA, SFT, DPO, and full fine-tuning:
- NVIDIA GPU detected: PyTorch is installed with CUDA support automatically during setup (you can verify this with the check below)
- No GPU: CPU-only PyTorch is installed (training will be slow)
- Includes: transformers, peft, trl, bitsandbytes, datasets
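Settings reports the detected device, but you can also confirm from inside the environment with standard PyTorch calls:

```python
import torch

# True only if PyTorch was built with CUDA and a working driver is present.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Total VRAM of device 0, in GB.
    props = torch.cuda.get_device_properties(0)
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
```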
Convert Environment
Used by the Convert module for SafeTensors-to-GGUF conversion and by Test for SafeTensors inference:
- NVIDIA GPU detected: PyTorch is installed with CUDA support automatically during setup
- No GPU: CPU-only PyTorch is installed
- OOM fallback: If the model doesn’t fit in GPU VRAM, ForgeAI automatically falls back to CPU inference (sketched below)
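The fallback is handled inside ForgeAI; conceptually it amounts to something like the following sketch (the model path is a placeholder, and the exact retry logic is an assumption):

```python
import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "path/to/model"  # placeholder, not a real ForgeAI path

# Try the GPU first; if the weights don't fit in VRAM, retry on CPU.
try:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda")
except torch.cuda.OutOfMemoryError:
    torch.cuda.empty_cache()
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float32
    )  # CPU inference: slower, but it always completes
```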
Both environments can be managed (viewed, cleaned, deleted) in Settings.
VRAM Requirements
Inference
Approximate VRAM needed to load models on GPU:
| Model Size | Q4_K_M | Q8_0 | F16 |
|---|---|---|---|
| 7B | ~4.5 GB | ~7.5 GB | ~14 GB |
| 13B | ~8 GB | ~14 GB | ~26 GB |
| 70B | ~40 GB | ~70 GB | ~140 GB |
If your model exceeds available VRAM, GGUF inference via llama.cpp can offload some layers to CPU RAM. SafeTensors inference automatically falls back to full CPU mode.
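The table values follow from bits per weight, plus headroom for the KV cache and compute buffers. A back-of-the-envelope check, using assumed approximate bit widths rather than exact GGUF figures:

```python
# Approximate bits per weight per quantization (assumed round numbers).
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q8_0": 8.5, "F16": 16.0}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """VRAM for the weights alone; KV cache and buffers add more on top."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for size in (7, 13, 70):
    print(f"{size}B Q4_K_M: ~{weight_vram_gb(size, 'Q4_K_M'):.1f} GB")
# 7B ≈ 3.9 GB, 13B ≈ 7.3 GB, 70B ≈ 39.4 GB — the table's figures are
# these weight sizes plus headroom for the KV cache and buffers.
```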
Training
Approximate VRAM needed for fine-tuning a 7B model:
| Method | Minimum VRAM | Recommended |
|---|---|---|
| QLoRA (4-bit) | 4 GB | 8 GB |
| LoRA | 6 GB | 12 GB |
| SFT | 8 GB | 16 GB |
| DPO | 8 GB | 16 GB |
| Full Fine-Tune | 16 GB | 24+ GB |
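QLoRA sits at the bottom of the table because the frozen base weights are held in 4-bit while only small adapter matrices are trained. ForgeAI configures this internally; a minimal sketch of the equivalent transformers/peft setup (model path and hyperparameters are illustrative, not ForgeAI's actual defaults):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/7b-model",  # placeholder
    quantization_config=bnb,
)

# Small trainable LoRA adapters on top; illustrative hyperparameters.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically <1% of base parameters
```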
Layer surgery (remove/duplicate layers) is pure Rust and requires no GPU or Python — it works on any system.
Troubleshooting
| Issue | Solution |
|---|---|
| GPU not detected | Update NVIDIA/Vulkan drivers |
| CUDA variant fails to install | Ensure NVIDIA drivers are 515+ |
| Slow inference despite GPU | Check Settings to confirm CUDA/Vulkan variant is installed, not CPU |
| Out of memory (inference) | Use a smaller quantization (Q4_K_M instead of Q8_0) or switch to CPU |
| Out of memory (training) | Switch to QLoRA, reduce batch size or max_seq_length |
| SafeTensors shows CPU device | Re-run Convert module setup to reinstall PyTorch with CUDA |
| Training not using GPU | Verify CUDA is available in Settings → Training Environment |
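For the last two rows, the key question is whether the installed PyTorch wheel was built with CUDA at all. Standard PyTorch attributes distinguish the two failure modes:

```python
import torch

# torch.version.cuda is None on CPU-only builds: re-run the module's
# setup in Settings to reinstall with CUDA. If it is set but
# is_available() is False, the NVIDIA driver is the likely culprit.
print("Built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```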