GPU Setup
ForgeAI uses GPU acceleration for inference (Test module), fine-tuning (Training module), and model conversion (Convert module).
Auto-Detection
Go to Settings (07) to see your detected hardware:
| Field | Description |
|---|---|
| NVIDIA | GPU name, VRAM, CUDA version |
| VULKAN | Cross-platform GPU API support |
| METAL | Apple Silicon support (macOS) |
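ForgeAI runs this detection itself at startup. If you want to cross-check the NVIDIA side outside the app, a minimal probe using standard `nvidia-smi` query flags might look like this (this is an independent sanity check, not how ForgeAI detects hardware):

```python
import subprocess

# Query name, total VRAM, and driver version for each NVIDIA GPU.
# The call fails if no NVIDIA driver is installed, which is itself
# a useful signal when Settings shows no GPU.
try:
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total,driver_version",
         "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    )
    print(out.stdout.strip())
except (FileNotFoundError, subprocess.CalledProcessError):
    print("No NVIDIA GPU/driver detected")
```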
llama.cpp Variants
For GGUF inference and quantization, install the appropriate llama.cpp variant:
CUDA (NVIDIA)
Fastest option for NVIDIA GPUs.
Requirements:
- NVIDIA GPU (GTX 1060+ / RTX series)
- NVIDIA drivers 515+
In Settings → llama.cpp Tools → select CUDA → DOWNLOAD & INSTALL
Vulkan (Cross-platform)
Works with NVIDIA, AMD, and Intel GPUs.
Requirements:
- Vulkan-compatible GPU
- Vulkan drivers installed
In Settings → llama.cpp Tools → select VULKAN → DOWNLOAD & INSTALL
CPU
Universal fallback, no GPU needed. Works on any system. Slower than GPU variants but always available.
In Settings → llama.cpp Tools → select CPU → DOWNLOAD & INSTALL
Python Environments (Training & Convert)
ForgeAI manages two separate Python virtual environments, each with GPU-aware PyTorch:
Training Environment
Used by the Training module for LoRA, QLoRA, SFT, DPO, and full fine-tuning:
- NVIDIA GPU detected: PyTorch is installed with CUDA support automatically during setup (you can verify this with the check below)
- No GPU: CPU-only PyTorch is installed (training will be slow)
- Includes: transformers, peft, trl, bitsandbytes, datasets
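Settings reports the detected device, but you can also confirm from inside the environment with standard PyTorch calls:

```python
import torch

# True only if PyTorch was built with CUDA and a working driver is present.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Total VRAM of device 0, in GB.
    props = torch.cuda.get_device_properties(0)
    print(f"VRAM: {props.total_memory / 1024**3:.1f} GB")
```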
Convert Environment
Used by the Convert module for SafeTensors-to-GGUF conversion and by Test for SafeTensors inference:
- NVIDIA GPU detected: PyTorch is installed with CUDA support automatically during setup
- No GPU: CPU-only PyTorch is installed
- OOM fallback: If the model doesn’t fit in GPU VRAM, ForgeAI automatically falls back to CPU inference (sketched below)
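The fallback is handled inside ForgeAI; conceptually it amounts to something like the following sketch (the model path is a placeholder, and the exact retry logic is an assumption):

```python
import torch
from transformers import AutoModelForCausalLM

MODEL_ID = "path/to/model"  # placeholder, not a real ForgeAI path

# Try the GPU first; if the weights don't fit in VRAM, retry on CPU.
try:
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float16
    ).to("cuda")
except torch.cuda.OutOfMemoryError:
    torch.cuda.empty_cache()
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype=torch.float32
    )  # CPU inference: slower, but it always completes
```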
Both environments can be managed (viewed, cleaned, deleted) in Settings.
VRAM Requirements
Inference
Approximate VRAM needed to load models on GPU:
| Model Size | Q4_K_M | Q8_0 | F16 |
|---|---|---|---|
| 7B | ~4.5 GB | ~7.5 GB | ~14 GB |
| 13B | ~8 GB | ~14 GB | ~26 GB |
| 70B | ~40 GB | ~70 GB | ~140 GB |
If your model exceeds available VRAM, GGUF inference via llama.cpp can offload some layers to CPU RAM. SafeTensors inference automatically falls back to full CPU mode.
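The table values follow from bits per weight, plus headroom for the KV cache and compute buffers. A back-of-the-envelope check, using assumed approximate bit widths rather than exact GGUF figures:

```python
# Approximate bits per weight per quantization (assumed round numbers).
BITS_PER_WEIGHT = {"Q4_K_M": 4.5, "Q8_0": 8.5, "F16": 16.0}

def weight_vram_gb(params_billions: float, quant: str) -> float:
    """VRAM for the weights alone; KV cache and buffers add more on top."""
    return params_billions * BITS_PER_WEIGHT[quant] / 8

for size in (7, 13, 70):
    print(f"{size}B Q4_K_M: ~{weight_vram_gb(size, 'Q4_K_M'):.1f} GB")
# 7B ≈ 3.9 GB, 13B ≈ 7.3 GB, 70B ≈ 39.4 GB — the table's figures are
# these weight sizes plus headroom for the KV cache and buffers.
```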
Training
Approximate VRAM needed for fine-tuning a 7B model:
| Method | Minimum VRAM | Recommended |
|---|---|---|
| QLoRA (4-bit) | 4 GB | 8 GB |
| LoRA | 6 GB | 12 GB |
| SFT | 8 GB | 16 GB |
| DPO | 8 GB | 16 GB |
| Full Fine-Tune | 16 GB | 24+ GB |
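QLoRA sits at the bottom of the table because the frozen base weights are held in 4-bit while only small adapter matrices are trained. ForgeAI configures this internally; a minimal sketch of the equivalent transformers/peft setup (model path and hyperparameters are illustrative, not ForgeAI's actual defaults):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "path/to/7b-model",  # placeholder
    quantization_config=bnb,
)

# Small trainable LoRA adapters on top; illustrative hyperparameters.
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically <1% of base parameters
```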
Layer surgery (remove/duplicate layers) is pure Rust and requires no GPU or Python — it works on any system.
Troubleshooting
| Issue | Solution |
|---|---|
| GPU not detected | Update NVIDIA/Vulkan drivers |
| CUDA variant fails to install | Ensure NVIDIA drivers are 515+ |
| Slow inference despite GPU | Check Settings to confirm CUDA/Vulkan variant is installed, not CPU |
| Out of memory (inference) | Use a smaller quantization (Q4_K_M instead of Q8_0) or switch to CPU |
| Out of memory (training) | Switch to QLoRA, reduce batch size or max_seq_length |
| SafeTensors shows CPU device | Re-run Convert module setup to reinstall PyTorch with CUDA |
| Training not using GPU | Verify CUDA is available in Settings → Training Environment |
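For the last two rows, the key question is whether the installed PyTorch wheel was built with CUDA at all. Standard PyTorch attributes distinguish the two failure modes:

```python
import torch

# torch.version.cuda is None on CPU-only builds: re-run the module's
# setup in Settings to reinstall with CUDA. If it is set but
# is_available() is False, the NVIDIA driver is the likely culprit.
print("Built with CUDA:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```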