Supported Formats

ForgeAI works with two primary model formats (GGUF and SafeTensors) and four dataset formats (JSON, JSONL, CSV, Parquet).

Model Formats

GGUF

GGUF (GPT-Generated Unified Format) is the standard model format for llama.cpp and its ecosystem. Characteristics:

Single file containing weights, metadata, and tokenizer
Supports quantized dtypes (Q2_K through Q8_0, plus F16/F32)
Used by llama.cpp, Ollama, LM Studio, KoboldCpp
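Because everything lives in one file, a GGUF file can be identified from its first few bytes. The following is a minimal sketch, not ForgeAI's loader: it assumes the little-endian header layout used by GGUF version 2 and later (4-byte magic, 32-bit version, 64-bit tensor count, 64-bit metadata key/value count). The function name `read_gguf_header` is hypothetical.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4 (magic) + 4 (version) + 8 (tensor count) + 8 (metadata count)


def read_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size part of a GGUF header (assumed: v2+ layout, little-endian)."""
    if len(data) < HEADER_SIZE:
        raise ValueError("input is too short to contain a GGUF header")
    if data[:4] != GGUF_MAGIC:
        raise ValueError("missing GGUF magic bytes")
    # "<IQQ" = little-endian uint32, uint64, uint64 with no padding.
    version, tensor_count, metadata_kv_count = struct.unpack_from("<IQQ", data, 4)
    return {
        "version": version,
        "tensor_count": tensor_count,
        "metadata_kv_count": metadata_kv_count,
    }
```

To inspect a real file, read its first 24 bytes in binary mode and pass them in, for example `read_gguf_header(open(path, "rb").read(24))`. The metadata entries and tokenizer data follow this fixed header and are not parsed here.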

ForgeAI support:

Module	Support
Load	Single file
Inspect	Full analysis with capabilities
Compress	Quantize to any level
Hub	Download from HuggingFace
Convert	Output format (SafeTensors → GGUF)
Training	Fine-tune and surgery
M-DNA	Parent and output
Test	Via llama.cpp

SafeTensors

SafeTensors is HuggingFace’s format for storing model weights safely and efficiently. Characteristics:

Typically paired with config.json and tokenizer files in a directory
Large models are sharded across multiple .safetensors files
Supports F16, BF16, F32 dtypes
Used by HuggingFace Transformers, vLLM, ExLlamaV2, MLX
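The dtypes and shapes of a SafeTensors file can be listed without loading any weights, since the file starts with a JSON header. The sketch below assumes the published SafeTensors layout: an 8-byte little-endian header length, followed by that many bytes of UTF-8 JSON mapping tensor names to `dtype`, `shape`, and `data_offsets`, plus an optional `__metadata__` entry. Both function names are hypothetical and this is not how ForgeAI necessarily reads the format.

```python
import json
import struct


def read_safetensors_header(data: bytes) -> dict:
    """Return the JSON header of a SafeTensors blob as a dict."""
    if len(data) < 8:
        raise ValueError("input is too short to contain a header length")
    (header_len,) = struct.unpack_from("<Q", data, 0)  # little-endian uint64
    header_bytes = data[8:8 + header_len]
    if len(header_bytes) != header_len:
        raise ValueError("header is truncated")
    return json.loads(header_bytes.decode("utf-8"))


def summarize_tensors(header: dict) -> dict:
    """Map tensor name -> (dtype, shape), skipping the optional __metadata__ entry."""
    return {
        name: (info["dtype"], tuple(info["shape"]))
        for name, info in header.items()
        if name != "__metadata__"
    }
```

For a sharded model, the same two calls would be applied to each `.safetensors` file in the folder.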

ForgeAI support:

Module	Support
Load	Single file or sharded folder
Inspect	Full analysis with capabilities
Compress	Not supported (convert to GGUF first)
Hub	Download from HuggingFace
Convert	Input format (→ GGUF)
Training	Fine-tune and surgery
M-DNA	Parent and output
Test	Via HuggingFace Transformers

Folder Structure

SafeTensors models from HuggingFace typically have this structure:

model-name/
├── config.json              # Architecture config (required)
├── tokenizer.json           # Tokenizer data
├── tokenizer_config.json    # Tokenizer settings
├── tokenizer.model          # SentencePiece model (some architectures)
├── special_tokens_map.json  # Special token definitions
├── generation_config.json   # Generation defaults
├── model.safetensors        # Weights (single-file models)
└── model-00001-of-00003.safetensors  # Weights (sharded models, one file per shard)
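As a rough illustration of the layout above, the following sketch takes the file names found in a model folder and returns the weight files in load order. It assumes the `model-XXXXX-of-YYYYY.safetensors` shard naming shown in the tree; `find_weight_files` is a hypothetical helper, and ForgeAI's own validation rules may differ.

```python
import re

# Shard names as shown in the tree above: five-digit index and five-digit total.
SHARD_RE = re.compile(r"model-(\d{5})-of-(\d{5})\.safetensors")


def find_weight_files(file_names):
    """Return weight file names in load order, or raise if the folder looks incomplete."""
    names = set(file_names)
    if "config.json" not in names:
        raise ValueError("config.json is required")
    if "model.safetensors" in names:
        return ["model.safetensors"]
    # Zero-padded indices make lexicographic order equal to numeric order.
    shards = sorted(n for n in names if SHARD_RE.fullmatch(n))
    if not shards:
        raise ValueError("no .safetensors weight files found")
    total = int(SHARD_RE.fullmatch(shards[0]).group(2))
    expected = [f"model-{i:05d}-of-{total:05d}.safetensors" for i in range(1, total + 1)]
    if shards != expected:
        raise ValueError("shard set is incomplete or inconsistent")
    return shards
```

In practice the input would come from something like `os.listdir(model_dir)`; taking a plain list keeps the check independent of the file system.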

Model Format Comparison

Feature	GGUF	SafeTensors
File count	Single file	Multiple files
Metadata	Embedded in file	Separate JSON files
Tokenizer	Embedded	Separate files
Quantization	Native (Q2–Q8)	F16/BF16/F32 only
Sharding	No	Yes
Ecosystem	llama.cpp	HuggingFace

Dataset Formats

JSON

Array of objects in a single file.

[
  {"instruction": "...", "input": "...", "output": "..."},
  {"instruction": "...", "input": "...", "output": "..."}
]

JSONL

One JSON object per line (JSON Lines format).

{"instruction": "...", "input": "...", "output": "..."}
{"instruction": "...", "input": "...", "output": "..."}

CSV

Comma-separated values with a header row.

instruction,input,output
"...","...","..."

Parquet

Apache Parquet is a columnar binary format. Most HuggingFace datasets use this format. ForgeAI reads Parquet natively in Rust using Apache Arrow, so no Python is required.
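For the three text-based formats (JSON, JSONL, CSV), the sketch below shows how each layout reduces to the same list of records. It uses only the Python standard library; `parse_records` is a hypothetical helper for illustration and is not ForgeAI's own loader. Note that CSV values always come back as strings, and that quoted CSV fields must not have a space between the comma and the opening quote.

```python
import csv
import io
import json


def parse_records(text: str, fmt: str) -> list:
    """Parse dataset text in one of the text formats into a list of dicts."""
    if fmt == "json":
        records = json.loads(text)
        if not isinstance(records, list):
            raise ValueError("JSON datasets must be a top-level array of objects")
        return records
    if fmt == "jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if fmt == "csv":
        # The header row supplies the keys for every record.
        return [dict(row) for row in csv.DictReader(io.StringIO(text))]
    raise ValueError(f"unsupported format: {fmt}")
```

Parquet is left out on purpose: it is a binary format and needs a dedicated reader such as Apache Arrow.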

Dataset Format Support

Module	JSON	JSONL	CSV	Parquet
DataStudio	✓	✓	✓	✓
Training	✓	✓	✓	✓
HuggingFace	✓	✓	✓	✓

Dataset Template Detection

ForgeAI auto-detects these common dataset templates:

Template	Key Columns
Alpaca	instruction, input, output
ShareGPT	conversations
ChatML	messages
DPO	prompt, chosen, rejected
Text	text
Prompt/Completion	prompt, completion
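One way to picture detection from the table above is a lookup on column names. The sketch below is an assumption about how such matching could work, not ForgeAI's actual rule: it returns the first template whose key columns are all present, and it checks templates with more key columns first so that a DPO dataset is not mistaken for a single-column match. `detect_template` and the ordering are hypothetical.

```python
from typing import Iterable, Optional

# Key columns from the table above, ordered from most to fewest key columns.
TEMPLATES = [
    ("Alpaca", {"instruction", "input", "output"}),
    ("DPO", {"prompt", "chosen", "rejected"}),
    ("Prompt/Completion", {"prompt", "completion"}),
    ("ShareGPT", {"conversations"}),
    ("ChatML", {"messages"}),
    ("Text", {"text"}),
]


def detect_template(columns: Iterable[str]) -> Optional[str]:
    """Return the first template whose key columns all appear in `columns`."""
    present = set(columns)
    for name, keys in TEMPLATES:
        if keys <= present:  # subset test: every key column is present
            return name
    return None
```

Extra columns (for example an `id` field) do not prevent a match under this rule, and a dataset with none of the key columns yields `None`.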
