
Supported Formats

ForgeAI works with two primary model formats (GGUF and SafeTensors) and four dataset formats (JSON, JSONL, CSV, Parquet).

Model Formats

GGUF

GGUF (GPT-Generated Unified Format) is the standard format for llama.cpp and its ecosystem. Characteristics:
  • Single file containing weights, metadata, and tokenizer
  • Supports quantized dtypes (Q2_K through Q8_0, plus F16/F32)
  • Used by llama.cpp, Ollama, LM Studio, KoboldCpp
ForgeAI support:
  Module      Support
  Load        Single file
  Inspect     Full analysis with capabilities
  Compress    Quantize to any level
  Hub         Download from HuggingFace
  Convert     Output format
  Training    Fine-tune and surgery
  M-DNA       Parent and output
  Test        Via llama.cpp
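For readers who want to poke at the format directly, the sketch below reads the fixed GGUF header (magic, version, tensor count, metadata key/value count) in Rust. It follows the publicly documented GGUF layout rather than ForgeAI's internal loader, and the file name is a placeholder.

use std::fs::File;
use std::io::{self, Read};

/// Read the fixed-size GGUF header: magic, version, tensor count, metadata KV count.
fn read_gguf_header(path: &str) -> io::Result<(u32, u64, u64)> {
    let mut f = File::open(path)?;
    let mut buf = [0u8; 4 + 4 + 8 + 8];
    f.read_exact(&mut buf)?;

    // All header fields are little-endian per the GGUF spec.
    if &buf[0..4] != b"GGUF" {
        return Err(io::Error::new(io::ErrorKind::InvalidData, "not a GGUF file"));
    }
    let version = u32::from_le_bytes(buf[4..8].try_into().unwrap());
    let tensor_count = u64::from_le_bytes(buf[8..16].try_into().unwrap());
    let metadata_kv_count = u64::from_le_bytes(buf[16..24].try_into().unwrap());
    Ok((version, tensor_count, metadata_kv_count))
}

fn main() -> io::Result<()> {
    // Placeholder path -- point this at any local .gguf file.
    let (version, tensors, kvs) = read_gguf_header("model.gguf")?;
    println!("GGUF v{version}: {tensors} tensors, {kvs} metadata entries");
    Ok(())
}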

SafeTensors

SafeTensors is HuggingFace’s format for storing model weights safely and efficiently. Characteristics:
  • Typically paired with config.json and tokenizer files in a directory
  • Large models are sharded across multiple .safetensors files
  • Supports F16, BF16, F32 dtypes
  • Used by HuggingFace Transformers, vLLM, ExLlamaV2, MLX
ForgeAI support:
  Module      Support
  Load        Single file or sharded folder
  Inspect     Full analysis with capabilities
  Compress    Not supported (convert to GGUF first)
  Hub         Download from HuggingFace
  Convert     Input format (→ GGUF)
  Training    Fine-tune and surgery
  M-DNA       Parent and output
  Test        Via HuggingFace Transformers
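As an illustration of why tensor names, shapes, and dtypes can be listed without loading any weights, the sketch below reads the JSON header that prefixes every .safetensors file: 8 bytes of little-endian header length, then that many bytes of JSON. It assumes the serde_json crate and a placeholder file name; it is a minimal sketch, not ForgeAI's own loader.

use std::fs::File;
use std::io::Read;
use serde_json::Value;

/// Read the JSON header at the start of a .safetensors file.
fn read_safetensors_header(path: &str) -> Result<Value, Box<dyn std::error::Error>> {
    let mut f = File::open(path)?;
    let mut len_bytes = [0u8; 8];
    f.read_exact(&mut len_bytes)?;
    let header_len = u64::from_le_bytes(len_bytes) as usize;

    let mut header = vec![0u8; header_len];
    f.read_exact(&mut header)?;
    Ok(serde_json::from_slice(&header)?)
}

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path -- any local .safetensors shard works.
    let header = read_safetensors_header("model.safetensors")?;
    if let Value::Object(map) = &header {
        for (name, info) in map {
            // "__metadata__" holds free-form strings; every other key is a tensor
            // entry with dtype, shape, and byte offsets into the data section.
            println!("{name}: {info}");
        }
    }
    Ok(())
}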

Folder Structure

SafeTensors models from HuggingFace typically have this structure:
model-name/
├── config.json              # Architecture config (required)
├── tokenizer.json           # Tokenizer data
├── tokenizer_config.json    # Tokenizer settings
├── tokenizer.model          # SentencePiece model (some architectures)
├── special_tokens_map.json  # Special token definitions
├── generation_config.json   # Generation defaults
├── model.safetensors        # Weights (single file)
└── model-00001-of-00003.safetensors  # Or sharded
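A minimal check of that layout might look like the following: it only verifies that config.json exists and counts .safetensors files, which is enough to tell a single-file model from a sharded one. The directory name is the placeholder from the tree above, and the logic is illustrative rather than ForgeAI's actual validation.

use std::fs;
use std::path::Path;

/// Rough check of a HuggingFace-style model directory.
fn describe_model_dir(dir: &str) -> std::io::Result<()> {
    let dir = Path::new(dir);
    if !dir.join("config.json").is_file() {
        println!("missing config.json -- not a loadable SafeTensors model");
        return Ok(());
    }
    let shards: Vec<_> = fs::read_dir(dir)?
        .filter_map(|e| e.ok())
        .map(|e| e.path())
        .filter(|p| p.extension().map_or(false, |ext| ext == "safetensors"))
        .collect();
    match shards.len() {
        0 => println!("config.json found but no .safetensors weights"),
        1 => println!("single-file model: {}", shards[0].display()),
        n => println!("sharded model with {n} .safetensors files"),
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    // Placeholder directory name from the example layout above.
    describe_model_dir("model-name")
}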

Model Format Comparison

  Feature        GGUF              SafeTensors
  File count     Single file       Multiple files
  Metadata       Embedded in file  Separate JSON files
  Tokenizer      Embedded          Separate files
  Quantization   Native (Q2–Q8)    F16/BF16/F32 only
  Sharding       No                Yes
  Ecosystem      llama.cpp         HuggingFace

Dataset Formats

JSON

Array of objects in a single file.
[
  {"instruction": "...", "input": "...", "output": "..."},
  {"instruction": "...", "input": "...", "output": "..."}
]

JSONL

One JSON object per line (JSON Lines format).
{"instruction": "...", "input": "...", "output": "..."}
{"instruction": "...", "input": "...", "output": "..."}

CSV

Comma-separated values with a header row.
instruction,input,output
"...", "...", "..."

Parquet

Apache Parquet is a columnar binary format, and most HuggingFace datasets are published in it. ForgeAI reads Parquet natively in Rust using Apache Arrow; no Python is required.
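A minimal example of that Arrow-based path, using the parquet and arrow crates from the Apache Arrow Rust project (not ForgeAI's own code), looks like this; the file name is a placeholder.

use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder file name -- any Parquet dataset file works.
    let file = File::open("train.parquet")?;
    let builder = ParquetRecordBatchReaderBuilder::try_new(file)?;

    // The schema is available before any rows are read, which is enough
    // to detect column names like "instruction" or "text".
    println!("columns: {:?}", builder.schema().fields());

    let reader = builder.build()?;
    let mut rows = 0usize;
    for batch in reader {
        rows += batch?.num_rows();
    }
    println!("rows: {rows}");
    Ok(())
}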

Dataset Format Support

  Module        JSON    JSONL    CSV    Parquet
  DataStudio
  Training
  HuggingFace

Dataset Template Detection

ForgeAI auto-detects these common dataset templates:
  Template            Key Columns
  Alpaca              instruction, input, output
  ShareGPT            conversations
  ChatML              messages
  DPO                 prompt, chosen, rejected
  Text                text
  Prompt/Completion   prompt, completion
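Detection of this kind usually comes down to matching a dataset's column names against the key columns above. The sketch below is an illustrative version of that matching, not ForgeAI's actual detector; it checks the more specific column sets before the generic ones, so a plain text column only wins when nothing else matches.

use std::collections::HashSet;

/// Guess a dataset template from its column names, mirroring the table above.
fn detect_template(columns: &[&str]) -> &'static str {
    let cols: HashSet<&str> = columns.iter().copied().collect();
    let has = |names: &[&str]| names.iter().all(|n| cols.contains(n));

    // More specific column sets are checked before generic ones.
    if has(&["prompt", "chosen", "rejected"]) {
        "DPO"
    } else if has(&["instruction", "output"]) {
        "Alpaca"
    } else if has(&["conversations"]) {
        "ShareGPT"
    } else if has(&["messages"]) {
        "ChatML"
    } else if has(&["prompt", "completion"]) {
        "Prompt/Completion"
    } else if has(&["text"]) {
        "Text"
    } else {
        "Unknown"
    }
}

fn main() {
    assert_eq!(detect_template(&["instruction", "input", "output"]), "Alpaca");
    assert_eq!(detect_template(&["prompt", "chosen", "rejected"]), "DPO");
    assert_eq!(detect_template(&["text"]), "Text");
    println!("template detection sketch OK");
}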