DataStudio (10)

DataStudio is ForgeAI’s dataset explorer. Load datasets from local files or HuggingFace, analyze column structure, detect templates, and preview data, all powered by native Rust parsing (including Parquet via Apache Arrow).

Source Modes

DataStudio has two source modes, LOCAL and HUGGINGFACE, toggled via the source bar.
In LOCAL mode, browse and load dataset files from your local disk:
  1. Click BROWSE FILE
  2. Select a JSON, JSONL, CSV, or Parquet file
  3. Dataset loads automatically with metadata, column analysis, and data preview

Supported Formats

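| Format  | Parser                 | Notes                                    |
|---------|------------------------|------------------------------------------|
| JSON    | Rust serde_json        | Array of objects                         |
| JSONL   | Rust serde_json        | One JSON object per line                 |
| CSV     | Rust CSV reader        | Comma-separated with headers             |
| Parquet | Apache Arrow + Parquet | Columnar binary format (most HF datasets) |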
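
To make the row shapes in the table above concrete, here is a minimal sketch of how the three text formats map to a list of row objects. It is written in Python with the standard library purely for illustration; ForgeAI's own parsers are Rust, and the function name `parse_rows` and the format labels are assumptions of this sketch. Parquet is omitted because it needs a third-party reader.

```python
import csv
import io
import json


def parse_rows(text, fmt):
    """Parse dataset text into a list of row dicts.

    `fmt` is one of "json", "jsonl", "csv" (hypothetical labels for this sketch).
    """
    if fmt == "json":
        rows = json.loads(text)
        if not isinstance(rows, list):
            raise ValueError("a JSON dataset is expected to be an array of objects")
        return rows
    if fmt == "jsonl":
        # One JSON object per line; blank lines are skipped.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    if fmt == "csv":
        # The first line supplies the column names; all values stay strings.
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {fmt}")
```

Note that CSV values arrive as strings, so a column of numbers in a CSV file may need type inference that JSON and JSONL get for free.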

Dataset Metadata

After loading, a metadata panel shows:
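| Field    | Description                              |
|----------|------------------------------------------|
| PATH     | Full file path                           |
| FORMAT   | Detected format (JSON/JSONL/CSV/PARQUET) |
| ROWS     | Total row count                          |
| SIZE     | File size                                |
| COLUMNS  | Number of columns                        |
| TEMPLATE | Auto-detected template (if applicable)   |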

Template Detection

ForgeAI auto-detects common dataset templates:
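| Template          | Description                    | Key Columns                |
|-------------------|--------------------------------|----------------------------|
| Alpaca            | Stanford Alpaca format         | instruction, input, output |
| ShareGPT          | Multi-turn conversations       | conversations              |
| ChatML            | Chat markup language           | messages                   |
| DPO               | Direct Preference Optimization | prompt, chosen, rejected   |
| Text              | Plain text                     | text                       |
| Prompt/Completion | OpenAI format                  | prompt, completion         |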
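
One plausible way to detect a template from the key columns listed above is to check whether every key column of a template is present in the dataset. The sketch below (Python, for illustration only) does exactly that. The check order, the `TEMPLATES` list, and the function name `detect_template` are assumptions, not ForgeAI's actual detection logic.

```python
# Key columns per template, taken from the table above. The check order is an
# assumption of this sketch: templates with more key columns are tested first.
TEMPLATES = [
    ("Alpaca", {"instruction", "input", "output"}),
    ("DPO", {"prompt", "chosen", "rejected"}),
    ("Prompt/Completion", {"prompt", "completion"}),
    ("ShareGPT", {"conversations"}),
    ("ChatML", {"messages"}),
    ("Text", {"text"}),
]


def detect_template(columns):
    """Return the first template whose key columns are all present, else None."""
    present = set(columns)
    for name, required in TEMPLATES:
        if required <= present:  # subset test: every key column exists
            return name
    return None
```

Extra columns do not prevent a match in this sketch, so a dataset with `id` and `text` columns would still be reported as Text.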

Column Analysis

Each column is analyzed and displayed:
| Metric     | Description                                        |
|------------|----------------------------------------------------|
| Name       | Column name                                        |
| Dtype      | Data type (STRING, INTEGER, FLOAT, OBJECT, NULL)   |
| Valid      | Count of non-null values                           |
| Null       | Count of null/empty values (highlighted if > 0)    |
| Avg Length | Average string length (for string columns)         |
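
The metrics above can be approximated in a few lines. The following Python sketch is illustrative only: treating `None` and the empty string as null, reporting mixed-type columns as OBJECT, and mapping booleans to OBJECT are all assumptions, since the page does not specify how ForgeAI handles those cases.

```python
def dtype_of(value):
    """Map a single Python value to one of the dtype labels used above."""
    if isinstance(value, str):
        return "STRING"
    if isinstance(value, bool):
        return "OBJECT"  # assumption: no boolean dtype is listed
    if isinstance(value, int):
        return "INTEGER"
    if isinstance(value, float):
        return "FLOAT"
    return "OBJECT"


def analyze_column(name, values):
    """Summarize one column: dtype, valid count, null count, average length."""
    # Assumption: None and "" both count as null/empty.
    valid = [v for v in values if v is not None and v != ""]
    kinds = {dtype_of(v) for v in valid}
    if not valid:
        dtype = "NULL"
    elif len(kinds) == 1:
        dtype = kinds.pop()
    else:
        dtype = "OBJECT"  # assumption: mixed types fall back to OBJECT
    avg_length = None
    if dtype == "STRING":
        avg_length = sum(len(v) for v in valid) / len(valid)
    return {
        "name": name,
        "dtype": dtype,
        "valid": len(valid),
        "null": len(values) - len(valid),
        "avg_length": avg_length,
    }
```

For example, `analyze_column("text", ["ab", "abcd", None, ""])` reports a STRING column with 2 valid values, 2 nulls, and an average length of 3.0.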

Data Preview

A scrollable table shows the first rows of the dataset. Long cell values are truncated with an ellipsis for readability.
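
As a rough illustration of the truncation behavior, the Python sketch below shortens long cell values and keeps only the first few rows. The names `truncate_cell` and `preview`, the 80-character limit, and the 5-row limit are hypothetical; the page does not state the actual thresholds.

```python
def truncate_cell(value, max_chars=80):
    """Render a cell as text, cutting long values and appending an ellipsis."""
    text = str(value)
    if len(text) <= max_chars:
        return text
    # Reserve one character for the ellipsis so the result fits max_chars.
    return text[: max_chars - 1] + "…"


def preview(rows, limit=5, max_chars=80):
    """Return the first `limit` rows with every cell truncated for display."""
    return [
        {key: truncate_cell(value, max_chars) for key, value in row.items()}
        for row in rows[:limit]
    ]
```

Truncation here applies only to the displayed text; the underlying dataset values are left untouched.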

Workflow

  1. Choose source: toggle between LOCAL and HUGGINGFACE mode
  2. Load dataset: browse a local file, or fetch and download from HuggingFace
  3. Review analysis: check metadata, template detection, and column analysis
  4. Use in training: the dataset path can be used directly in the Training module