Format Command¶
The format command applies formatters to existing datasets without regenerating them. This is useful when you want to transform already generated data into a different training format.
Usage¶
deepfabric format [OPTIONS] INPUT_FILE
Arguments¶
INPUT_FILE
- Path to the input JSONL dataset file to format
Options¶
-c, --config-file PATH
- YAML config file containing formatter settings
-f, --formatter [im_format|unsloth|alpaca|chatml|grpo]
- Quick formatter selection with default settings
-o, --output TEXT
- Output file path (default: input_file_formatter.jsonl)
--help
- Show help message
Examples¶
Using a specific formatter¶
Apply the im_format formatter with default settings:
deepfabric format dataset.jsonl -f im_format
This creates dataset_im_format.jsonl with the formatted output.
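The default output name follows the input_file_formatter.jsonl pattern listed under Options. A minimal Python sketch of that naming rule is shown here; the helper default_output_path is hypothetical (it is not part of DeepFabric), and it assumes the output is written next to the input file.

```python
from pathlib import Path


def default_output_path(input_file: str, formatter: str) -> str:
    """Hypothetical helper: build '<input stem>_<formatter>.jsonl' beside the input file."""
    path = Path(input_file)
    # Assumption: the formatter name is appended to the stem and the .jsonl suffix is kept.
    return str(path.with_name(f"{path.stem}_{formatter}.jsonl"))


print(default_output_path("dataset.jsonl", "im_format"))  # dataset_im_format.jsonl
```

Pass -o when you want a name that does not follow this pattern.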
Using a custom output path¶
Using a configuration file¶
For more control over formatter settings, pass a YAML configuration file with -c:
deepfabric format dataset.jsonl -c formatter_config.yaml
Example formatter_config.yaml:
dataset:
  formatters:
    - name: "im_format_training"
      template: "builtin://im_format.py"
      output: "formatted_output.jsonl"
      config:
        include_system: true
        system_message: "You are a helpful assistant."
        roles_map:
          user: "user"
          assistant: "assistant"
          system: "system"
Supported Formatters¶
im_format¶
Formats conversations using <|im_start|> and <|im_end|> delimiters, compatible with ChatML and similar formats.
Default configuration:
include_system: true
system_message: "You are a helpful assistant."
roles_map:
  user: "user"
  assistant: "assistant"
  system: "system"
Output example:
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is Python?<|im_end|>
<|im_start|>assistant
Python is a high-level programming language.<|im_end|>
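To make the layout above concrete, here is a rough Python sketch that renders a list of chat messages into the same delimited text. It is an illustration only, not DeepFabric's implementation: the function to_im_format is hypothetical, and the way include_system and system_message are applied (prepend the default system message when none is present, drop system messages when include_system is false) is an assumption.

```python
def to_im_format(messages, system_message="You are a helpful assistant.", include_system=True):
    """Render chat messages with <|im_start|>/<|im_end|> delimiters (illustrative only)."""
    msgs = list(messages)
    if include_system:
        # Assumption: add the default system message only if the conversation has none.
        if not any(m["role"] == "system" for m in msgs):
            msgs.insert(0, {"role": "system", "content": system_message})
    else:
        # Assumption: include_system=False removes system turns entirely.
        msgs = [m for m in msgs if m["role"] != "system"]
    return "\n".join(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in msgs)


conversation = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level programming language."},
]
print(to_im_format(conversation))
```

With the default arguments, the printed text matches the output example above.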
alpaca¶
Formats data for Alpaca-style instruction tuning.
Default configuration:
chatml¶
Formats data in ChatML format (structured or text).
Default configuration:
grpo¶
Formats data for GRPO (Group Relative Policy Optimization) training.
Default configuration:
reasoning_start_tag: "<start_working_out>"
reasoning_end_tag: "<end_working_out>"
solution_start_tag: "<SOLUTION>"
solution_end_tag: "</SOLUTION>"
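The four tags above delimit a reasoning section and a final solution. As a hedged illustration of how they might combine, the following Python sketch wraps a reasoning string and an answer in the default tags. The function to_grpo_completion is hypothetical, and the exact layout the real formatter emits (ordering, whitespace, surrounding prompt) is not specified here, so treat the concatenation as an assumption.

```python
def to_grpo_completion(
    reasoning,
    solution,
    reasoning_start_tag="<start_working_out>",
    reasoning_end_tag="<end_working_out>",
    solution_start_tag="<SOLUTION>",
    solution_end_tag="</SOLUTION>",
):
    """Hypothetical helper: wrap reasoning and solution in the configured tags."""
    # Assumption: reasoning comes first, immediately followed by the solution.
    return (
        f"{reasoning_start_tag}{reasoning}{reasoning_end_tag}"
        f"{solution_start_tag}{solution}{solution_end_tag}"
    )


print(to_grpo_completion("2 + 2 = 4", "4"))
# <start_working_out>2 + 2 = 4<end_working_out><SOLUTION>4</SOLUTION>
```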
Input Format¶
The command expects a JSONL file where each line is a JSON object. Supported formats include:
- Question-Answer format (question with answer or response)
- Messages format (messages field)
- Instruction format (instruction, input, and output fields)
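As a rough illustration of the three shapes, the Python sketch below builds one hypothetical record per format, serializes them as JSONL, and classifies each line by its keys. The record contents and the detect_format helper are made up for this example; the field names are the ones used on this page, and the detection order is an assumption rather than the command's actual logic.

```python
import json

# Hypothetical sample records, one per supported input shape.
records = [
    {"question": "What is Python?", "answer": "Python is a high-level programming language."},
    {
        "messages": [
            {"role": "user", "content": "What is Python?"},
            {"role": "assistant", "content": "Python is a high-level programming language."},
        ]
    },
    {
        "instruction": "Explain what Python is.",
        "input": "",
        "output": "Python is a high-level programming language.",
    },
]


def detect_format(record):
    """Guess the input shape from the keys present (illustrative only)."""
    if "messages" in record:
        return "messages"
    if "instruction" in record:
        return "instruction"
    if "question" in record and ("answer" in record or "response" in record):
        return "question-answer"
    return "unknown"


# JSONL: one JSON object per line.
jsonl_text = "\n".join(json.dumps(record) for record in records)
for line in jsonl_text.splitlines():
    print(detect_format(json.loads(line)))
```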
Working with HuggingFace Datasets¶
The format command also works with datasets downloaded from the HuggingFace Hub once they are available as JSONL. Many HF datasets already use compatible field layouts:
For datasets with a messages field (e.g., chat datasets):
# Download the dataset with the HuggingFace CLI
huggingface-cli download microsoft/orca-math-word-problems-200k --repo-type dataset
# Convert to JSONL if needed
python -c "
from datasets import load_dataset
ds = load_dataset('microsoft/orca-math-word-problems-200k')
ds['train'].to_json('orca_math.jsonl')
"
# Apply formatter
deepfabric format orca_math.jsonl -f im_format
For datasets with instruction format (e.g., Alpaca-style):
# Many HF datasets use instruction/input/output format
deepfabric format alpaca_dataset.jsonl -f im_format
Common HuggingFace dataset formats supported:
- OpenAI ChatML format (messages field)
- Alpaca format (instruction, input, output)
- ShareGPT format (conversations)
- Q&A format (question, answer or response)
Example conversion workflow:
# 1. Download from HuggingFace
huggingface-cli download tatsu-lab/alpaca --repo-type dataset
# 2. Convert to JSONL (if not already)
python convert_hf_to_jsonl.py
# 3. Apply multiple formatters
deepfabric format alpaca.jsonl -f im_format -o alpaca_chatml.jsonl
deepfabric format alpaca.jsonl -f grpo -o alpaca_grpo.jsonl
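The convert_hf_to_jsonl.py script in step 2 is not shown on this page. One possible version is sketched below in Python, assuming the datasets package is installed and that the train split of tatsu-lab/alpaca is the one you want. The function names write_jsonl and convert are hypothetical.

```python
import json
from pathlib import Path


def write_jsonl(rows, output_path):
    """Write an iterable of dict rows as one JSON object per line; return the row count."""
    count = 0
    with Path(output_path).open("w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row, ensure_ascii=False) + "\n")
            count += 1
    return count


def convert(dataset_name="tatsu-lab/alpaca", split="train", output_path="alpaca.jsonl"):
    """Download one split from the HuggingFace Hub and save it as JSONL."""
    # Imported lazily so write_jsonl can be used without the datasets package.
    from datasets import load_dataset

    rows = load_dataset(dataset_name, split=split)  # iterating a Dataset yields dicts
    return write_jsonl(rows, output_path)
```

Calling convert() downloads the dataset and writes alpaca.jsonl, which the commands in step 3 then take as input. The datasets library's own to_json method, used earlier on this page, is an equivalent shortcut.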
Workflow Example¶
- Generate a dataset.
- Apply different formatters to the same dataset.
This allows you to prepare the same dataset for different training frameworks without regenerating the data.
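The second step can be scripted when several formats are needed. The Python sketch below only builds the command lines, using the -f and -o options documented above. The helper build_format_commands, the input name dataset.jsonl, and the output naming are assumptions for illustration.

```python
import shlex
from pathlib import Path


def build_format_commands(dataset, formatters):
    """Return one `deepfabric format` argument list per formatter (illustrative only)."""
    stem = Path(dataset).stem
    return [
        ["deepfabric", "format", dataset, "-f", name, "-o", f"{stem}_{name}.jsonl"]
        for name in formatters
    ]


for argv in build_format_commands("dataset.jsonl", ["im_format", "alpaca", "grpo"]):
    # Print the command; pass argv to subprocess.run(argv, check=True) to execute it.
    print(shlex.join(argv))
```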