Format Command

The format command applies formatters to existing datasets without regenerating them. This is useful when you want to convert already-generated data into different training formats.

Usage

deepfabric format INPUT_FILE [OPTIONS]

Arguments

  • INPUT_FILE - Path to the input JSONL dataset file to format

Options

  • -c, --config-file PATH - YAML config file containing formatter settings
  • -f, --formatter [im_format|unsloth|alpaca|chatml|grpo] - Quick formatter selection with default settings
  • -o, --output TEXT - Output file path (default: the input file name with the formatter name appended, e.g. dataset.jsonl with -f im_format becomes dataset_im_format.jsonl)
  • --help - Show help message

Examples

Using a specific formatter

Apply the im_format formatter with default settings:

deepfabric format dataset.jsonl -f im_format

This creates dataset_im_format.jsonl with the formatted output.
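The default naming can be reproduced in a script that needs to locate the output afterwards. The sketch below is an assumption inferred from the single example above (dataset.jsonl becomes dataset_im_format.jsonl); `default_output_path` is a hypothetical helper, not part of deepfabric, and the real command may treat directories or other extensions differently.

```python
from pathlib import Path


def default_output_path(input_file: str, formatter: str) -> str:
    """Hypothetical helper mirroring the documented default output name."""
    path = Path(input_file)
    # Insert "_<formatter>" between the file stem and its extension.
    return str(path.with_name(f"{path.stem}_{formatter}{path.suffix}"))


print(default_output_path("dataset.jsonl", "im_format"))
```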

Using a custom output path

deepfabric format dataset.jsonl -f alpaca -o training_data.jsonl

Using a configuration file

For more control over formatter settings, use a YAML configuration file:

deepfabric format dataset.jsonl -c formatter_config.yaml

Example formatter_config.yaml:

dataset:
  formatters:
    - name: "im_format_training"
      template: "builtin://im_format.py"
      output: "formatted_output.jsonl"
      config:
        include_system: true
        system_message: "You are a helpful assistant."
        roles_map:
          user: "user"
          assistant: "assistant"
          system: "system"

Supported Formatters

im_format

Formats conversations using <|im_start|> and <|im_end|> delimiters, compatible with ChatML and similar formats.

Default configuration:

include_system: true
system_message: "You are a helpful assistant."
roles_map:
  user: "user"
  assistant: "assistant"
  system: "system"

Output example:

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is Python?<|im_end|>
<|im_start|>assistant
Python is a high-level programming language.<|im_end|>
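To make the layout concrete, here is a minimal sketch that renders a messages list into the text shown above. It is not deepfabric's implementation; `to_im_format` is a hypothetical function, and the handling of `include_system` (prepend the default system message only when the conversation has none) is an assumption.

```python
DEFAULT_ROLES = {"user": "user", "assistant": "assistant", "system": "system"}


def to_im_format(messages, include_system=True,
                 system_message="You are a helpful assistant.",
                 roles_map=None):
    """Hypothetical sketch: render messages with <|im_start|>/<|im_end|>."""
    roles_map = roles_map or DEFAULT_ROLES
    turns = list(messages)
    if include_system:
        # Assumption: add the default system turn only if none is present.
        if not any(m["role"] == "system" for m in turns):
            turns.insert(0, {"role": "system", "content": system_message})
    else:
        turns = [m for m in turns if m["role"] != "system"]
    return "\n".join(
        f"<|im_start|>{roles_map.get(m['role'], m['role'])}\n"
        f"{m['content']}<|im_end|>"
        for m in turns
    )


example = [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant",
     "content": "Python is a high-level programming language."},
]
print(to_im_format(example))
```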

alpaca

Formats data for Alpaca-style instruction tuning.

Default configuration:

instruction_template: "### Instruction:\n{instruction}\n\n### Response:"
include_empty_input: false
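As an illustration of how the default `instruction_template` expands, the sketch below fills the `{instruction}` placeholder with plain string formatting. `to_alpaca_prompt` is a hypothetical helper; how the formatter places the input and output fields, and what `include_empty_input` changes, is not shown here because the defaults above do not specify it.

```python
# The YAML escapes \n in the double-quoted template are real newlines.
INSTRUCTION_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:"


def to_alpaca_prompt(record, template=INSTRUCTION_TEMPLATE):
    """Hypothetical sketch: fill the instruction placeholder only."""
    return template.format(instruction=record["instruction"])


record = {"instruction": "Write a function to calculate factorial"}
print(to_alpaca_prompt(record))
```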

chatml

Formats data in ChatML format (structured or text).

Default configuration:

output_format: "text"
start_token: "<|im_start|>"
end_token: "<|im_end|>"
include_system: false

grpo

Formats data for GRPO (Group Relative Policy Optimization) training.

Default configuration:

reasoning_start_tag: "<start_working_out>"
reasoning_end_tag: "<end_working_out>"
solution_start_tag: "<SOLUTION>"
solution_end_tag: "</SOLUTION>"
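The four tags are meant to delimit a reasoning section and a final answer. A minimal sketch of how they might wrap a response follows; the ordering (reasoning block, then solution block, with no separator) and the `wrap_grpo` name are assumptions for illustration, not the formatter's confirmed output.

```python
GRPO_TAGS = {
    "reasoning_start_tag": "<start_working_out>",
    "reasoning_end_tag": "<end_working_out>",
    "solution_start_tag": "<SOLUTION>",
    "solution_end_tag": "</SOLUTION>",
}


def wrap_grpo(reasoning, solution, tags=GRPO_TAGS):
    """Hypothetical sketch: wrap reasoning and solution in the default tags."""
    return (
        f"{tags['reasoning_start_tag']}{reasoning}{tags['reasoning_end_tag']}"
        f"{tags['solution_start_tag']}{solution}{tags['solution_end_tag']}"
    )


print(wrap_grpo("5! = 5 * 4 * 3 * 2 * 1", "120"))
```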

Input Format

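The command expects a JSONL file where each line is a single JSON object. The examples below are pretty-printed for readability; in the file, each record occupies one line. Supported formats include: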

  1. Question-Answer format:

    {
      "question": "What is recursion?",
      "answer": "Recursion is a programming technique..."
    }
    

  2. Messages format:

    {
      "messages": [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"}
      ]
    }
    

  3. Instruction format:

    {
      "instruction": "Write a function to calculate factorial",
      "input": "n = 5",
      "output": "def factorial(n):..."
    }
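The three layouts above can be told apart by their keys. The sketch below shows one way to normalize any of them into a messages list before inspecting or validating a file; `to_messages` is a hypothetical helper, and joining instruction and input with a blank line is an assumption rather than deepfabric's documented behavior.

```python
import json


def to_messages(record):
    """Hypothetical sketch: normalize a record to a list of chat messages."""
    if "messages" in record:
        return record["messages"]
    if "question" in record and "answer" in record:
        return [
            {"role": "user", "content": record["question"]},
            {"role": "assistant", "content": record["answer"]},
        ]
    if "instruction" in record and "output" in record:
        prompt = record["instruction"]
        if record.get("input"):
            # Assumption: append the optional input after a blank line.
            prompt = f"{prompt}\n\n{record['input']}"
        return [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"]},
        ]
    raise ValueError(f"Unrecognized record keys: {sorted(record)}")


# One JSONL line is one JSON object.
line = '{"question": "What is recursion?", "answer": "A function calling itself."}'
print(to_messages(json.loads(line)))
```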
    

Working with HuggingFace Datasets

The format command also works with datasets downloaded from the HuggingFace Hub, provided they are saved as JSONL. Many HF datasets already use compatible field layouts:

For datasets with a messages field (e.g., chat datasets):

# Download dataset using datasets library or git
huggingface-cli download microsoft/orca-math-word-problems-200k --repo-type dataset

# Convert to JSONL if needed
python -c "
from datasets import load_dataset
ds = load_dataset('microsoft/orca-math-word-problems-200k')
ds['train'].to_json('orca_math.jsonl')
"

# Apply formatter
deepfabric format orca_math.jsonl -f im_format

For datasets in instruction format (e.g., Alpaca-style):

# Many HF datasets use instruction/input/output format
deepfabric format alpaca_dataset.jsonl -f im_format

Common HuggingFace dataset formats supported:

  • OpenAI ChatML format (messages field)
  • Alpaca format (instruction, input, output)
  • ShareGPT format (conversations)
  • Q&A format (question, answer or response)

Example conversion workflow:

# 1. Download from HuggingFace
huggingface-cli download tatsu-lab/alpaca --repo-type dataset

# 2. Convert to JSONL (if not already)
python convert_hf_to_jsonl.py

# 3. Apply multiple formatters
deepfabric format alpaca.jsonl -f im_format -o alpaca_chatml.jsonl
deepfabric format alpaca.jsonl -f grpo -o alpaca_grpo.jsonl
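The convert_hf_to_jsonl.py script in step 2 is not shown in this page. A minimal sketch of what such a script might contain is given below, using only the standard library; `write_jsonl` is a hypothetical helper that accepts any iterable of dict records.

```python
import json


def write_jsonl(records, path):
    """Hypothetical sketch: write one JSON object per line; return the count."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


rows = [
    {"instruction": "Add the numbers", "input": "2 and 3", "output": "5"},
    {"instruction": "Name a color", "input": "", "output": "Blue"},
]
print(write_jsonl(rows, "alpaca_sample.jsonl"))
```

With the datasets library installed, the same function could be fed a loaded split, for example `write_jsonl(load_dataset("tatsu-lab/alpaca", split="train"), "alpaca.jsonl")`, since iterating a split yields dict records. The `to_json` call shown earlier is the more direct route when the datasets library is already in use.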

Workflow Example

  1. Generate a dataset:

    deepfabric generate config.yaml
    

  2. Apply different formatters to the generated dataset (dataset_raw.jsonl in this example):

    # For ChatML training
    deepfabric format dataset_raw.jsonl -f im_format -o dataset_chatml.jsonl
    
    # For Alpaca training
    deepfabric format dataset_raw.jsonl -f alpaca -o dataset_alpaca.jsonl
    
    # For GRPO training
    deepfabric format dataset_raw.jsonl -f grpo -o dataset_grpo.jsonl
    

This allows you to prepare the same dataset for different training frameworks without regenerating the data.
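When the same dataset is formatted for several targets on a regular basis, the commands can be assembled in a small script. The sketch below only builds the argument lists, using the -f and -o options documented above; `build_format_commands` is a hypothetical helper, and actually running the commands assumes deepfabric is installed and on the PATH.

```python
def build_format_commands(input_file, targets):
    """Hypothetical sketch: one `deepfabric format` argv per formatter.

    `targets` maps a formatter name to the desired output path.
    """
    return [
        ["deepfabric", "format", input_file, "-f", formatter, "-o", output]
        for formatter, output in targets.items()
    ]


commands = build_format_commands(
    "dataset_raw.jsonl",
    {
        "im_format": "dataset_chatml.jsonl",
        "alpaca": "dataset_alpaca.jsonl",
        "grpo": "dataset_grpo.jsonl",
    },
)
for cmd in commands:
    print(" ".join(cmd))
```

Each list can then be passed to `subprocess.run(cmd, check=True)` so that a failing formatter stops the script instead of passing unnoticed.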