Built-in Formatter Reference¶

DeepFabric includes several built-in formatters for popular training frameworks and methodologies. This document is a reference for all of them: configuration options, supported input formats, output formats, and example configurations.

Im Format Formatter¶

Template: builtin://im_format.py
Use Case: ChatML-compatible training with <|im_start|> and <|im_end|> delimiters

Description¶

The Im Format formatter transforms datasets into the format used by models that expect conversation delimiters with <|im_start|> and <|im_end|> tokens. This format is widely used for chat models and is compatible with ChatML and similar conversation formats.

Configuration Options¶

config:
  include_system: true                       # Default: false
  system_message: "Custom system message"    # Default: None
  roles_map:                                # Default: shown below
    user: "user"
    assistant: "assistant"
    system: "system"

Input Formats Supported¶

Messages: Chat format with role/content pairs
Q&A: Question and answer fields
Instruction: Instruction/input/output format
Direct: User/assistant fields
Generic: Any format with extractable conversation patterns

Output Format¶

<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
What is Python?<|im_end|>
<|im_start|>assistant
Python is a high-level, interpreted programming language known for its simplicity and readability.<|im_end|>
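The layout above can be reproduced with a few lines of Python. This is a minimal sketch, not the formatter's actual implementation: the function name render_im_format is hypothetical, and the way roles_map is applied (rename known roles, pass unknown ones through) is an assumption.

```python
from typing import Dict, List, Optional


def render_im_format(
    messages: List[Dict[str, str]],
    roles_map: Optional[Dict[str, str]] = None,
) -> str:
    """Render role/content pairs as <|im_start|>/<|im_end|> delimited text."""
    roles_map = roles_map or {}
    turns = []
    for message in messages:
        # Assumption: roles_map renames roles; unmapped roles pass through.
        role = roles_map.get(message["role"], message["role"])
        turns.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>")
    return "\n".join(turns)


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is Python?"},
]
rendered = render_im_format(messages)
print(rendered)
```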

Example Configuration¶

formatters:
- name: "chatml_training"
  template: "builtin://im_format.py"
  config:
    include_system: true
    system_message: |
      You are an expert programming assistant.
      Provide clear, accurate, and practical answers.
    roles_map:
      user: "user"
      assistant: "assistant"
      system: "system"
  output: "chatml_dataset.jsonl"

Unsloth Formatter¶

Template: builtin://unsloth.py
Use Case: Training with the Unsloth framework using the conversations format

Description¶

The Unsloth formatter transforms datasets into the conversations format expected by Unsloth training notebooks. This enables seamless integration with Unsloth's training pipeline and chat templates.

Configuration Options¶

config:
  include_system: false                      # Default: false
  system_message: "Custom system message"    # Default: None
  roles_map:                                # Default: shown below
    user: "user"
    assistant: "assistant"
    system: "system"

Input Formats Supported¶

Messages: Chat format with role/content pairs
Q&A: Question and answer fields
Instruction: Instruction/input/output format
Direct: User/assistant fields
Generic: Any format with extractable conversation patterns

Output Format¶

{
  "conversations": [
    {"role": "user", "content": "What is Python?"},
    {"role": "assistant", "content": "Python is a high-level, interpreted programming language known for its simplicity and readability."}
  ]
}
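As an illustration of the record shape above, the following sketch builds a conversations record from a question/answer pair and serializes it as one JSONL line. The helper name to_conversations is hypothetical, and the handling of the optional system message is an assumption rather than the formatter's documented behavior.

```python
import json
from typing import Dict, List, Optional


def to_conversations(
    question: str,
    answer: str,
    system_message: Optional[str] = None,
) -> Dict[str, List[Dict[str, str]]]:
    """Build a {"conversations": [...]} record from a Q&A pair."""
    turns = []
    # Assumption: a system turn is only emitted when a message is supplied.
    if system_message is not None:
        turns.append({"role": "system", "content": system_message})
    turns.append({"role": "user", "content": question})
    turns.append({"role": "assistant", "content": answer})
    return {"conversations": turns}


record = to_conversations("What is Python?", "Python is a high-level language.")
jsonl_line = json.dumps(record)  # one record per line in a .jsonl file
print(jsonl_line)
```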

Example Configuration¶

formatters:
- name: "unsloth_training"
  template: "builtin://unsloth.py"
  config:
    include_system: false  # Unsloth applies system messages via chat templates
    roles_map:
      user: "user"
      assistant: "assistant"
  output: "unsloth_dataset.jsonl"

Integration with Unsloth Notebooks¶

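After formatting a dataset with this formatter and uploading it to the HuggingFace Hub, you can use it directly in Unsloth notebooks:

from datasets import load_dataset

# Replace the notebook's default dataset with your uploaded one
dataset = load_dataset("your-username/your-dataset", split="train")

# The rest of the notebook works unchanged; standardize_data_formats and
# formatting_prompts_func are the helpers the notebook already imports or defines
dataset = standardize_data_formats(dataset)
dataset = dataset.map(formatting_prompts_func, batched=True)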

GRPO Formatter¶

Template: builtin://grpo.py
Use Case: Mathematical reasoning model training with GRPO (Group Relative Policy Optimization)

Description¶

The GRPO formatter transforms datasets for mathematical reasoning training, wrapping reasoning processes in configurable tags and ensuring numerical answers are extractable for reward functions.

Configuration Options¶

config:
  reasoning_start_tag: "<start_working_out>"  # Default: "<start_working_out>"
  reasoning_end_tag: "<end_working_out>"      # Default: "<end_working_out>"
  solution_start_tag: "<SOLUTION>"            # Default: "<SOLUTION>"
  solution_end_tag: "</SOLUTION>"             # Default: "</SOLUTION>"
  system_prompt: "Custom system prompt..."    # Default: Auto-generated
  validate_numerical: true                    # Default: true

Input Formats Supported¶

Messages: Chat format with system/user/assistant roles
Q&A: Question and answer fields with optional reasoning
Chain of Thought: Questions with reasoning traces
Generic: Any format with identifiable question/answer patterns

Output Format¶

{
  "messages": [
    {
      "role": "system",
      "content": "You are given a problem. Think about the problem and provide your working out. Place it between <start_working_out> and <end_working_out>. Then, provide your solution between <SOLUTION> and </SOLUTION>."
    },
    {
      "role": "user",
      "content": "What is 2 + 2?"
    },
    {
      "role": "assistant",
      "content": "<start_working_out>I need to add 2 and 2. This is basic addition.<end_working_out><SOLUTION>4</SOLUTION>"
    }
  ]
}
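To show why the solution tags matter, here is a sketch of how a reward function might pull the numerical answer out of an assistant message in this format. The function name extract_solution and the regular expression are illustrative assumptions, not part of DeepFabric.

```python
import re
from typing import Optional


def extract_solution(
    text: str,
    start_tag: str = "<SOLUTION>",
    end_tag: str = "</SOLUTION>",
) -> Optional[float]:
    """Return the number between the solution tags, or None if absent."""
    pattern = (
        re.escape(start_tag)
        + r"\s*(-?\d+(?:\.\d+)?)\s*"  # integer or decimal, optional sign
        + re.escape(end_tag)
    )
    match = re.search(pattern, text)
    return float(match.group(1)) if match else None


completion = (
    "<start_working_out>I need to add 2 and 2. This is basic addition."
    "<end_working_out><SOLUTION>4</SOLUTION>"
)
answer = extract_solution(completion)
print(answer)
```

The tags are parameters so the same helper works with custom tags such as <answer> and </answer>.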

Example Configuration¶

formatters:
- name: "grpo_math"
  template: "builtin://grpo.py"
  config:
    reasoning_start_tag: "<think>"
    reasoning_end_tag: "</think>"
    solution_start_tag: "<answer>"
    solution_end_tag: "</answer>"
    validate_numerical: true
  output: "grpo_dataset.jsonl"

Alpaca Formatter¶

Template: builtin://alpaca.py
Use Case: Instruction-following fine-tuning with the Stanford Alpaca format

Description¶

The Alpaca formatter transforms datasets into the standard instruction-following format used by Stanford Alpaca and many other instruction-tuning projects.

Configuration Options¶

config:
  instruction_field: "instruction"           # Default: "instruction"
  input_field: "input"                      # Default: "input"
  output_field: "output"                    # Default: "output"
  include_empty_input: true                 # Default: true
  instruction_template: "Custom template"   # Default: None

Input Formats Supported¶

Messages: Chat format (system → instruction, user → input, assistant → output)
Direct: Already has instruction/input/output fields
Q&A: Question/answer pairs with optional context
Generic: Any format with instruction-like patterns

Output Format¶

{
  "instruction": "Solve this math problem:",
  "input": "What is 15 + 27?",
  "output": "To solve 15 + 27, I'll add the numbers: 15 + 27 = 42"
}
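The role mapping listed for the Messages input (system → instruction, user → input, assistant → output) can be sketched as follows. The helper name messages_to_alpaca is hypothetical, and dropping the input key when include_empty_input is false is an assumption about what that option does.

```python
from typing import Dict, List


def messages_to_alpaca(
    messages: List[Dict[str, str]],
    include_empty_input: bool = True,
) -> Dict[str, str]:
    """Map a single-turn chat sample onto instruction/input/output fields."""
    # If a role appears more than once, the last message for that role wins.
    by_role = {m["role"]: m["content"] for m in messages}
    record = {
        "instruction": by_role.get("system", ""),
        "input": by_role.get("user", ""),
        "output": by_role.get("assistant", ""),
    }
    # Assumption: include_empty_input=False omits an empty "input" field.
    if not include_empty_input and not record["input"]:
        del record["input"]
    return record


sample = [
    {"role": "system", "content": "Solve this math problem:"},
    {"role": "user", "content": "What is 15 + 27?"},
    {"role": "assistant", "content": "15 + 27 = 42"},
]
alpaca_record = messages_to_alpaca(sample)
print(alpaca_record)
```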

Example Configuration¶

formatters:
- name: "alpaca_instruct"
  template: "builtin://alpaca.py"
  config:
    instruction_template: "### Instruction:\n{instruction}\n\n### Response:"
    include_empty_input: false
  output: "alpaca_dataset.jsonl"

Harmony Formatter¶

Template: builtin://harmony.py
Use Case: OpenAI Harmony format for gpt-oss models with channels and TypeScript-style tool definitions

Description¶

The Harmony formatter transforms datasets into the OpenAI Harmony Response Format, which is designed for the gpt-oss open-source models. It features a five-level role hierarchy, channel-based message organization (final, analysis, commentary), and TypeScript-style function definitions for tool calling.

Configuration Options¶

config:
  start_token: "<|start|>"                      # Default: "<|start|>"
  end_token: "<|end|>"                          # Default: "<|end|>"
  message_token: "<|message|>"                  # Default: "<|message|>"
  output_format: "text"                         # Default: "text" (or "structured")
  default_channel: "final"                      # Default: "final" (analysis/commentary/final)
  include_developer_role: false                 # Default: false
  developer_instructions: "Custom instructions" # Default: None
  system_message: "You are ChatGPT..."         # Default: "You are ChatGPT, a large language model trained by OpenAI."
  reasoning_level: "high"                       # Default: "high" (none/low/medium/high)
  knowledge_cutoff: "2024-01"                  # Default: "2024-01"
  current_date: "2024-03-15"                   # Default: None (optional, for deterministic output)
  include_metadata: true                        # Default: true
  tool_namespace: "functions"                   # Default: "functions"

Role Hierarchy¶

The Harmony format enforces a strict role hierarchy (highest to lowest priority):

1. system - System instructions and metadata
2. developer - Developer instructions and tool definitions
3. user - User messages
4. assistant - Model responses with channel support
5. tool - Tool responses

Channels¶

Assistant messages can be assigned to different channels:

- final: User-facing responses (default)
- analysis: Internal chain-of-thought reasoning (not safe for user display)
- commentary: Function tool calls and preambles

Input Formats Supported¶

Messages: Chat format with role/content pairs and optional tool calls
Q&A: Question/answer pairs with optional chain_of_thought
Instruction: Instruction/output patterns
Generic: Any format with extractable conversation patterns

Output Formats¶

Text Format (output_format: "text"):

<|start|>system<|message|>
You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-01
Current date: 2024-03-15
Reasoning: high
# Valid channels: analysis, commentary, final
<|end|>
<|start|>developer<|message|>
# Instructions
Always provide detailed explanations

# Tools
namespace functions {
  type get_weather = (_: { location: string, unit?: "celsius" | "fahrenheit" }) => any;
}
<|end|>
<|start|>user<|message|>
What's the weather in London?
<|end|>
<|start|>assistant<|channel|>analysis<|message|>
I need to check the weather in London using the weather tool.
<|end|>
<|start|>assistant<|channel|>commentary<|recipient|>functions.get_weather<|message|>
{"location": "London", "unit": "celsius"}
<|end|>
<|start|>tool<|message|>
{"temperature": 18, "condition": "cloudy"}
<|end|>
<|start|>assistant<|channel|>final<|message|>
The weather in London is currently 18°C with cloudy conditions.
<|end|>

Structured Format (output_format: "structured"):

{
  "messages": [
    {
      "role": "system",
      "content": "You are ChatGPT, a large language model...\nKnowledge cutoff: 2024-01\nReasoning: high\n# Valid channels: analysis, commentary, final",
      "channel": null,
      "recipient": null
    },
    {
      "role": "developer",
      "content": "# Instructions\nAlways provide detailed explanations\n\n# Tools\nnamespace functions {\n  type get_weather = (_: { location: string, unit?: \"celsius\" | \"fahrenheit\" }) => any;\n}",
      "channel": null,
      "recipient": null
    },
    {
      "role": "user",
      "content": "What's the weather in London?",
      "channel": null,
      "recipient": null
    },
    {
      "role": "assistant",
      "content": "I need to check the weather in London using the weather tool.",
      "channel": "analysis",
      "recipient": null
    },
    {
      "role": "assistant",
      "content": "{\"location\": \"London\", \"unit\": \"celsius\"}",
      "channel": "commentary",
      "recipient": "functions.get_weather"
    },
    {
      "role": "tool",
      "content": "{\"temperature\": 18, \"condition\": \"cloudy\"}",
      "channel": null,
      "recipient": null
    },
    {
      "role": "assistant",
      "content": "The weather in London is currently 18°C with cloudy conditions.",
      "channel": "final",
      "recipient": null
    }
  ]
}
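The two output formats carry the same information, so one can be derived from the other. The sketch below renders structured messages into the text layout used in this section's text example. The function name render_harmony is hypothetical, and the token order (role, then channel, then recipient) is inferred from that example rather than taken from the formatter's source.

```python
from typing import Dict, List, Optional


def render_harmony(messages: List[Dict[str, Optional[str]]]) -> str:
    """Render structured Harmony messages as delimited text."""
    rendered = []
    for message in messages:
        header = f"<|start|>{message['role']}"
        # Channel and recipient markers are only emitted when set (not None).
        if message.get("channel"):
            header += f"<|channel|>{message['channel']}"
        if message.get("recipient"):
            header += f"<|recipient|>{message['recipient']}"
        rendered.append(f"{header}<|message|>\n{message['content']}\n<|end|>")
    return "\n".join(rendered)


structured = [
    {"role": "user", "content": "What's the weather in London?",
     "channel": None, "recipient": None},
    {"role": "assistant", "content": '{"location": "London", "unit": "celsius"}',
     "channel": "commentary", "recipient": "functions.get_weather"},
]
harmony_text = render_harmony(structured)
print(harmony_text)
```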

Tool Definitions¶

Tools are defined using TypeScript-style type syntax in the developer message:

namespace functions {
  type calculator = (_: {
    operation: "add" | "subtract" | "multiply" | "divide",
    a: number,
    b: number
  }) => any;

  type web_search = (_: {
    query: string,
    limit?: number
  }) => any;
}
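A namespace block like the one above can be generated from a simple tool description. This is a sketch under stated assumptions: the input shape (a mapping from tool name to a list of (parameter, TypeScript type, required) tuples) and the function names are hypothetical, and it emits each tool on one line, as in the text output example.

```python
from typing import Dict, List, Tuple

Param = Tuple[str, str, bool]  # (name, TypeScript type, required)


def render_tool(name: str, params: List[Param]) -> str:
    """Render one tool as a TypeScript-style type alias."""
    fields = ", ".join(
        f"{pname}{'' if required else '?'}: {ts_type}"
        for pname, ts_type, required in params
    )
    return f"  type {name} = (_: {{ {fields} }}) => any;"


def render_namespace(namespace: str, tools: Dict[str, List[Param]]) -> str:
    """Wrap all named tools in a namespace block, skipping unnamed ones."""
    body = "\n".join(render_tool(n, p) for n, p in tools.items() if n)
    return f"namespace {namespace} {{\n{body}\n}}"


tools = {
    "get_weather": [
        ("location", "string", True),
        ("unit", '"celsius" | "fahrenheit"', False),
    ],
}
namespace_text = render_namespace("functions", tools)
print(namespace_text)
```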

Example Configurations¶

Basic Chat Configuration:

formatters:
- name: "harmony_chat"
  template: "builtin://harmony.py"
  config:
    output_format: "text"
    default_channel: "final"
    include_metadata: true
  output: "harmony_chat.jsonl"

Advanced Configuration with Tools:

formatters:
- name: "harmony_tools"
  template: "builtin://harmony.py"
  config:
    output_format: "text"
    include_developer_role: true
    developer_instructions: |
      You are an expert assistant with access to various tools.
      Always think through your approach before using tools.
    reasoning_level: "high"
    default_channel: "final"
    tool_namespace: "functions"
    current_date: "2024-03-15"  # For deterministic output
  output: "harmony_tools.jsonl"

Chain-of-Thought Configuration:

formatters:
- name: "harmony_cot"
  template: "builtin://harmony.py"
  config:
    output_format: "structured"
    default_channel: "analysis"  # Default to analysis channel for reasoning
    reasoning_level: "high"
    include_metadata: true
  output: "harmony_cot.jsonl"

Special Features¶

Multiple Tool Calls: Handles multiple tool calls in a single message by creating separate messages for each tool call
Deterministic Output: Use current_date config to ensure reproducible outputs (no dynamic timestamps)
Tool Name Validation: Skips tools without names to prevent namespace conflicts
Flexible Channels: Automatically assigns channels based on message content (reasoning → analysis, tool calls → commentary)
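The multiple-tool-call behavior can be pictured with the sketch below, which expands one assistant message carrying several tool calls into one commentary-channel message per call. The input shape (a tool_calls list of name/arguments objects) and the helper name split_tool_calls are assumptions made for illustration.

```python
import json
from typing import Any, Dict, List


def split_tool_calls(
    message: Dict[str, Any],
    namespace: str = "functions",
) -> List[Dict[str, Any]]:
    """Emit one commentary-channel message per tool call."""
    expanded = []
    for call in message.get("tool_calls", []):
        expanded.append({
            "role": "assistant",
            "content": json.dumps(call["arguments"]),
            "channel": "commentary",
            "recipient": f"{namespace}.{call['name']}",
        })
    return expanded


assistant_message = {
    "role": "assistant",
    "tool_calls": [
        {"name": "get_weather", "arguments": {"location": "London"}},
        {"name": "web_search", "arguments": {"query": "London events"}},
    ],
}
tool_messages = split_tool_calls(assistant_message)
for tool_message in tool_messages:
    print(tool_message["recipient"], tool_message["content"])
```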

ChatML Formatter¶

Template: builtin://chatml.py
Use Case: Conversation format with clear role delineation using ChatML markup

Description¶

The ChatML formatter creates standardized conversation formats with special tokens for role boundaries, compatible with many modern chat-based training frameworks.

Configuration Options¶

config:
  start_token: "<|im_start|>"                    # Default: "<|im_start|>"
  end_token: "<|im_end|>"                        # Default: "<|im_end|>"
  output_format: "structured"                    # Default: "structured" (or "text")
  default_system_message: "You are helpful..."   # Default: "You are a helpful assistant."
  require_system_message: false                  # Default: false

Input Formats Supported¶

Messages: Direct chat format
Q&A: Question/answer pairs
Instruction-Response: Instruction-following patterns
Generic: Any conversational patterns

Output Formats¶

Structured Format (output_format: "structured"):

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there! How can I help you today?"}
  ]
}

Text Format (output_format: "text"):

{
  "text": "<|im_start|>system\nYou are a helpful assistant.\n<|im_end|>\n<|im_start|>user\nHello!\n<|im_end|>\n<|im_start|>assistant\nHi there! How can I help you today?\n<|im_end|>"
}
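The text and structured formats are two views of the same conversation. As a sanity check on exported data, the sketch below parses a text-format string back into role/content messages. The regular expression and the name parse_chatml are illustrative assumptions; they expect the default <|im_start|> and <|im_end|> tokens.

```python
import re
from typing import Dict, List

# One turn: start token, role, newline, content, optional newline, end token.
CHATML_TURN = re.compile(r"<\|im_start\|>(\w+)\n(.*?)\n?<\|im_end\|>", re.DOTALL)


def parse_chatml(text: str) -> List[Dict[str, str]]:
    """Recover role/content messages from ChatML text."""
    return [
        {"role": role, "content": content}
        for role, content in CHATML_TURN.findall(text)
    ]


chatml_text = (
    "<|im_start|>system\nYou are a helpful assistant.\n<|im_end|>\n"
    "<|im_start|>user\nHello!\n<|im_end|>"
)
parsed = parse_chatml(chatml_text)
print(parsed)
```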

Example Configuration¶

formatters:
- name: "chatml_chat"
  template: "builtin://chatml.py"
  config:
    output_format: "text"
    require_system_message: true
    default_system_message: "You are a helpful AI assistant specialized in mathematics."
  output: "chatml_dataset.jsonl"

Choosing the Right Formatter¶

For Mathematical Reasoning Training¶

GRPO: When training models to show step-by-step reasoning with extractable answers
Harmony: For models that need to show internal reasoning (analysis channel) separate from final answers
Alpaca: For instruction-following with math problems
ChatML: For conversational math tutoring scenarios

For General Instruction Following¶

Alpaca: Standard instruction-following format
ChatML: When you need conversation context and role clarity
Harmony: For gpt-oss models with developer instructions and role hierarchy
Unsloth: When using Unsloth training notebooks with conversations format

For Chat and Dialogue¶

Harmony: Advanced format with channels, tool support, and role hierarchy for gpt-oss models
ChatML: Structured conversations with multiple turns
Im Format: ChatML-compatible format with <|im_start|>/<|im_end|> delimiters
Unsloth: Conversations format for Unsloth framework integration
Alpaca: Single-turn instruction-response pairs

For Tool/Function Calling¶

Harmony: TypeScript-style function definitions with channels for tool calls and responses
Custom formatters: For specific tool calling conventions

For Custom Requirements¶

Create a custom formatter that inherits from BaseFormatter.

Validation and Error Handling¶

All built-in formatters include:

Input Validation: Checks if the input data is compatible
Output Validation: Ensures the formatted output meets requirements
Error Messages: Clear error descriptions for debugging
Graceful Degradation: Handles edge cases without crashing
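To make the idea of input validation concrete, here is a sketch of a check for the Messages input format that collects readable error strings instead of raising. It is an illustration only: validate_messages is a hypothetical helper, and the actual checks performed by the built-in formatters may differ.

```python
from typing import Any, List


def validate_messages(sample: Any) -> List[str]:
    """Return a list of problems found in a chat-format sample."""
    messages = sample.get("messages") if isinstance(sample, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["sample must contain a non-empty 'messages' list"]
    errors = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            errors.append(f"messages[{index}] is not an object")
            continue
        for key in ("role", "content"):
            if not isinstance(message.get(key), str):
                errors.append(f"messages[{index}] is missing a string '{key}'")
    return errors


good = {"messages": [{"role": "user", "content": "hi"}]}
bad = {"messages": [{"role": "user"}]}
print(validate_messages(good))
print(validate_messages(bad))
```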

Performance Notes¶

Built-in formatters are optimized for both speed and memory efficiency
Large datasets are processed in streaming fashion when possible
Validation can be disabled for better performance in production
Formatter instances are cached for repeated use