Formatter API Reference¶
This document provides a complete API reference for the DeepFabric formatter system.
Core Classes¶
BaseFormatter¶
Abstract base class for all formatters.
Constructor¶
Initialize the formatter with configuration.
Parameters:
- config (dict, optional): Configuration dictionary specific to this formatter
Abstract Methods¶
format()¶
Transform the dataset to the target format.
Parameters:
- dataset (List[Dict]): List of samples in DeepFabric's internal format
Returns:
- List[Dict]: List of samples in the formatter's target format
Raises:
- FormatterError: If formatting fails
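As a minimal sketch of the format() contract described above, the following defines a hypothetical formatter that maps question/answer samples to prompt/completion pairs. So that the snippet runs without DeepFabric installed, BaseFormatter and FormatterError are local stand-ins whose shape is assumed from this reference; in real use they would be imported from deepfabric.formatters.base. The class name PromptCompletionFormatter and the field names are illustrative only.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class FormatterError(Exception):
    """Stand-in for the documented exception (assumption: same name and role)."""


class BaseFormatter(ABC):
    """Stand-in for deepfabric.formatters.base.BaseFormatter (assumed shape)."""

    def __init__(self, config: Optional[Dict[str, Any]] = None):
        self.config = config or {}

    @abstractmethod
    def format(self, dataset: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        """Transform the dataset to the target format."""


class PromptCompletionFormatter(BaseFormatter):
    """Hypothetical formatter: question/answer -> prompt/completion."""

    def format(self, dataset: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        formatted = []
        for index, sample in enumerate(dataset):
            try:
                formatted.append(
                    {"prompt": sample["question"], "completion": sample["answer"]}
                )
            except KeyError as exc:
                # Report the failing sample index, as the documented errors do.
                raise FormatterError(
                    f"Failed to format sample {index}: missing field {exc}"
                ) from exc
        return formatted


rows = PromptCompletionFormatter().format([{"question": "2 + 2?", "answer": "4"}])
print(rows)
```

Raising FormatterError with the sample index keeps failures traceable to a specific input row.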
Virtual Methods¶
validate()¶
Validate that an entry meets the formatter's requirements.
Parameters:
- entry (Dict): A single dataset entry to validate
Returns:
- bool: True if the entry is valid, False otherwise
Default Implementation:
get_description()¶
Get a human-readable description of this formatter.
Returns:
- str: String description of what this formatter does
get_supported_formats()¶
Get list of input formats this formatter can handle.
Returns:
- List[str]: List of supported input format names
Default Implementation:
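The default bodies are not reproduced in this reference. Purely as an illustration of how a subclass might override the three virtual methods above, the sketch below uses a standalone class with the same method names so that it runs without DeepFabric installed; in real use the class would subclass BaseFormatter. The class name, the field names, and the choice to skip invalid entries are all assumptions.

```python
from typing import Any, Dict, List


class QAFormatter:
    """Hypothetical formatter; in real use this would subclass BaseFormatter."""

    def validate(self, entry: Dict[str, Any]) -> bool:
        # Accept only dict entries with non-empty string question and answer.
        if not isinstance(entry, dict):
            return False
        return all(
            isinstance(entry.get(key), str) and entry[key].strip()
            for key in ("question", "answer")
        )

    def get_description(self) -> str:
        return "Converts question/answer samples to prompt/completion pairs."

    def get_supported_formats(self) -> List[str]:
        return ["question_answer"]

    def format(self, dataset: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Design choice of this sketch: skip invalid entries instead of raising.
        return [
            {"prompt": entry["question"], "completion": entry["answer"]}
            for entry in dataset
            if self.validate(entry)
        ]


formatter = QAFormatter()
print(formatter.validate({"question": "2 + 2?", "answer": "4"}))
print(formatter.validate({"question": "2 + 2?"}))
```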
FormatterError¶
Exception raised when formatting operations fail.
from deepfabric.formatters.base import FormatterError
raise FormatterError("Detailed error message")
Inherits from Python's built-in Exception class.
Registry System¶
FormatterRegistry¶
Registry for managing formatter loading and instantiation.
Methods¶
load_formatter()¶
Load and instantiate a formatter from a template string.
Parameters:
- template (str): Template string like "builtin://grpo.py" or "file://./my_formatter.py"
- config (dict, optional): Configuration dictionary to pass to the formatter
Returns:
- BaseFormatter: Instantiated formatter instance
Raises:
- FormatterError: If the formatter cannot be loaded or instantiated
Example:
# Load built-in formatter
grpo = registry.load_formatter("builtin://grpo.py", {
    "reasoning_start_tag": "<think>",
    "reasoning_end_tag": "</think>"
})
# Load custom formatter
custom = registry.load_formatter("file://./my_formatter.py", {
    "custom_option": "value"
})
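The two template schemes shown above can be told apart with a small parser. The following is an illustrative sketch only, not the registry's actual implementation: parse_template is a hypothetical helper, and the assumption is that a template string is a scheme ("builtin" or "file") followed by "://" and a target.

```python
from typing import Tuple


def parse_template(template: str) -> Tuple[str, str]:
    """Split a template string into (scheme, target)."""
    scheme, separator, target = template.partition("://")
    if not separator or scheme not in ("builtin", "file") or not target:
        raise ValueError(f"Unsupported template string: {template!r}")
    return scheme, target


print(parse_template("builtin://grpo.py"))         # ('builtin', 'grpo.py')
print(parse_template("file://./my_formatter.py"))  # ('file', './my_formatter.py')
```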
list_builtin_formatters()¶
List all available built-in formatters.
Returns:
- List[str]: List of built-in formatter names
Example:
clear_cache()¶
Clear the formatter cache. Useful for development when formatters are being modified.
Dataset Integration¶
Dataset.apply_formatters()¶
Apply formatters to a dataset and return formatted datasets.
Parameters:
- formatter_configs (List[Dict]): List of formatter configuration dictionaries
Returns:
- Dict[str, Dataset]: Dictionary mapping formatter names to formatted Dataset instances
Raises:
- FormatterError: If any formatter fails to process the dataset
Configuration Format:
formatter_config = {
    "name": "grpo_math",
    "template": "builtin://grpo.py",
    "config": {
        "reasoning_start_tag": "<think>",
        "reasoning_end_tag": "</think>"
    },
    "output": "grpo_formatted.jsonl"  # Optional
}
Example:
from deepfabric.dataset import Dataset

dataset = Dataset.from_jsonl("input.jsonl")
formatter_configs = [
    {
        "name": "grpo",
        "template": "builtin://grpo.py",
        "config": {"validate_numerical": True},
        "output": "grpo_output.jsonl"
    },
    {
        "name": "alpaca",
        "template": "builtin://alpaca.py",
        "config": {"include_empty_input": False},
        "output": "alpaca_output.jsonl"
    }
]
formatted_datasets = dataset.apply_formatters(formatter_configs)
# Access formatted datasets
grpo_dataset = formatted_datasets["grpo"]
alpaca_dataset = formatted_datasets["alpaca"]
Dataset.list_available_formatters()¶
List all available built-in formatters.
Returns:
- List[str]: List of built-in formatter names
Configuration System¶
FormatterConfig¶
Pydantic model for formatter configuration.
from deepfabric.config import FormatterConfig
formatter_config = FormatterConfig(
    name="my_formatter",
    template="builtin://grpo.py",
    config={"option": "value"},
    output="output.jsonl"
)
Fields¶
- name (str, required): Unique identifier for this formatter instance
- template (str, required): Template path (builtin:// or file://)
- config (Dict[str, Any], optional): Formatter-specific configuration options
- output (str, optional): Output file path for this formatter
DeepFabricConfig.get_formatter_configs()¶
Get list of formatter configurations from the main configuration.
Returns:
- List[Dict]: List of formatter configuration dictionaries
Example:
from deepfabric.config import DeepFabricConfig
config = DeepFabricConfig.from_yaml("config.yaml")
formatter_configs = config.get_formatter_configs()
# Apply to dataset
dataset.apply_formatters(formatter_configs)
Built-in Formatters¶
GrpoFormatter¶
Template: builtin://grpo.py
Mathematical reasoning formatter with configurable reasoning and solution tags.
Configuration Options¶
config = {
    "reasoning_start_tag": str,  # Default: "<start_working_out>"
    "reasoning_end_tag": str,    # Default: "<end_working_out>"
    "solution_start_tag": str,   # Default: "<SOLUTION>"
    "solution_end_tag": str,     # Default: "</SOLUTION>"
    "system_prompt": str,        # Default: Auto-generated
    "validate_numerical": bool   # Default: True
}
Supported Input Formats¶
- messages: Chat format with system/user/assistant roles
- question_answer: Q&A format with optional reasoning
- chain_of_thought: Questions with reasoning traces
- generic: Any format with question/answer patterns
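To make the tag options concrete, here is a hedged sketch of how user-supplied tags could override the documented defaults and wrap a response. The helper wrap_grpo_response is hypothetical, and the exact layout GrpoFormatter emits (whitespace, ordering, system prompt handling) is not specified in this reference, so the layout below is an assumption.

```python
from typing import Any, Dict, Optional

# Tag defaults as listed in the configuration options above.
GRPO_DEFAULTS = {
    "reasoning_start_tag": "<start_working_out>",
    "reasoning_end_tag": "<end_working_out>",
    "solution_start_tag": "<SOLUTION>",
    "solution_end_tag": "</SOLUTION>",
}


def wrap_grpo_response(
    reasoning: str, solution: str, config: Optional[Dict[str, Any]] = None
) -> str:
    """Wrap reasoning and solution in tags; user config overrides defaults."""
    cfg = {**GRPO_DEFAULTS, **(config or {})}
    return (
        f"{cfg['reasoning_start_tag']}{reasoning}{cfg['reasoning_end_tag']}"
        f"{cfg['solution_start_tag']}{solution}{cfg['solution_end_tag']}"
    )


print(wrap_grpo_response("2 + 2 = 4", "4"))
print(
    wrap_grpo_response(
        "2 + 2 = 4",
        "4",
        {"reasoning_start_tag": "<think>", "reasoning_end_tag": "</think>"},
    )
)
```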
AlpacaFormatter¶
Template: builtin://alpaca.py
Instruction-following format for supervised fine-tuning.
Configuration Options¶
config = {
    "instruction_field": str,     # Default: "instruction"
    "input_field": str,           # Default: "input"
    "output_field": str,          # Default: "output"
    "include_empty_input": bool,  # Default: True
    "instruction_template": str   # Default: None
}
Supported Input Formats¶
- messages: Chat format
- instruction_output: Direct instruction/output format
- question_answer: Q&A format
- generic: Any instruction-like patterns
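As an illustration of the messages input format, the sketch below converts a chat sample into an Alpaca-style record. It is not the library's implementation: messages_to_alpaca is a hypothetical helper, and it assumes that include_empty_input controls whether an empty "input" key is emitted when the sample carries no separate input.

```python
from typing import Any, Dict


def messages_to_alpaca(
    sample: Dict[str, Any], include_empty_input: bool = True
) -> Dict[str, str]:
    """Map the first user and assistant messages to instruction and output."""

    def first_content(role: str) -> str:
        for message in sample.get("messages", []):
            if message.get("role") == role:
                return message.get("content", "")
        raise ValueError(f"No message with role {role!r} found")

    record = {"instruction": first_content("user")}
    if include_empty_input:
        record["input"] = ""  # assumption: empty string marks "no input"
    record["output"] = first_content("assistant")
    return record


chat = {
    "messages": [
        {"role": "user", "content": "Name a prime number."},
        {"role": "assistant", "content": "7"},
    ]
}
print(messages_to_alpaca(chat))
print(messages_to_alpaca(chat, include_empty_input=False))
```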
ChatmlFormatter¶
Template: builtin://chatml.py
Conversation format with ChatML markup.
Configuration Options¶
config = {
    "start_token": str,             # Default: "<|im_start|>"
    "end_token": str,               # Default: "<|im_end|>"
    "output_format": str,           # Default: "structured" ("structured" or "text")
    "default_system_message": str,  # Default: "You are a helpful assistant."
    "require_system_message": bool  # Default: False
}
Supported Input Formats¶
- messages: Direct chat format
- question_answer: Q&A pairs
- instruction_response: Instruction-following patterns
- generic: Any conversational patterns
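For orientation, the following sketch renders a message list as ChatML text using the documented default tokens. render_chatml is a hypothetical helper, the output of ChatmlFormatter in "text" mode may differ in detail, and the handling of require_system_message (prepending the default system message when none is present) is an assumption.

```python
from typing import Dict, List


def render_chatml(
    messages: List[Dict[str, str]],
    start_token: str = "<|im_start|>",
    end_token: str = "<|im_end|>",
    default_system_message: str = "You are a helpful assistant.",
    require_system_message: bool = False,
) -> str:
    """Render messages as ChatML text, one delimited block per message."""
    rendered = list(messages)
    has_system = any(m["role"] == "system" for m in rendered)
    if require_system_message and not has_system:
        # Assumption: a missing system message is filled with the default.
        rendered.insert(0, {"role": "system", "content": default_system_message})
    return "".join(
        f"{start_token}{m['role']}\n{m['content']}{end_token}\n" for m in rendered
    )


print(render_chatml([{"role": "user", "content": "Hi"}], require_system_message=True))
```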
Error Handling¶
FormatterError Exception¶
FormatterError is the primary exception class used for all formatter-related errors. It can include optional details for debugging.
class FormatterError(Exception):
    """Exception raised when formatting operations fail."""

    def __init__(self, message: str, details: dict | None = None):
        super().__init__(message)
        self.details = details or {}
This single exception type is raised for various failure scenarios:
- Loading errors: When a formatter cannot be loaded from a template
- Configuration errors: When formatter configuration is invalid
- Processing errors: When formatting operations fail
- Validation errors: When sample validation fails
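The optional details dictionary can carry structured context alongside the message. The sketch below redefines FormatterError locally with the same constructor as shown above (using Optional so it also runs on older Python versions); require_messages and the keys placed in details are hypothetical.

```python
from typing import Optional


class FormatterError(Exception):
    """Local copy of the documented exception, so the snippet is self-contained."""

    def __init__(self, message: str, details: Optional[dict] = None):
        super().__init__(message)
        self.details = details or {}


def require_messages(sample: dict, index: int) -> list:
    """Return sample['messages'] or raise with debugging details attached."""
    if "messages" not in sample:
        raise FormatterError(
            f"Failed to format sample {index}: Missing required field 'messages'",
            details={"index": index, "available_fields": sorted(sample)},
        )
    return sample["messages"]


try:
    require_messages({"question": "2 + 2?"}, index=5)
except FormatterError as exc:
    print(exc)          # human-readable message
    print(exc.details)  # structured context for logs or tooling
```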
Common Error Scenarios¶
Template Loading Errors¶
try:
    formatter = registry.load_formatter("builtin://nonexistent.py")
except FormatterError as e:
    print(f"Failed to load formatter: {e}")
    # Error: Built-in formatter 'nonexistent' not found
Configuration Errors¶
try:
    formatter = registry.load_formatter("builtin://grpo.py", {
        "invalid_option": "value"
    })
except FormatterError as e:
    print(f"Configuration error: {e}")
Processing Errors¶
try:
    formatted_data = formatter.format(invalid_dataset)
except FormatterError as e:
    print(f"Processing failed: {e}")
    # Error: Failed to format sample 5: Missing required field 'messages'
Performance Considerations¶
Caching¶
- Formatter classes are cached after first load
- Use registry.clear_cache() during development
- Consider memory usage with large formatter caches
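The caching behaviour listed above can be pictured with a small memoizing loader. This is a conceptual sketch, not FormatterRegistry's code: CachingLoader and fake_load are hypothetical, and the assumption is that the cache is keyed by template string.

```python
from typing import Callable, Dict, List


class CachingLoader:
    """Hypothetical loader that memoizes one loaded class per template string."""

    def __init__(self, load: Callable[[str], type]):
        self._load = load
        self._cache: Dict[str, type] = {}

    def get(self, template: str) -> type:
        if template not in self._cache:
            self._cache[template] = self._load(template)
        return self._cache[template]

    def clear_cache(self) -> None:
        self._cache.clear()


calls: List[str] = []


def fake_load(template: str) -> type:
    """Stand-in for reading a formatter file; records each real load."""
    calls.append(template)
    return type("LoadedFormatter", (), {})


loader = CachingLoader(fake_load)
first = loader.get("file://./my_formatter.py")
second = loader.get("file://./my_formatter.py")  # served from the cache
loader.clear_cache()
third = loader.get("file://./my_formatter.py")   # loaded again after clearing
print(len(calls))  # 2
```

Clearing the cache is what makes edits to a formatter file visible without restarting the process.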
Memory Usage¶
- Formatters process entire datasets in memory
- For large datasets, consider batch processing
- Custom formatters can implement streaming
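One way to approach batch processing is to feed fixed-size chunks through a formatting function and yield results lazily. The helper format_in_batches below is hypothetical and not part of DeepFabric; format_fn stands for any callable with the same list-in, list-out contract as a formatter's format() method.

```python
from itertools import islice
from typing import Any, Callable, Dict, Iterable, Iterator, List

Sample = Dict[str, Any]


def format_in_batches(
    samples: Iterable[Sample],
    format_fn: Callable[[List[Sample]], List[Sample]],
    batch_size: int = 1000,
) -> Iterator[Sample]:
    """Yield formatted samples while holding at most one batch in memory."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(samples)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield from format_fn(batch)


def uppercase_text(batch: List[Sample]) -> List[Sample]:
    """Toy formatting function used only for this demonstration."""
    return [{"text": sample["text"].upper()} for sample in batch]


source = ({"text": f"sample {i}"} for i in range(5))
result = list(format_in_batches(source, uppercase_text, batch_size=2))
print(len(result))  # 5
```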
Validation Overhead¶
- Input validation adds processing time
- Output validation can be disabled for performance
- Custom validators should be efficient
Type Definitions¶
Common Types¶
from typing import Dict, List, Any, Optional
# Dataset sample
Sample = Dict[str, Any]
# Dataset
Dataset = List[Sample]
# Formatter configuration
FormatterConfig = Dict[str, Any]
# Template string
Template = str # "builtin://name.py" or "file://path.py"
Configuration Schema¶
# Complete formatter configuration
{
    "name": str,               # Required: formatter instance name
    "template": str,           # Required: formatter template path
    "config": Dict[str, Any],  # Optional: formatter-specific config
    "output": Optional[str]    # Optional: output file path
}
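The schema above can be checked with plain Python before handing a configuration to a formatter. The function below is a hypothetical helper, not part of DeepFabric (which uses the FormatterConfig Pydantic model for this); the prefix check assumes only the builtin:// and file:// schemes named in this reference.

```python
from typing import Any, Dict


def validate_formatter_config(entry: Dict[str, Any]) -> Dict[str, Any]:
    """Check required and optional keys; return a normalized copy."""
    if not isinstance(entry, dict):
        raise TypeError("formatter configuration must be a dict")
    for key in ("name", "template"):
        if not isinstance(entry.get(key), str) or not entry[key]:
            raise ValueError(f"'{key}' is required and must be a non-empty string")
    if not entry["template"].startswith(("builtin://", "file://")):
        raise ValueError("'template' must start with builtin:// or file://")
    config = entry.get("config", {})
    if not isinstance(config, dict):
        raise ValueError("'config' must be a dict when provided")
    output = entry.get("output")
    if output is not None and not isinstance(output, str):
        raise ValueError("'output' must be a string when provided")
    return {
        "name": entry["name"],
        "template": entry["template"],
        "config": config,
        "output": output,
    }


normalized = validate_formatter_config(
    {"name": "grpo_math", "template": "builtin://grpo.py"}
)
print(normalized)
```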
Migration Guide¶
From Previous Versions¶
If you have existing formatter code, update it to use the new API:
# Old style (deprecated)
class OldFormatter:
    def transform(self, data):
        return data

# New style (recommended)
from typing import Any, Dict, List

from deepfabric.formatters.base import BaseFormatter

class NewFormatter(BaseFormatter):
    def format(self, dataset: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return dataset
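When rewriting an old formatter is not yet practical, a thin adapter can expose its transform() method under the new format() name. This is a sketch under assumptions: LegacyAdapter is hypothetical, and it is written as a standalone class so the snippet runs without DeepFabric installed; in real use it would subclass BaseFormatter.

```python
from typing import Any, Dict, List


class OldFormatter:
    """Example legacy formatter exposing the deprecated transform() method."""

    def transform(self, data: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        return [{"text": str(item.get("text", ""))} for item in data]


class LegacyAdapter:
    """Hypothetical adapter; in real use it would subclass BaseFormatter."""

    def __init__(self, legacy: Any):
        self._legacy = legacy

    def format(self, dataset: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
        # Delegate to the old method so existing logic is reused unchanged.
        return self._legacy.transform(dataset)


adapted = LegacyAdapter(OldFormatter())
print(adapted.format([{"text": "hello"}]))
```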
Configuration Updates¶
Update YAML configuration to use the new formatter section: