YAML Configuration Guide for Chain of Thought¶
This guide provides comprehensive documentation for configuring Chain of Thought datasets using YAML files. DeepFabric's YAML configuration system allows you to define topic generation, data generation parameters, and output settings in a declarative format.
Configuration Structure Overview¶
# High-level structure of CoT configuration
dataset_system_prompt: "System prompt for the assistant"
topic_tree: # OR topic_graph:
# Topic generation settings
data_engine:
# Data generation settings
# CoT-specific parameters
dataset:
creation:
# Generation control parameters
save_as: "output.jsonl"
huggingface: # Optional
# HuggingFace Hub upload settings
Complete Configuration Examples¶
Free-text CoT Configuration¶
# free-text-math-reasoning.yaml
# Optimized for mathematical word problems with natural language reasoning
dataset_system_prompt: "You are a helpful math tutor who explains problems step-by-step with clear reasoning."
# Topic generation using Tree structure (hierarchical)
topic_tree:
topic_prompt: "Elementary and middle school mathematics including arithmetic, basic algebra, geometry, and word problems"
# LLM settings for topic generation
provider: "openai"
model: "gpt-4o-mini"
temperature: 0.7 # Higher creativity for topic diversity
# Tree structure parameters
degree: 3 # 3 subtopics per node
depth: 2 # 2 levels deep
# Output
save_as: "math_topics.jsonl"
# Data generation engine
data_engine:
instructions: "Create clear mathematical word problems that require step-by-step analytical thinking to solve."
generation_system_prompt: "You are a mathematics educator creating practice problems with detailed reasoning explanations."
# LLM settings for data generation
provider: "openai"
model: "gpt-4o-mini"
temperature: 0.3 # Lower temperature for consistent reasoning
max_retries: 4
# Chain of Thought specific settings
conversation_type: "cot_freetext"
reasoning_style: "mathematical"
# Dataset assembly
dataset:
creation:
num_steps: 15 # Generate 15 examples
batch_size: 1 # Process one at a time
sys_msg: false # Free-text CoT doesn't use system messages
save_as: "math_reasoning_freetext.jsonl"
# Optional: Upload to HuggingFace Hub
# huggingface:
# repository: "username/math-reasoning-cot"
# tags: ["mathematics", "reasoning", "chain-of-thought", "deepfabric"]
Structured CoT Configuration¶
# structured-cs-education.yaml
# Optimized for computer science educational dialogues
dataset_system_prompt: "You are a computer science instructor who guides students through systematic problem-solving approaches."
# Topic generation using Graph structure (interconnected)
topic_graph:
topic_prompt: "Computer science fundamentals including algorithms, data structures, programming concepts, and computational thinking"
# LLM settings
provider: "openai"
model: "gpt-4o-mini"
temperature: 0.6
# Graph structure (more complex than tree)
degree: 2
depth: 3
# Output
save_as: "cs_topics_graph.json"
# Data generation with structured conversations
data_engine:
instructions: "Create educational conversations where students learn computer science concepts through guided discovery and systematic reasoning."
generation_system_prompt: "You are a CS instructor creating realistic teaching dialogues with explicit reasoning steps."
# LLM settings
provider: "openai"
model: "gpt-4o-mini" # Consider gpt-4o for more complex conversations
temperature: 0.4
max_retries: 3
# Structured CoT settings
conversation_type: "cot_structured"
reasoning_style: "logical"
# Dataset assembly
dataset:
creation:
num_steps: 8 # Fewer steps due to complexity
batch_size: 1
sys_msg: true # Include system messages in conversations
save_as: "cs_education_structured.jsonl"
# HuggingFace upload with rich metadata
huggingface:
repository: "deepfabric/cs-education-structured-cot"
tags:
- "computer-science"
- "education"
- "algorithms"
- "structured-reasoning"
- "chain-of-thought"
- "deepfabric"
description: "Educational computer science conversations with structured reasoning traces"
Hybrid CoT Configuration¶
# hybrid-scientific-reasoning.yaml
# Optimized for complex scientific problems requiring multi-modal reasoning
dataset_system_prompt: "You are a scientific expert who combines intuitive insights with rigorous systematic analysis."
# Topic generation for complex scientific domains
topic_tree:
topic_prompt: "Advanced scientific problems in physics, chemistry, biology, and interdisciplinary research requiring both intuitive understanding and systematic analysis"
# Higher-capability model for complex topics
provider: "openai"
model: "gpt-4o"
temperature: 0.5
# Moderate complexity
degree: 2
depth: 2
save_as: "scientific_topics.jsonl"
# Data generation requiring dual reasoning modes
data_engine:
instructions: "Create challenging scientific problems that require both intuitive conceptual understanding and systematic step-by-step analysis."
generation_system_prompt: "You are a research scientist who excels at combining natural scientific insight with rigorous methodical reasoning."
# Premium model recommended for hybrid reasoning
provider: "openai"
model: "gpt-4o" # Higher capability needed
temperature: 0.3
max_retries: 5 # More retries due to complexity
# Hybrid CoT settings
conversation_type: "cot_hybrid"
reasoning_style: "logical" # Could also be "general" for interdisciplinary
# Dataset assembly (smaller scale due to complexity)
dataset:
creation:
num_steps: 5 # Fewer due to high complexity and cost
batch_size: 1 # Always 1 for hybrid format
sys_msg: false
save_as: "scientific_reasoning_hybrid.jsonl"
Parameter Reference¶
Global Parameters¶
dataset_system_prompt
¶
- Type: String
- Required: Yes
- Description: System prompt that defines the assistant's role and behavior
- Best Practices:
- Keep concise but descriptive
- Match the reasoning style and domain
- Emphasize step-by-step thinking for CoT formats
# Examples by format
dataset_system_prompt: "You are a helpful teacher who shows step-by-step reasoning." # Free-text
dataset_system_prompt: "You are an instructor who guides students systematically." # Structured
dataset_system_prompt: "You are an expert who combines intuition with analysis." # Hybrid
Topic Generation Parameters¶
You can use either topic_tree
(hierarchical) or topic_graph
(interconnected):
topic_prompt
¶
- Type: String
- Required: Yes
- Description: Root prompt that defines the domain and scope of topics
- Best Practices:
- Be specific about the domain and complexity level
- Include examples of the types of problems desired
- Mention target audience if relevant
# Domain-specific examples
topic_prompt: "Elementary mathematics word problems suitable for grades 3-6"
topic_prompt: "Computer science algorithms and data structures for undergraduate level"
topic_prompt: "Advanced physics problems requiring multi-step problem solving"
provider
and model
¶
- Type: String
- Required: Yes
- Description: LLM provider and model for topic generation
- Options:
openai
:gpt-4o
,gpt-4o-mini
,gpt-4-turbo
,gpt-3.5-turbo
anthropic
:claude-3-opus
,claude-3-sonnet
,claude-3-haiku
gemini
:gemini-pro
,gemini-2.5-flash-lite
ollama
:mistral:latest
,llama3:latest
, etc.
# Recommended combinations
provider: "openai"
model: "gpt-4o-mini" # Good balance of quality and cost
provider: "openai"
model: "gpt-4o" # Higher quality for complex topics
provider: "anthropic"
model: "claude-3-sonnet" # Alternative high-quality option
temperature
¶
- Type: Float (0.0 - 2.0)
- Default: 0.7
- Description: Controls creativity in topic generation
- Recommendations:
0.5-0.7
: Balanced creativity and consistency0.7-1.0
: More diverse topics0.2-0.5
: More focused, consistent topics
degree
and depth
¶
- Type: Integer
- Description: Control topic tree/graph structure
- degree: Number of subtopics per node
- depth: Number of levels in the structure
# Small dataset (4-9 topics)
degree: 2
depth: 2
# Medium dataset (8-27 topics)
degree: 3
depth: 2
# Large dataset (16-64 topics)
degree: 4
depth: 3
Data Engine Parameters¶
Core Generation Settings¶
data_engine:
instructions: "High-level guidance for the type of problems to create"
generation_system_prompt: "Specific role definition for the data generation model"
provider: "openai"
model: "gpt-4o-mini"
temperature: 0.3 # Lower than topic generation
max_retries: 4
Chain of Thought Specific Settings¶
conversation_type
¶
- Type: Enum
- Required: Yes for CoT
- Options:
"cot_freetext"
: Natural language reasoning"cot_structured"
: Conversations with reasoning traces"cot_hybrid"
: Both natural and structured reasoning- Default:
"basic"
(non-CoT)
reasoning_style
¶
- Type: Enum
- Required: No
- Options:
"mathematical"
: Numerical calculations, algebraic reasoning"logical"
: Premise-conclusion chains, formal logic"general"
: Flexible reasoning adapting to problem type- Default:
"general"
# Style-specific configurations
conversation_type: "cot_freetext"
reasoning_style: "mathematical" # For math word problems
conversation_type: "cot_structured"
reasoning_style: "logical" # For CS algorithms
conversation_type: "cot_hybrid"
reasoning_style: "general" # For interdisciplinary problems
Dataset Assembly Parameters¶
dataset:
creation:
num_steps: 10 # Number of examples to generate
batch_size: 1 # Samples per batch (recommend 1 for CoT)
sys_msg: false # Whether to include system messages
save_as: "output.jsonl" # Output filename
num_steps
¶
- Type: Integer
- Description: Number of examples to generate
- Recommendations by format:
- Free-text CoT: 10-50 (efficient generation)
- Structured CoT: 5-20 (more complex)
- Hybrid CoT: 3-10 (most complex and expensive)
batch_size
¶
- Type: Integer
- Default: 1
- Description: Number of samples to generate per batch
- Recommendation: Always use 1 for CoT formats for best quality
sys_msg
¶
- Type: Boolean
- Description: Whether to include system messages in the final dataset
- Recommendations:
- Free-text CoT:
false
(no conversations) - Structured CoT:
true
(part of conversation) - Hybrid CoT:
false
(standalone reasoning)
HuggingFace Integration¶
huggingface:
repository: "username/dataset-name"
description: "Dataset description for the Hub"
tags:
- "chain-of-thought"
- "reasoning"
- "mathematics" # Domain-specific tags
- "deepfabric"
private: false # Set to true for private repos
license: "mit" # Optional license specification
Configuration Best Practices¶
Model Selection Guidelines¶
For Topic Generation¶
- Simple domains:
gpt-4o-mini
,claude-3-haiku
- Complex domains:
gpt-4o
,claude-3-sonnet
- Interdisciplinary:
gpt-4o
,claude-3-opus
For Data Generation¶
- Free-text CoT:
gpt-4o-mini
sufficient - Structured CoT:
gpt-4o-mini
orgpt-4o
- Hybrid CoT:
gpt-4o
recommended
Temperature Guidelines¶
# Conservative approach (more consistent)
topic_tree:
temperature: 0.6
data_engine:
temperature: 0.2
# Balanced approach (recommended)
topic_tree:
temperature: 0.7
data_engine:
temperature: 0.3
# Creative approach (more diverse)
topic_tree:
temperature: 0.8
data_engine:
temperature: 0.4
Cost Optimization¶
# Budget-friendly configuration
provider: "openai"
model: "gpt-4o-mini" # Most cost-effective
conversation_type: "cot_freetext" # Simplest format
num_steps: 10 # Moderate dataset size
batch_size: 1
temperature: 0.3 # Reduce need for retries
# Quality-focused configuration
provider: "openai"
model: "gpt-4o" # Highest quality
conversation_type: "cot_hybrid" # Most comprehensive
num_steps: 5 # Smaller but higher quality
temperature: 0.2 # Very consistent
max_retries: 5 # Ensure success
Environment Variables¶
Set required API keys:
# OpenAI
export OPENAI_API_KEY="sk-..."
# Anthropic
export ANTHROPIC_API_KEY="sk-ant-..."
# Google Gemini
export GOOGLE_API_KEY="..."
# HuggingFace (for uploads)
export HUGGINGFACE_API_TOKEN="hf_..."
Validation and Testing¶
Configuration Validation¶
Test Runs¶
Monitoring Generation¶
Common Configuration Patterns¶
Academic Research Dataset¶
# For research papers and academic use
dataset_system_prompt: "You are a research assistant who provides rigorous scientific reasoning."
topic_tree:
topic_prompt: "Advanced research problems in [domain] requiring systematic investigation"
model: "gpt-4o"
temperature: 0.5
data_engine:
conversation_type: "cot_hybrid"
reasoning_style: "logical"
model: "gpt-4o"
temperature: 0.2
dataset:
creation:
num_steps: 25
Educational Content¶
# For tutoring and educational applications
dataset_system_prompt: "You are a patient teacher who explains concepts clearly."
topic_tree:
topic_prompt: "Educational problems suitable for [grade level] students"
temperature: 0.7
data_engine:
conversation_type: "cot_structured"
reasoning_style: "general"
temperature: 0.4
dataset:
creation:
sys_msg: true # Include teaching context
Industry Training Data¶
# For commercial applications
dataset_system_prompt: "You are a professional expert who provides practical solutions."
topic_tree:
topic_prompt: "Real-world problems in [industry domain] requiring professional expertise"
data_engine:
conversation_type: "cot_freetext" # Efficient for large datasets
reasoning_style: "general"
dataset:
creation:
num_steps: 100 # Large-scale generation
Troubleshooting Configuration Issues¶
Common Errors¶
"Invalid conversation_type"¶
# Incorrect
conversation_type: "cot-freetext" # Underscore, not hyphen
# Correct
conversation_type: "cot_freetext"
"Provider API key not found"¶
"Schema validation failed"¶
# Ensure all required fields are present
dataset_system_prompt: "Required field"
topic_tree: # Either topic_tree OR topic_graph required
topic_prompt: "Required field"
data_engine:
conversation_type: "Required for CoT"
Performance Issues¶
Slow generation¶
- Reduce
num_steps
for testing - Use smaller models (
gpt-4o-mini
) - Increase
batch_size
(with caution for CoT)
Poor quality output¶
- Increase model capability (
gpt-4o
) - Lower
temperature
(0.2-0.3) - Adjust prompts for clarity
- Use appropriate
reasoning_style
High costs¶
- Use
gpt-4o-mini
instead ofgpt-4o
- Choose
cot_freetext
overcot_hybrid
- Reduce
num_steps
- Optimize
temperature
to reduce retries
Next Steps¶
- Python API Configuration: → Python API Guide
- Math Reasoning Tutorial: → Math Reasoning Tutorial
- Advanced Reasoning Styles: → Reasoning Styles Guide
- Schema Reference: → Schema Reference