Dataset Generation¶
Dataset generation transforms topic structures into practical training examples through configurable templates, quality control mechanisms, and batch processing systems. This stage converts abstract topic concepts into concrete conversational data ready for language model training or evaluation.
The generation process operates independently from topic creation, allowing experimentation with different generation parameters while maintaining consistent topic coverage. This separation enables iterative refinement of output quality without regenerating the underlying topic structure.
Generation Pipeline¶
Dataset creation follows a systematic pipeline that ensures consistency and quality across all generated examples:
Topic Selection chooses specific topics from the tree or graph structure, either sequentially or through sampling strategies that ensure balanced coverage.
Template Application applies configurable instruction templates to each selected topic, creating specific generation prompts that guide the model toward desired output formats.
Content Generation invokes the language model with constructed prompts, applying quality control measures and retry logic for robust operation.
Validation and Filtering processes generated content to ensure format compliance, quality standards, and appropriateness for the intended use case.
Configuration Parameters¶
Dataset generation is controlled through comprehensive configuration options:
data_engine:
instructions: "Create a detailed explanation with practical examples suitable for intermediate learners."
generation_system_prompt: "You are an expert instructor providing clear, comprehensive explanations with practical examples and detailed guidance."
provider: "openai"
model: "gpt-4"
temperature: 0.8
max_retries: 3
request_timeout: 30
default_batch_size: 5
default_num_examples: 3
dataset:
creation:
num_steps: 100 # Total examples to generate
batch_size: 5 # Examples per API call
provider: "anthropic" # Can differ from data_engine
model: "claude-3-opus"
sys_msg: true # Include system messages
save_as: "training_dataset.jsonl"
The instructions
field provides the core guidance for content generation, specifying format requirements, complexity level, target audience, and quality expectations.
Template System¶
The instruction template system enables sophisticated content generation through structured prompts:
Basic Templates provide simple instruction formats suitable for straightforward question-answer generation or explanation tasks.
Advanced Templates support complex formatting requirements including code examples, structured outputs, and multi-part responses.
Domain-Specific Templates incorporate specialized knowledge about particular fields, ensuring generated content meets domain-specific requirements and conventions.
Quality Control Mechanisms¶
Multiple layers of quality control ensure consistent output:
Retry Logic automatically retries failed generation attempts with exponential backoff, handling temporary API issues and model inconsistencies.
Format Validation checks generated content against expected structures, filtering out malformed responses that don't meet format requirements.
Content Filtering applies domain-specific quality criteria to ensure generated examples meet appropriateness and accuracy standards.
Statistical Monitoring tracks generation success rates, error categories, and quality metrics to identify systematic issues.
Batch Processing¶
Efficient batch processing balances throughput with resource consumption:
Batch Size Optimization groups multiple generation requests to minimize API overhead while respecting rate limiting and memory constraints.
Parallel Processing where supported by the model provider, enables concurrent generation requests to maximize throughput.
Progress Monitoring provides real-time feedback on generation progress, success rates, and estimated completion times.
Error Recovery maintains generation state to enable resumption after interruptions or failures.
Output Formats¶
Generated datasets support multiple output formats depending on intended use:
Conversational Format creates structured dialogues with explicit role assignments suitable for chat model training:
{
"messages": [
{
"role": "system",
"content": "You are an expert instructor providing clear explanations."
},
{
"role": "user",
"content": "Can you explain recursion in programming?"
},
{
"role": "assistant",
"content": "Recursion is a programming technique where a function calls itself..."
}
]
}
Instruction Format creates prompt-response pairs suitable for instruction-following model training:
{
"instruction": "Explain the concept of recursion with a simple example",
"response": "Recursion is a programming technique where a function calls itself..."
}
Custom Formats support application-specific requirements through configurable output templates.
System Message Control¶
The sys_msg
parameter provides fine-grained control over system message inclusion:
With System Messages (sys_msg: true
) creates training examples that include explicit role definitions, helping models understand their intended behavior and context.
Without System Messages (sys_msg: false
) creates simpler user-assistant pairs that focus purely on input-output relationships without explicit role context.
This flexibility enables optimization for different training scenarios and model architectures.
Performance Optimization¶
Several strategies optimize generation performance:
Provider Selection enables using different models for different stages, balancing cost, speed, and quality based on specific requirements.
Parameter Tuning adjusts temperature, batch size, and retry parameters to optimize for your specific use case and infrastructure constraints.
Resource Management monitors API usage, costs, and rate limiting to ensure efficient resource utilization.
Generation Strategy Recommendations
Start with smaller batch sizes and moderate temperatures for initial testing. Increase batch sizes based on your API rate limits and system resources. Use higher-quality models for final dataset generation after validating your configuration with faster, less expensive models.