Advanced Workflows
Advanced DeepFabric workflows demonstrate patterns for complex dataset generation scenarios, including multi-stage processing, quality control pipelines, and large-scale production deployments. These examples showcase techniques that go beyond basic configuration to leverage the full capabilities of the system.
Multi-Provider Pipeline
This workflow uses a different model provider for each stage of the generation process, matching cost and capability to the demands of that stage:
# multi-provider-pipeline.yaml
dataset_system_prompt: "You are creating comprehensive educational content for software engineering professionals."

# Fast, economical topic generation
topic_tree:
  topic_prompt: "Advanced software engineering practices"
  topic_system_prompt: "You are creating comprehensive educational content for software engineering professionals."
  degree: 5
  depth: 3
  temperature: 0.7
  provider: "openai"
  model: "gpt-4-turbo"
  save_as: "engineering_topics.jsonl"

# High-quality content generation
data_engine:
  instructions: "Create detailed, practical explanations with real-world examples and code samples suitable for senior developers."
  generation_system_prompt: "You are creating comprehensive educational content for software engineering professionals."
  provider: "anthropic"
  model: "claude-sonnet-4-5"
  temperature: 0.8
  max_retries: 5

# Balanced final generation
dataset:
  creation:
    num_steps: 500
    batch_size: 8
    provider: "openai"
    model: "gpt-5"
    sys_msg: true
  save_as: "engineering_dataset.jsonl"
This approach balances cost and quality by using gpt-4-turbo for broad topic exploration, claude-sonnet-4-5 for detailed content generation, and gpt-5 for final dataset creation.
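Running the pipeline is a single command, but since two providers are involved, credentials for both must be available. The environment variable names below are the providers' conventional ones and are assumed here; confirm them against your DeepFabric provider setup:

# Both providers are called in one run, so both keys must be present.
# The variable names follow each provider's standard convention (assumed).
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

deepfabric generate multi-provider-pipeline.yaml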
Topic Graph with Visualization
Advanced topic graph generation with comprehensive analysis and visualization:
# research-graph-analysis.yaml
dataset_system_prompt: "You are mapping the interconnected landscape of machine learning research areas with focus on practical applications and theoretical foundations."

topic_graph:
  topic_prompt: "Machine learning research and applications in industry"
  topic_system_prompt: "You are mapping the interconnected landscape of machine learning research areas with focus on practical applications and theoretical foundations."
  degree: 4
  depth: 4
  temperature: 0.8
  provider: "anthropic"
  model: "claude-sonnet-4-5"
  save_as: "ml_research_graph.json"

data_engine:
  instructions: "Create comprehensive research summaries with current trends, practical applications, and technical depth appropriate for graduate-level study."
  generation_system_prompt: "You are mapping the interconnected landscape of machine learning research areas with focus on practical applications and theoretical foundations."
  provider: "openai"
  model: "gpt-4"
  temperature: 0.7
  max_retries: 3

dataset:
  creation:
    num_steps: 200
    batch_size: 6
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "ml_research_dataset.jsonl"

huggingface:
  repository: "research-org/ml-research-synthesis"
  tags:
    - "machine-learning"
    - "research"
    - "graduate-level"
    - "industry-applications"
Run the complete workflow, from generation through analysis to publishing:
# Generate with graph visualization
deepfabric generate research-graph-analysis.yaml

# Create visualization for analysis
deepfabric visualize ml_research_graph.json --output research_structure

# Validate before publishing
deepfabric validate research-graph-analysis.yaml

# Upload to Hugging Face with metadata
deepfabric upload ml_research_dataset.jsonl --repo research-org/ml-research-synthesis
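The saved graph file is plain JSON, so it can also be inspected programmatically before committing to a full dataset run. A minimal sketch that summarizes structure without assuming a particular schema (field names vary by DeepFabric version, so it only reports what it finds):

# inspect_graph.py - quick structural summary of the saved topic graph
import json

with open("ml_research_graph.json", encoding="utf-8") as f:
    graph = json.load(f)

# Report whatever top-level structure is present rather than
# assuming specific field names.
if isinstance(graph, dict):
    print("Top-level keys:", list(graph.keys()))
    for key, value in graph.items():
        if isinstance(value, list):
            print(f"  {key}: {len(value)} entries")
        elif isinstance(value, dict):
            print(f"  {key}: {len(value)} keys")
else:
    print(f"Top-level value is a {type(graph).__name__}")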
Quality Control Pipeline
Quality control through configuration validation, conservative generation settings, and scripted post-generation checks:
# quality-controlled-generation.yaml
dataset_system_prompt: "You are creating high-quality technical documentation with emphasis on accuracy, clarity, and practical utility."

topic_tree:
  topic_prompt: "Modern web development frameworks and best practices"
  topic_system_prompt: "You are creating high-quality technical documentation with emphasis on accuracy, clarity, and practical utility."
  degree: 4
  depth: 3
  temperature: 0.6  # Lower temperature for consistency
  provider: "openai"
  model: "gpt-4"
  save_as: "webdev_topics.jsonl"

data_engine:
  instructions: "Create technically accurate documentation with working code examples, best practices, and common pitfalls. Include version-specific information and real-world usage patterns."
  generation_system_prompt: "You are creating high-quality technical documentation with emphasis on accuracy, clarity, and practical utility."
  provider: "anthropic"
  model: "claude-sonnet-4-5"
  temperature: 0.7
  max_retries: 5
  request_timeout: 60  # Extended timeout for quality

dataset:
  creation:
    num_steps: 300
    batch_size: 4  # Smaller batches for quality control
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "webdev_documentation.jsonl"
Implement additional quality control through scripted validation:
#!/bin/bash
# quality-control-workflow.sh

# Step 1: Validate configuration
echo "Validating configuration..."
deepfabric validate quality-controlled-generation.yaml
if [ $? -ne 0 ]; then
    echo "Configuration validation failed"
    exit 1
fi

# Step 2: Generate with monitoring
echo "Starting generation with quality monitoring..."
deepfabric generate quality-controlled-generation.yaml

# Step 3: Post-generation analysis
echo "Analyzing generated dataset..."
python analyze_dataset.py webdev_documentation.jsonl

# Step 4: Quality metrics evaluation
echo "Evaluating quality metrics..."
python quality_metrics.py webdev_documentation.jsonl

# Step 5: Conditional upload based on quality scores
if [ $? -eq 0 ]; then
    echo "Quality thresholds met, uploading to Hugging Face..."
    deepfabric upload webdev_documentation.jsonl --repo tech-docs/webdev-guide
else
    echo "Quality thresholds not met, review and regenerate"
    exit 1
fi
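The workflow treats quality_metrics.py as the gate: its exit code decides whether the upload runs. The script itself is user-supplied; here is one possible minimal gate, with illustrative thresholds (both the record floor and the parse-failure cap are assumptions to tune against your own acceptance criteria):

# quality_metrics.py - one possible quality gate for the workflow above.
# Thresholds are illustrative assumptions, not DeepFabric defaults.
import json
import sys

MIN_RECORDS = 250              # assumed floor for a 300-step run
MAX_PARSE_FAILURE_RATE = 0.02  # assumed tolerance for malformed lines

path = sys.argv[1]
total = 0
failures = 0
with open(path, encoding="utf-8") as f:
    for line in f:
        total += 1
        try:
            json.loads(line)
        except json.JSONDecodeError:
            failures += 1

failure_rate = failures / total if total else 1.0
print(f"{total} records, parse failure rate {failure_rate:.2%}")

# A nonzero exit makes the shell workflow skip the upload step.
if total < MIN_RECORDS or failure_rate > MAX_PARSE_FAILURE_RATE:
    sys.exit(1)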
Large-Scale Production Dataset
Configuration for generating large datasets, with settings tuned for throughput, retries, and long-running requests:
# production-scale-dataset.yaml
dataset_system_prompt: "You are creating comprehensive training data for customer service AI systems, focusing on natural conversation patterns and helpful problem-solving approaches."

topic_tree:
  topic_prompt: "Customer service scenarios across different industries and interaction types"
  topic_system_prompt: "You are creating comprehensive training data for customer service AI systems, focusing on natural conversation patterns and helpful problem-solving approaches."
  degree: 6  # Broad coverage
  depth: 4   # Deep exploration
  temperature: 0.8
  provider: "openai"
  model: "gpt-4"
  save_as: "customer_service_topics.jsonl"

data_engine:
  instructions: "Create realistic customer service conversations showing empathetic, helpful responses to various customer needs, complaints, and inquiries. Include diverse customer personalities and complex problem-solving scenarios."
  generation_system_prompt: "You are creating comprehensive training data for customer service AI systems, focusing on natural conversation patterns and helpful problem-solving approaches."
  provider: "openai"
  model: "gpt-4"
  temperature: 0.8
  max_retries: 5
  request_timeout: 45

dataset:
  creation:
    num_steps: 5000  # Large-scale generation
    batch_size: 10   # Optimized for throughput
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "customer_service_dataset.jsonl"

huggingface:
  repository: "enterprise-ai/customer-service-training"
  tags:
    - "customer-service"
    - "conversation"
    - "enterprise"
    - "training-data"
Dataset Transformation Pipeline
Download existing datasets from Hugging Face Hub, transform them with multiple formatters, validate, and republish. This workflow is ideal for dataset curation and format standardization:
#!/bin/bash
# dataset-transformation-pipeline.sh
set -e  # Exit on error

SOURCE_REPO="community/agent-reasoning-dataset"
TARGET_REPO="your-org/curated-reasoning-dataset"
TEMP_DIR="./pipeline_temp"

echo "=== Dataset Transformation Pipeline ==="
echo "Source: $SOURCE_REPO"
echo "Target: $TARGET_REPO"

# Create temporary working directory
mkdir -p "$TEMP_DIR"
cd "$TEMP_DIR"

# Stage 1: Download and format from Hub
echo ""
echo "Stage 1: Downloading and formatting from Hub..."
deepfabric format --repo "$SOURCE_REPO" --formatter trl -o stage1_trl.jsonl

# Stage 2: Apply secondary formatting for different training frameworks
echo ""
echo "Stage 2: Creating multiple format variants..."
deepfabric format stage1_trl.jsonl -f harmony -o stage2_harmony.jsonl
deepfabric format stage1_trl.jsonl -f conversations -o stage2_conversations.jsonl
deepfabric format stage1_trl.jsonl -f chatml -o stage2_chatml.jsonl

# Stage 3: Validate all outputs
echo ""
echo "Stage 3: Validating transformed datasets..."
python ../validate_formats.py stage1_trl.jsonl stage2_harmony.jsonl stage2_conversations.jsonl stage2_chatml.jsonl

# Stage 4: Quality assessment
echo ""
echo "Stage 4: Running quality assessment..."
python ../assess_quality.py stage2_*.jsonl

# Stage 5: Upload curated versions
echo ""
echo "Stage 5: Uploading curated datasets..."
deepfabric upload stage1_trl.jsonl \
    --repo "${TARGET_REPO}-trl" \
    --tags curated trl agent-tools training
deepfabric upload stage2_harmony.jsonl \
    --repo "${TARGET_REPO}-harmony" \
    --tags curated harmony gpt-oss training
deepfabric upload stage2_conversations.jsonl \
    --repo "${TARGET_REPO}-conversations" \
    --tags curated conversations training
deepfabric upload stage2_chatml.jsonl \
    --repo "${TARGET_REPO}-chatml" \
    --tags curated chatml training

echo ""
echo "=== Pipeline Complete ==="
echo "Curated datasets available at:"
echo "  - https://huggingface.co/datasets/${TARGET_REPO}-trl"
echo "  - https://huggingface.co/datasets/${TARGET_REPO}-harmony"
echo "  - https://huggingface.co/datasets/${TARGET_REPO}-conversations"
echo "  - https://huggingface.co/datasets/${TARGET_REPO}-chatml"

# Cleanup
cd ..
rm -rf "$TEMP_DIR"
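Stage 3 relies on validate_formats.py, which, like the other helper scripts in this section, is user-supplied. One possible minimal implementation simply confirms that each output parses as non-empty JSONL, leaving format-specific checks to you:

# validate_formats.py - one possible implementation of the Stage 3 helper.
# It only checks that each file is non-empty, line-delimited JSON;
# format-specific validation is left to the reader.
import json
import sys

exit_code = 0
for path in sys.argv[1:]:
    count = 0
    try:
        with open(path, encoding="utf-8") as f:
            for count, line in enumerate(f, start=1):
                json.loads(line)
    except (OSError, json.JSONDecodeError) as exc:
        print(f"FAIL {path}: {exc}")
        exit_code = 1
        continue
    if count == 0:
        print(f"FAIL {path}: empty file")
        exit_code = 1
    else:
        print(f"OK   {path}: {count} records")

sys.exit(exit_code)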