
Advanced Workflows

Advanced DeepFabric workflows demonstrate sophisticated patterns for complex dataset generation scenarios, including multi-stage processing, quality control pipelines, and large-scale production deployments. These examples showcase techniques that go beyond basic configuration to leverage the full capabilities of the system.

Multi-Provider Pipeline

This workflow uses different model providers optimized for different stages of the generation process:

# multi-provider-pipeline.yaml
dataset_system_prompt: "You are creating comprehensive educational content for software engineering professionals."

# Fast, economical topic generation
topic_tree:
  topic_prompt: "Advanced software engineering practices"
  topic_system_prompt: "You are creating comprehensive educational content for software engineering professionals."
  degree: 5
  depth: 3
  temperature: 0.7
  provider: "openai"
  model: "gpt-3.5-turbo"
  save_as: "engineering_topics.jsonl"

# High-quality content generation
data_engine:
  instructions: "Create detailed, practical explanations with real-world examples and code samples suitable for senior developers."
  generation_system_prompt: "You are creating comprehensive educational content for software engineering professionals."
  provider: "anthropic"
  model: "claude-3-opus"
  temperature: 0.8
  max_retries: 5

# Balanced final generation
dataset:
  creation:
    num_steps: 500
    batch_size: 8
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "engineering_dataset.jsonl"

This approach optimizes cost and quality by using GPT-3.5-turbo for broad topic exploration, Claude-3-Opus for detailed content generation, and GPT-4 for final dataset creation.
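The tree dimensions above also determine how much topic material the cheaper model produces before the more expensive stages run. A quick back-of-the-envelope check (plain Python, independent of DeepFabric; real counts depend on the model's responses, so treat these as upper bounds):

```python
def tree_capacity(degree: int, depth: int) -> tuple[int, int]:
    """Upper bound on subtopic nodes and leaf paths for a topic tree."""
    # Each level l holds at most degree**l nodes; leaves sit at the deepest level.
    nodes = sum(degree ** level for level in range(1, depth + 1))
    leaves = degree ** depth
    return nodes, leaves

nodes, leaves = tree_capacity(degree=5, depth=3)
print(f"up to {nodes} subtopics, {leaves} leaf paths")
# up to 155 subtopics, 125 leaf paths
```

With degree 5 and depth 3, the inexpensive topic stage can fan out into at most 125 leaf paths, which the later, costlier stages then expand into examples.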

Topic Graph with Visualization

Advanced topic graph generation with comprehensive analysis and visualization:

# research-graph-analysis.yaml
dataset_system_prompt: "You are mapping the interconnected landscape of machine learning research areas with focus on practical applications and theoretical foundations."

topic_graph:
  topic_prompt: "Machine learning research and applications in industry"
  topic_system_prompt: "You are mapping the interconnected landscape of machine learning research areas with focus on practical applications and theoretical foundations."
  degree: 4
  depth: 4
  temperature: 0.8
  provider: "anthropic"
  model: "claude-3-opus"
  save_as: "ml_research_graph.json"

data_engine:
  instructions: "Create comprehensive research summaries with current trends, practical applications, and technical depth appropriate for graduate-level study."
  generation_system_prompt: "You are mapping the interconnected landscape of machine learning research areas with focus on practical applications and theoretical foundations."
  provider: "openai"
  model: "gpt-4"
  temperature: 0.7
  max_retries: 3

dataset:
  creation:
    num_steps: 200
    batch_size: 6
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "ml_research_dataset.jsonl"

huggingface:
  repository: "research-org/ml-research-synthesis"
  tags:
    - "machine-learning"
    - "research"
    - "graduate-level"
    - "industry-applications"

Generate and analyze the complete workflow:

# Generate with graph visualization
deepfabric generate research-graph-analysis.yaml

# Create visualization for analysis
deepfabric visualize ml_research_graph.json --output research_structure

# Validate before publishing
deepfabric validate research-graph-analysis.yaml

# Upload to Hugging Face with metadata
deepfabric upload ml_research_dataset.jsonl --repo research-org/ml-research-synthesis

Quality Control Pipeline

Sophisticated quality control through validation, filtering, and iterative refinement:

# quality-controlled-generation.yaml
dataset_system_prompt: "You are creating high-quality technical documentation with emphasis on accuracy, clarity, and practical utility."

topic_tree:
  topic_prompt: "Modern web development frameworks and best practices"
  topic_system_prompt: "You are creating high-quality technical documentation with emphasis on accuracy, clarity, and practical utility."
  degree: 4
  depth: 3
  temperature: 0.6  # Lower temperature for consistency
  provider: "openai"
  model: "gpt-4"
  save_as: "webdev_topics.jsonl"

data_engine:
  instructions: "Create technically accurate documentation with working code examples, best practices, and common pitfalls. Include version-specific information and real-world usage patterns."
  generation_system_prompt: "You are creating high-quality technical documentation with emphasis on accuracy, clarity, and practical utility."
  provider: "anthropic"
  model: "claude-3-opus"
  temperature: 0.7
  max_retries: 5
  request_timeout: 60  # Extended timeout for quality

dataset:
  creation:
    num_steps: 300
    batch_size: 4  # Smaller batches for quality control
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "webdev_documentation.jsonl"

Implement additional quality control through scripted validation:

#!/bin/bash
# quality-control-workflow.sh

# Step 1: Validate configuration
echo "Validating configuration..."
if ! deepfabric validate quality-controlled-generation.yaml; then
    echo "Configuration validation failed"
    exit 1
fi

# Step 2: Generate with monitoring
echo "Starting generation with quality monitoring..."
deepfabric generate quality-controlled-generation.yaml

# Step 3: Post-generation analysis
echo "Analyzing generated dataset..."
python analyze_dataset.py webdev_documentation.jsonl

# Step 4: Quality metrics evaluation
echo "Evaluating quality metrics..."

# Step 5: Conditional upload based on quality scores
if python quality_metrics.py webdev_documentation.jsonl; then
    echo "Quality thresholds met, uploading to Hugging Face..."
    deepfabric upload webdev_documentation.jsonl --repo tech-docs/webdev-guide
else
    echo "Quality thresholds not met, review and regenerate"
    exit 1
fi
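The analyze_dataset.py and quality_metrics.py scripts are project-specific and not shipped with DeepFabric. As one illustration, a minimal quality_metrics.py that gates the upload step could check for malformed lines and average response length; the thresholds here are placeholders, and the script assumes the chat-format JSONL (a "messages" list per line) produced by the config above:

```python
# quality_metrics.py -- hypothetical quality gate; exits non-zero on failure
import json
import sys

MIN_AVG_LENGTH = 200        # illustrative: mean assistant reply length in chars
MAX_PARSE_FAILURES = 0.01   # illustrative: tolerate up to 1% malformed lines

def evaluate(path: str) -> bool:
    """Return True when the dataset passes both placeholder thresholds."""
    total, failures, length_sum = 0, 0, 0
    with open(path) as f:
        for line in f:
            total += 1
            try:
                example = json.loads(line)
                length_sum += len(example["messages"][-1]["content"])
            except (json.JSONDecodeError, KeyError, IndexError, TypeError):
                failures += 1
    if total == 0:
        return False
    parsed = total - failures
    avg_length = length_sum / parsed if parsed else 0
    return failures / total <= MAX_PARSE_FAILURES and avg_length >= MIN_AVG_LENGTH

if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(0 if evaluate(sys.argv[1]) else 1)
```

Because the script exits non-zero when a threshold fails, the shell workflow's conditional upload works unchanged.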

Large-Scale Production Dataset

Configuration for generating large datasets with resource management and checkpointing:

# production-scale-dataset.yaml
dataset_system_prompt: "You are creating comprehensive training data for customer service AI systems, focusing on natural conversation patterns and helpful problem-solving approaches."

topic_tree:
  topic_prompt: "Customer service scenarios across different industries and interaction types"
  topic_system_prompt: "You are creating comprehensive training data for customer service AI systems, focusing on natural conversation patterns and helpful problem-solving approaches."
  degree: 6  # Broad coverage
  depth: 4   # Deep exploration
  temperature: 0.8
  provider: "openai"
  model: "gpt-4"
  save_as: "customer_service_topics.jsonl"

data_engine:
  instructions: "Create realistic customer service conversations showing empathetic, helpful responses to various customer needs, complaints, and inquiries. Include diverse customer personalities and complex problem-solving scenarios."
  generation_system_prompt: "You are creating comprehensive training data for customer service AI systems, focusing on natural conversation patterns and helpful problem-solving approaches."
  provider: "openai"
  model: "gpt-4"
  temperature: 0.8
  max_retries: 5
  request_timeout: 45

dataset:
  creation:
    num_steps: 5000  # Large-scale generation
    batch_size: 10   # Optimized for throughput
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "customer_service_dataset.jsonl"

huggingface:
  repository: "enterprise-ai/customer-service-training"
  tags:
    - "customer-service"
    - "conversation"
    - "enterprise"
    - "training-data"

Production deployment script with monitoring and resource management:

# production_deployment.py
import time
import logging
from deepfabric import DeepFabricConfig, DataSetGenerator, Tree

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def deploy_large_scale_generation(config_path, checkpoint_interval=500):
    """Deploy large-scale generation with checkpointing and monitoring."""

    config = DeepFabricConfig.from_yaml(config_path)

    # Load or create topic tree
    tree = Tree(**config.get_tree_args())
    tree.build()
    tree.save("production_topics.jsonl")

    # Create generator with production settings
    generator = DataSetGenerator(**config.get_engine_args())

    # Large-scale generation with checkpointing
    dataset_config = config.get_dataset_config()
    total_steps = dataset_config["creation"]["num_steps"]
    batch_size = dataset_config["creation"]["batch_size"]

    completed = 0
    start_time = time.time()

    while completed < total_steps:
        remaining = min(checkpoint_interval, total_steps - completed)

        logger.info(f"Generating batch {completed}-{completed + remaining}")

        batch_dataset = generator.create_data(
            num_steps=remaining,
            batch_size=batch_size,
            topic_model=tree
        )

        # Save checkpoint
        checkpoint_file = f"checkpoint_{completed}_{completed + remaining}.jsonl"
        batch_dataset.save(checkpoint_file)

        completed += remaining
        elapsed = time.time() - start_time
        rate = completed / elapsed

        logger.info(f"Progress: {completed}/{total_steps} ({completed/total_steps:.1%})")
        logger.info(f"Rate: {rate:.1f} steps/second")
        logger.info(f"ETA: {(total_steps - completed) / rate / 60:.1f} minutes")

if __name__ == "__main__":
    deploy_large_scale_generation("production-scale-dataset.yaml")
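One detail the script leaves open is reassembling the checkpoint shards into the final dataset. A hypothetical merge helper, keyed on the checkpoint_{start}_{end}.jsonl naming used above:

```python
# merge_checkpoints.py -- combine checkpoint_*.jsonl shards into one dataset
import glob
import os

def merge_checkpoints(pattern: str, output_path: str) -> int:
    """Concatenate JSONL shards in step order; returns the number of lines written."""
    written = 0
    with open(output_path, "w") as out:
        # Sort numerically by the starting step encoded in the filename,
        # so checkpoint_1000_1500 follows checkpoint_500_1000.
        for shard in sorted(glob.glob(pattern),
                            key=lambda p: int(os.path.basename(p).split("_")[1])):
            with open(shard) as f:
                for line in f:
                    if line.strip():
                        out.write(line)
                        written += 1
    return written

if __name__ == "__main__":
    count = merge_checkpoints("checkpoint_*.jsonl", "customer_service_dataset.jsonl")
    print(f"merged {count} examples")
```

Keeping the shards around after the merge also gives you a cheap way to resume or re-run individual ranges without regenerating the whole dataset.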

Domain-Specific Validation

Custom validation pipeline for specialized domains:

# domain_validator.py
import json
import re
from typing import Dict

def validate_code_examples(dataset_path: str) -> Dict[str, int]:
    """Validate code examples in generated dataset."""

    validation_results = {
        "total_examples": 0,
        "valid_code_blocks": 0,
        "syntax_errors": 0,
        "missing_explanations": 0,
        "quality_score": 0,
        "overall_quality": 0.0,
    }

    with open(dataset_path, 'r') as f:
        for line in f:
            example = json.loads(line)
            validation_results["total_examples"] += 1

            # Extract fenced code blocks along with their language tags
            code_blocks = re.findall(r'```(\w*)\n(.*?)\n```',
                                     example["messages"][-1]["content"],
                                     re.DOTALL)

            if code_blocks:
                validation_results["valid_code_blocks"] += 1

                # Basic syntax validation for Python blocks only (simplified);
                # compiling other languages as Python would report false errors
                for lang, code in code_blocks:
                    if lang in ("", "python", "py"):
                        try:
                            compile(code, '<string>', 'exec')
                        except SyntaxError:
                            validation_results["syntax_errors"] += 1

            # Check for explanations
            content = example["messages"][-1]["content"]
            if len(content) > 200 and any(word in content.lower()
                                          for word in ("because", "this", "when", "why")):
                validation_results["quality_score"] += 1
            else:
                validation_results["missing_explanations"] += 1

    # Calculate quality metrics
    if validation_results["total_examples"] > 0:
        quality_rate = validation_results["quality_score"] / validation_results["total_examples"]
        validation_results["overall_quality"] = quality_rate

    return validation_results

def main():
    results = validate_code_examples("webdev_documentation.jsonl")
    print("Dataset Quality Report:")
    print(f"Total Examples: {results['total_examples']}")
    print(f"Code Block Coverage: {results['valid_code_blocks']}/{results['total_examples']}")
    print(f"Syntax Error Rate: {results['syntax_errors']}/{results['valid_code_blocks']}")
    print(f"Overall Quality Score: {results['overall_quality']:.2%}")

if __name__ == "__main__":
    main()

These advanced workflows demonstrate production-ready patterns for sophisticated dataset generation scenarios, including resource optimization, quality control, and comprehensive validation pipelines.