Hugging Face Integration¶
DeepFabric's Hugging Face Hub integration streamlines dataset publishing with automatic metadata generation, dataset cards, and community sharing features, turning synthetic datasets into discoverable, well-documented resources for the machine learning community.
Basic Hub Integration¶
Simple dataset upload with automatic documentation:
# basic-hf-upload.yaml
dataset_system_prompt: "You are creating educational programming content for computer science students."

topic_tree:
  topic_prompt: "Python programming fundamentals for beginners"
  topic_system_prompt: "You are creating educational programming content for computer science students."
  degree: 4
  depth: 2
  temperature: 0.7
  provider: "openai"
  model: "gpt-3.5-turbo"
  save_as: "python_basics_topics.jsonl"

data_engine:
  instructions: "Create clear, beginner-friendly programming examples with step-by-step explanations and practical exercises."
  generation_system_prompt: "You are creating educational programming content for computer science students."
  provider: "openai"
  model: "gpt-4"
  temperature: 0.8
  max_retries: 3

dataset:
  creation:
    num_steps: 100
    batch_size: 5
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "python_beginners_dataset.jsonl"

# Hugging Face Hub configuration
huggingface:
  repository: "education/python-programming-basics"
  tags:
    - "programming"
    - "python"
    - "education"
    - "beginner-friendly"
    - "code-examples"
Generate and upload with a single command:
# Set authentication
export HF_TOKEN="your-huggingface-token"
# Generate and auto-upload
deepfabric generate basic-hf-upload.yaml
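As a quick check that the upload succeeded, the published dataset can be loaded back with the datasets library. This is a minimal sketch, assuming the repository name configured above and that the repository is public (or that HF_TOKEN grants access):
# verify_upload.py -- illustrative check, not part of DeepFabric
from datasets import load_dataset

# Pull the published dataset straight from the Hub
dataset = load_dataset("education/python-programming-basics", split="train")

print(dataset)     # features and row count
print(dataset[0])  # first record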
Multi-Dataset Repository¶
Organize related datasets as separate components within a single repository:
# comprehensive-ml-course.yaml
dataset_system_prompt: "You are creating a comprehensive machine learning curriculum with theoretical foundations and practical applications."

topic_tree:
  topic_prompt: "Machine learning concepts from basics to advanced applications"
  topic_system_prompt: "You are creating a comprehensive machine learning curriculum with theoretical foundations and practical applications."
  degree: 5
  depth: 3
  temperature: 0.7
  provider: "anthropic"
  model: "claude-3-sonnet"
  save_as: "ml_course_topics.jsonl"

data_engine:
  instructions: "Create detailed explanations with mathematical foundations, practical examples, and real-world applications suitable for undergraduate and graduate students."
  generation_system_prompt: "You are creating a comprehensive machine learning curriculum with theoretical foundations and practical applications."
  provider: "anthropic"
  model: "claude-3-opus"
  temperature: 0.8
  max_retries: 3

dataset:
  creation:
    num_steps: 300
    batch_size: 6
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "ml_course_dataset.jsonl"

huggingface:
  repository: "university/comprehensive-ml-curriculum"
  tags:
    - "machine-learning"
    - "education"
    - "curriculum"
    - "undergraduate"
    - "graduate"
    - "mathematics"
    - "practical-applications"
Upload multiple related datasets:
# Generate different course components
deepfabric generate comprehensive-ml-course.yaml

# Generate specialized components with parameter overrides
deepfabric generate comprehensive-ml-course.yaml \
  --dataset-save-as "ml_fundamentals.jsonl" \
  --num-steps 150 \
  --depth 2

deepfabric generate comprehensive-ml-course.yaml \
  --dataset-save-as "ml_advanced_topics.jsonl" \
  --num-steps 200 \
  --temperature 0.9

# Upload each component with specific tags
deepfabric upload ml_course_dataset.jsonl \
  --repo university/comprehensive-ml-curriculum \
  --tags fundamentals theory

deepfabric upload ml_fundamentals.jsonl \
  --repo university/comprehensive-ml-curriculum \
  --tags basics introduction

deepfabric upload ml_advanced_topics.jsonl \
  --repo university/comprehensive-ml-curriculum \
  --tags advanced research-topics
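To confirm that every component file landed in the shared repository, the repository contents can be listed with huggingface_hub. A minimal sketch, assuming the repository name above and that HF_TOKEN is set in the environment:
# check_repo_contents.py -- illustrative check, not part of DeepFabric
from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment when needed

# List every file currently stored in the dataset repository
files = api.list_repo_files("university/comprehensive-ml-curriculum", repo_type="dataset")
for path in sorted(files):
    print(path)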
Enterprise Dataset Publishing¶
Professional dataset publishing with comprehensive documentation:
# enterprise-customer-support.yaml
dataset_system_prompt: "You are creating professional customer support training data that demonstrates excellence in customer service across various industries and scenarios."

topic_tree:
  topic_prompt: "Customer support excellence across industries: retail, technology, healthcare, finance, and services"
  topic_system_prompt: "You are creating professional customer support training data that demonstrates excellence in customer service across various industries and scenarios."
  degree: 5
  depth: 4
  temperature: 0.8
  provider: "openai"
  model: "gpt-4"
  save_as: "customer_support_topics.jsonl"

data_engine:
  instructions: "Create realistic, professional customer service interactions demonstrating empathy, problem-solving skills, and industry-specific knowledge. Include complex scenarios, difficult customers, and exemplary resolution techniques."
  generation_system_prompt: "You are creating professional customer support training data that demonstrates excellence in customer service across various industries and scenarios."
  provider: "anthropic"
  model: "claude-3-opus"
  temperature: 0.8
  max_retries: 5
  request_timeout: 60

dataset:
  creation:
    num_steps: 1000
    batch_size: 8
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "enterprise_customer_support.jsonl"

huggingface:
  repository: "enterprise-ai/customer-support-excellence"
  tags:
    - "customer-service"
    - "professional-training"
    - "multi-industry"
    - "conversation"
    - "enterprise"
    - "support-excellence"
    - "training-data"
Professional deployment with quality assurance:
# enterprise_deployment.py
import json
import logging
from typing import Any, Dict

from deepfabric import DeepFabricConfig


def validate_enterprise_dataset(dataset_path: str) -> Dict[str, Any]:
    """Validate enterprise dataset for quality and compliance."""
    validation_metrics = {
        "total_conversations": 0,
        "average_length": 0,
        "professional_language_score": 0,
        "industry_coverage": set(),
        "quality_indicators": {
            "empathy_markers": 0,
            "solution_oriented": 0,
            "professional_tone": 0,
        },
    }

    professional_markers = ["understand", "apologize", "help", "resolve", "appreciate"]
    solution_markers = ["solution", "fix", "resolve", "address", "handle"]

    with open(dataset_path, "r") as f:
        for line in f:
            conversation = json.loads(line)
            validation_metrics["total_conversations"] += 1

            # Analyze the final message of the conversation
            content = conversation["messages"][-1]["content"].lower()

            # Check for professional markers
            empathy_count = sum(1 for marker in professional_markers if marker in content)
            solution_count = sum(1 for marker in solution_markers if marker in content)

            validation_metrics["quality_indicators"]["empathy_markers"] += empathy_count
            validation_metrics["quality_indicators"]["solution_oriented"] += solution_count

            # Estimate professional tone (simplified)
            if empathy_count > 0 and solution_count > 0:
                validation_metrics["quality_indicators"]["professional_tone"] += 1

    # Calculate averages
    if validation_metrics["total_conversations"] > 0:
        total = validation_metrics["total_conversations"]
        validation_metrics["professional_language_score"] = (
            validation_metrics["quality_indicators"]["professional_tone"] / total
        )

    return validation_metrics


def deploy_enterprise_dataset(config_path: str) -> bool:
    """Deploy enterprise dataset with full validation pipeline."""
    # Setup logging
    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger(__name__)

    # Load and validate configuration
    logger.info("Loading configuration...")
    config = DeepFabricConfig.from_yaml(config_path)

    logger.info("Validating configuration...")
    validation = config.validate()
    if not validation.is_valid:
        logger.error("Configuration validation failed")
        for error in validation.errors:
            logger.error(f"  - {error}")
        return False

    # Generate dataset (this would typically use the CLI)
    logger.info("Dataset generation would occur here...")

    # Post-generation validation
    dataset_path = config.get_dataset_config()["save_as"]
    logger.info(f"Validating generated dataset: {dataset_path}")
    metrics = validate_enterprise_dataset(dataset_path)

    # Quality gates
    min_professional_score = 0.8
    min_conversations = 500

    if metrics["professional_language_score"] < min_professional_score:
        logger.error(
            f"Professional language score {metrics['professional_language_score']:.2f} "
            f"below threshold {min_professional_score}"
        )
        return False

    if metrics["total_conversations"] < min_conversations:
        logger.error(
            f"Total conversations {metrics['total_conversations']} below minimum {min_conversations}"
        )
        return False

    logger.info("All quality gates passed")
    logger.info(f"Professional Language Score: {metrics['professional_language_score']:.2%}")
    logger.info(f"Total Conversations: {metrics['total_conversations']}")

    # Upload to Hugging Face
    hf_config = config.get_huggingface_config()
    repo = hf_config.get("repository")
    if repo:
        logger.info(f"Uploading to Hugging Face Hub: {repo}")
        # Upload command would go here
        # subprocess.run(["deepfabric", "upload", dataset_path, "--repo", repo])

    return True


if __name__ == "__main__":
    deploy_enterprise_dataset("enterprise-customer-support.yaml")
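The upload step above is intentionally stubbed out. If you prefer to stay in Python rather than shelling out to the deepfabric upload CLI, one possible implementation pushes the file with huggingface_hub directly. This is a sketch of that alternative, not DeepFabric's own uploader, and it assumes HF_TOKEN is available in the environment:
# upload_step.py -- illustrative alternative to the `deepfabric upload` CLI
from pathlib import Path

from huggingface_hub import HfApi


def upload_to_hub(dataset_path: str, repo: str) -> None:
    """Push the generated JSONL file to a Hub dataset repository."""
    api = HfApi()  # authenticates via HF_TOKEN
    api.create_repo(repo, repo_type="dataset", exist_ok=True)
    api.upload_file(
        path_or_fileobj=dataset_path,
        path_in_repo=Path(dataset_path).name,
        repo_id=repo,
        repo_type="dataset",
    )
Note that uploading this way bypasses DeepFabric's automatic dataset-card and tag handling, so the CLI remains the simpler path when that metadata matters.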
Research Dataset with Comprehensive Metadata¶
Academic dataset publication with detailed provenance and methodology documentation:
# research-nlp-dataset.yaml
dataset_system_prompt: "You are creating research-quality natural language processing datasets with focus on linguistic diversity, theoretical soundness, and reproducibility."

# Auto-detects graph mode since topic_graph section is present
topic_graph:
  topic_prompt: "Natural language processing research areas: syntax, semantics, pragmatics, computational linguistics, and applications"
  topic_system_prompt: "You are creating research-quality natural language processing datasets with focus on linguistic diversity, theoretical soundness, and reproducibility."
  degree: 4
  depth: 3
  temperature: 0.8
  provider: "anthropic"
  model: "claude-3-opus"
  save_as: "nlp_research_graph.json"

data_engine:
  instructions: "Create academically rigorous natural language processing examples with theoretical grounding, citing relevant literature where appropriate, and demonstrating complex linguistic phenomena suitable for graduate-level research."
  generation_system_prompt: "You are creating research-quality natural language processing datasets with focus on linguistic diversity, theoretical soundness, and reproducibility."
  provider: "openai"
  model: "gpt-4"
  temperature: 0.7
  max_retries: 5

dataset:
  creation:
    num_steps: 400
    batch_size: 4
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "nlp_research_dataset.jsonl"

huggingface:
  repository: "research-lab/nlp-theoretical-foundations"
  tags:
    - "natural-language-processing"
    - "computational-linguistics"
    - "research"
    - "theoretical"
    - "graduate-level"
    - "linguistics"
    - "syntax"
    - "semantics"
    - "pragmatics"
Complete research workflow with visualization and documentation:
#!/bin/bash
# research-publication-workflow.sh
echo "=== NLP Research Dataset Publication Workflow ==="
# Step 1: Configuration validation
echo "Step 1: Validating research configuration..."
deepfabric validate research-nlp-dataset.yaml
if [ $? -ne 0 ]; then
    echo "Configuration validation failed - aborting"
    exit 1
fi
# Step 2: Generate dataset with graph structure
echo "Step 2: Generating research dataset..."
deepfabric generate research-nlp-dataset.yaml
# Step 3: Create research visualizations
echo "Step 3: Creating topic graph visualization..."
deepfabric visualize nlp_research_graph.json --output research_topology
# Step 4: Generate research documentation
echo "Step 4: Generating research documentation..."
python generate_research_metadata.py nlp_research_dataset.jsonl nlp_research_graph.json
# Step 5: Quality assessment
echo "Step 5: Conducting quality assessment..."
python research_quality_assessment.py nlp_research_dataset.jsonl
# Step 6: Upload with comprehensive metadata
echo "Step 6: Publishing to Hugging Face Hub..."
deepfabric upload nlp_research_dataset.jsonl \
  --repo research-lab/nlp-theoretical-foundations \
  --tags nlp research theoretical graduate-level
echo "=== Publication workflow complete ==="
echo "Dataset available at: https://huggingface.co/datasets/research-lab/nlp-theoretical-foundations"
echo "Visualization available at: research_topology.svg"
Community Dataset with Collaborative Features¶
Open-source community dataset with broad accessibility:
# community-programming-help.yaml
dataset_system_prompt: "You are creating community-driven programming help content that demonstrates collaborative problem-solving, mentoring approaches, and inclusive technical communication."

topic_tree:
  topic_prompt: "Programming help and mentorship across languages, frameworks, and skill levels"
  topic_system_prompt: "You are creating community-driven programming help content that demonstrates collaborative problem-solving, mentoring approaches, and inclusive technical communication."
  degree: 6
  depth: 3
  temperature: 0.8
  provider: "openai"
  model: "gpt-4"
  save_as: "programming_help_topics.jsonl"

data_engine:
  instructions: "Create supportive, educational programming discussions that demonstrate effective mentoring, inclusive language, and collaborative problem-solving approaches suitable for diverse technical communities."
  generation_system_prompt: "You are creating community-driven programming help content that demonstrates collaborative problem-solving, mentoring approaches, and inclusive technical communication."
  provider: "openai"
  model: "gpt-4"
  temperature: 0.8
  max_retries: 3

dataset:
  creation:
    num_steps: 750
    batch_size: 10
    provider: "openai"
    model: "gpt-4"
    sys_msg: true
  save_as: "community_programming_help.jsonl"

huggingface:
  repository: "community/programming-mentorship"
  tags:
    - "programming"
    - "mentorship"
    - "community"
    - "collaborative"
    - "inclusive"
    - "help"
    - "education"
    - "open-source"
The Hugging Face integration provides a complete pathway from synthetic data generation to community sharing, enabling researchers and practitioners to contribute high-quality synthetic datasets to the broader machine learning ecosystem.