DeepFabric

DeepFabric generates synthetic datasets for language model training, evaluation, and research. It is built around topic-driven data generation: hierarchical topic trees, or experimental graph-based topic models, guide the creation of diverse, contextually rich training examples.

The library serves researchers, engineers, and practitioners who need high-quality synthetic data for model distillation, agent evaluation, or statistical research. Whether you're generating conversational datasets, creating domain-specific training examples, or building evaluation benchmarks, DeepFabric provides the tools to scale your data generation process while maintaining quality and diversity.

Core Capabilities

DeepFabric operates through a three-stage pipeline that transforms a simple prompt into a comprehensive dataset. The process begins with topic generation, where the system creates either a hierarchical tree structure or a more complex graph representation of your domain. These topics then feed into the dataset generation engine, which produces contextually appropriate training examples. Finally, the system packages everything into standard formats ready for immediate use.

The topic modeling approach sets DeepFabric apart from simple prompt-based generation. Rather than creating isolated examples, the system builds a conceptual map of your domain and generates examples that systematically explore its different aspects. The goal is broader coverage and more consistent quality across your dataset.
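
The three stages described above can be sketched in plain Python. This is only an illustration of the flow, not DeepFabric's implementation: the function names (generate_topics, generate_examples, package_jsonl), the stub_llm stand-in, and the choice of JSON Lines as the output format are all assumptions made for this example.

```python
import json
from typing import Callable

# Hypothetical stand-in for a model call; a real run would query an LLM here.
def stub_llm(prompt: str) -> str:
    return f"response to: {prompt}"

def generate_topics(root: str, branching: int, depth: int) -> list[list[str]]:
    """Stage 1: expand a root prompt into root-to-leaf topic paths."""
    paths = [[root]]
    for level in range(1, depth + 1):
        # Every existing path gets `branching` children at each level.
        paths = [
            path + [f"subtopic {level}.{i}"]
            for path in paths
            for i in range(1, branching + 1)
        ]
    return paths

def generate_examples(
    paths: list[list[str]], llm: Callable[[str], str]
) -> list[dict]:
    """Stage 2: produce one training example per topic path."""
    examples = []
    for path in paths:
        question = "Explain: " + " > ".join(path)
        examples.append(
            {"topic_path": path, "question": question, "answer": llm(question)}
        )
    return examples

def package_jsonl(examples: list[dict]) -> str:
    """Stage 3: serialize the examples as JSON Lines, one record per line."""
    return "\n".join(json.dumps(example) for example in examples)

paths = generate_topics("Python programming", branching=2, depth=2)
dataset = generate_examples(paths, stub_llm)
jsonl = package_jsonl(dataset)
```

Because each example is tied to a distinct topic path, coverage of the domain follows from the shape of the topic structure rather than from repeated sampling of a single prompt.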

Topic Trees and Graphs

Traditional topic trees provide a hierarchical breakdown of subjects, ideal for domains with clear categorical structures. The experimental topic graph feature extends this concept by allowing cross-connections between topics, creating more realistic representations of complex domains where concepts naturally interconnect.

Both approaches use large language models to expand topics and generate relevant content, but they suit different use cases depending on your domain's structure and complexity.
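
The structural difference between the two can be shown with a small sketch. The TopicGraph class below is hypothetical and written only for this illustration; it is not DeepFabric's own data structure. In a tree every topic has at most one parent, while a graph allows a topic to be reached from several parents.

```python
from collections import defaultdict

class TopicGraph:
    """Illustrative directed topic structure (not DeepFabric's own class)."""

    def __init__(self) -> None:
        self.children: dict[str, set[str]] = defaultdict(set)

    def add_edge(self, parent: str, child: str) -> None:
        self.children[parent].add(child)
        # Register the child as a node even if it has no children of its own.
        self.children.setdefault(child, set())

    def parents_of(self, topic: str) -> list[str]:
        return sorted(p for p, kids in self.children.items() if topic in kids)

    def is_tree(self) -> bool:
        """Simple check, assuming the structure is acyclic:
        exactly one root and no topic with more than one parent."""
        parent_counts = [len(self.parents_of(t)) for t in self.children]
        return parent_counts.count(0) == 1 and all(c <= 1 for c in parent_counts)

g = TopicGraph()
g.add_edge("Machine learning", "Optimization")
g.add_edge("Machine learning", "Neural networks")
g.add_edge("Optimization", "Backpropagation")
tree_before = g.is_tree()  # still a strict hierarchy

# A cross-connection: the same topic is also reachable from a second parent.
g.add_edge("Neural networks", "Backpropagation")
tree_after = g.is_tree()
shared_parents = g.parents_of("Backpropagation")
```

Here the single extra edge is what turns the hierarchy into a graph: "Backpropagation" now belongs to two branches at once, which a tree cannot express without duplicating the topic.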

Choosing Between Trees and Graphs

Topic trees work well for domains with clear hierarchical relationships, such as academic subjects, product categories, or organizational structures. Topic graphs excel in interconnected domains like research areas, technical concepts, or social phenomena where relationships span multiple categories.

Getting Started

The fastest path to your first dataset involves three simple steps: installation, configuration, and generation. The Getting Started section walks through this process with practical examples that you can run immediately.

For those preferring configuration-driven workflows, DeepFabric's YAML format provides comprehensive control over every aspect of generation. Developers seeking programmatic integration can access the full API through Python classes that mirror the CLI functionality.

Integration Ecosystem

DeepFabric integrates with the wider machine learning ecosystem. Built on LiteLLM, it supports a wide range of language model providers, including OpenAI, Anthropic, local Ollama instances, and cloud-hosted services. Generated datasets can be exported directly to the Hugging Face Hub with automatically generated dataset cards and metadata.

The modular CLI design supports complex workflows through commands like deepfabric validate for configuration checking, deepfabric visualize for topic graph exploration, and deepfabric upload for streamlined dataset publishing.

Next Steps

Begin with the Installation Guide to set up your environment, then follow the First Dataset tutorial to generate your initial synthetic dataset. The Configuration Guide provides comprehensive coverage of YAML options, while the API Reference documents programmatic usage patterns.