Rate Limiting & Intelligent Retry¶

DeepFabric includes a rate limiting and retry system that handles API rate limits, quota exhaustion, and transient errors across different LLM providers.

Overview¶

The rate limiting system provides:

Provider-Aware: Automatically detects and handles rate limits for OpenAI, Anthropic Claude, Google Gemini, and Ollama
Backoff: Exponential backoff with jitter to prevent thundering herd problems
Retry-After Headers: Respects server-specified wait times when available
Fail-Fast Detection: Identifies non-retryable errors (e.g., daily quota exhaustion) to avoid wasting time
Configurable: Fine-grained control via YAML or Python API
Type-Safe: Full Pydantic validation and type hints

Quick Start¶

Using Provider Defaults¶

The simplest approach is to let DeepFabric use provider-specific defaults:

data_engine:
  provider: "gemini"
  model: "gemini-2.0-flash-exp"
  generation_system_prompt: "You are a helpful AI assistant."
  # Rate limiting uses defaults - no configuration needed!

Each provider has optimized defaults:

OpenAI: max_retries=5, base_delay=1.0s, max_delay=60s
Anthropic: max_retries=5, base_delay=1.0s, max_delay=60s
Gemini: max_retries=5, base_delay=2.0s, max_delay=120s (more conservative)
Ollama: max_retries=2, base_delay=0.5s, max_delay=5s (local, minimal retry)

Custom Rate Limiting¶

For fine-grained control, add a rate_limit section:

data_engine:
  provider: "gemini"
  model: "gemini-2.0-flash-exp"
  generation_system_prompt: "You are a helpful AI assistant."

  rate_limit:
    max_retries: 7
    base_delay: 3.0
    max_delay: 180.0
    backoff_strategy: "exponential_jitter"
    exponential_base: 2.0
    jitter: true
    respect_retry_after: true

Configuration Options¶

`max_retries`¶

Type: Integer (0-20)
Default: 5
Description: Maximum number of retry attempts before giving up

rate_limit:
  max_retries: 5  # Try up to 5 times total

`base_delay`¶

Type: Float (0.1-60.0 seconds)
Default: 1.0 (OpenAI/Anthropic), 2.0 (Gemini)
Description: Base delay in seconds before the first retry

rate_limit:
  base_delay: 2.0  # Wait 2 seconds before first retry

`max_delay`¶

Type: Float (1.0-300.0 seconds)
Default: 60.0 (OpenAI/Anthropic), 120.0 (Gemini)
Description: Maximum delay between retries (prevents excessive wait times)

rate_limit:
  max_delay: 120.0  # Never wait more than 2 minutes

`backoff_strategy`¶

Type: String (enum)
Default: "exponential_jitter"
Options:
"exponential": delay = base_delay * (exponential_base ^ attempt)
"exponential_jitter": Exponential with ±25% randomization (recommended)
"linear": delay = base_delay * attempt
"constant": Always use base_delay

rate_limit:
  backoff_strategy: "exponential_jitter"  # Recommended for production

Why Exponential with Jitter?

Jitter adds randomization (±25%) to prevent the "thundering herd" problem where multiple clients retry simultaneously, creating spikes that trigger more rate limits.

`exponential_base`¶

Type: Float (1.1-10.0)
Default: 2.0
Description: Base multiplier for exponential backoff

rate_limit:
  exponential_base: 2.0  # Delays: 2s, 4s, 8s, 16s, 32s, ...

Example delays with different bases: - base=1.5: 1.5s, 2.25s, 3.375s, 5.06s, 7.59s - base=2.0: 2s, 4s, 8s, 16s, 32s - base=3.0: 3s, 9s, 27s, 81s, ...

`jitter`¶

Type: Boolean
Default: true
Description: Add ±25% randomization to delays

rate_limit:
  jitter: true  # Prevents synchronized retries

`respect_retry_after`¶

Type: Boolean
Default: true
Description: Honor retry-after headers from provider responses

rate_limit:
  respect_retry_after: true  # Use server-specified wait times

When true, the system prioritizes server-specified wait times over calculated backoff.

Provider-Specific Behavior¶

OpenAI¶

Headers Monitored: x-ratelimit-remaining-requests, x-ratelimit-limit-requests, retry-after
Rate Limit Types: RPM (requests per minute), TPM (tokens per minute)
Quota Errors: Distinguishes between rate limits and quota exhaustion
Retry Strategy: Respects retry-after header

data_engine:
  provider: "openai"
  model: "gpt-4"
  rate_limit:
    max_retries: 5
    respect_retry_after: true  # Always honor OpenAI's retry-after

Anthropic Claude¶

Headers Monitored: anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-remaining, retry-after
Algorithm: Token bucket with continuous replenishment
Rate Limit Types: RPM, ITPM (input tokens/min), OTPM (output tokens/min)
Tiers: 4 automatic tiers based on credit purchases

data_engine:
  provider: "anthropic"
  model: "claude-3-5-sonnet-20241022"
  rate_limit:
    max_retries: 5
    base_delay: 1.0
    gradual_rampup: true  # Anthropic recommends gradual traffic increases

Google Gemini¶

Rate Limit Types: RPM, TPM, RPD (requests per day)
Daily Quota: Resets at midnight Pacific time
Error Format: 429 RESOURCE_EXHAUSTED with QuotaFailure details
No Retry-After Header: Uses conservative backoff strategy
Fail-Fast: Detects daily quota exhaustion and stops retrying

data_engine:
  provider: "gemini"
  model: "gemini-2.0-flash-exp"
  rate_limit:
    max_retries: 5
    base_delay: 2.0      # Higher default for Gemini
    max_delay: 120.0     # Longer max for daily quota
    daily_quota_aware: true  # Detect RPD exhaustion

Gemini Daily Quota Exhaustion: When Gemini's daily quota is exhausted (RPD limit), the system detects this and fails fast rather than retrying, since the quota won't reset until midnight Pacific time.

Ollama (Local)¶

Local Deployment: Minimal rate limiting needed
Retry Logic: Primarily for connection issues
Conservative Settings: Lower retries and delays

data_engine:
  provider: "ollama"
  model: "mistral:latest"
  rate_limit:
    max_retries: 2       # Minimal retries for local
    base_delay: 0.5      # Short delays
    max_delay: 5.0

Python API¶

Programmatic Configuration¶

from deepfabric import DataSetGenerator

generator = DataSetGenerator(
    generation_system_prompt="You are a helpful AI assistant.",
    provider="gemini",
    model_name="gemini-2.0-flash-exp",
    temperature=0.5,

    # Rate limiting configuration
    rate_limit={
        "max_retries": 7,
        "base_delay": 3.0,
        "max_delay": 180.0,
        "backoff_strategy": "exponential_jitter",
        "exponential_base": 2.0,
        "jitter": True,
        "respect_retry_after": True,
    }
)

Using Provider Defaults¶

# Omit rate_limit to use intelligent defaults
generator = DataSetGenerator(
    generation_system_prompt="You are a helpful AI assistant.",
    provider="gemini",
    model_name="gemini-2.0-flash-exp",
    # rate_limit automatically uses Gemini defaults
)

Advanced: LLMClient Direct Usage¶

from deepfabric.llm import LLMClient
from deepfabric.llm.rate_limit_config import GeminiRateLimitConfig

# Create custom config
config = GeminiRateLimitConfig(
    max_retries=10,
    base_delay=2.0,
    max_delay=300.0,
    backoff_strategy="exponential_jitter",
    parse_quota_details=True,
    daily_quota_aware=True,
)

client = LLMClient(
    provider="gemini",
    model_name="gemini-2.0-flash-exp",
    rate_limit_config=config,
)

Intelligent Features¶

1. Fail-Fast Detection¶

The system detects errors that shouldn't be retried:

Daily Quota Exhaustion: Gemini RPD (requests per day) won't reset for hours
Zero Quota Limit: Indicates account setup issue, not transient

When detected, the system fails immediately rather than wasting time retrying.

# Example Gemini daily quota error:
# "429 RESOURCE_EXHAUSTED. Quota exceeded for metric:
#  generate_requests_per_model_per_day, limit: 0"
#
# System detects "per_day" and "limit: 0", fails fast

2. Provider-Specific Error Parsing¶

Each provider has unique error formats:

OpenAI:

{
  "error": {
    "message": "Rate limit exceeded",
    "type": "rate_limit_error"
  }
}

Extracts: remaining capacity, retry-after

Anthropic:

{
  "error": {
    "type": "rate_limit_error",
    "message": "This request would exceed your organization's rate limit"
  }
}

Extracts: retry-after, token bucket status

Gemini:

{
  "error": {
    "code": 429,
    "status": "RESOURCE_EXHAUSTED",
    "details": [{
      "@type": "type.googleapis.com/google.rpc.QuotaFailure",
      "violations": [{
        "quotaMetric": "generativelanguage.googleapis.com/generate_requests_per_model_per_day"
      }]
    }]
  }
}

Extracts: quota metric type, daily vs minute limits

3. Exponential Backoff with Jitter¶

Prevents thundering herd when multiple requests retry:

# Without jitter:
# Request 1: retry at 2s, 4s, 8s, 16s...
# Request 2: retry at 2s, 4s, 8s, 16s...
# Request 3: retry at 2s, 4s, 8s, 16s...
# All retry simultaneously → triggers more rate limits!

# With jitter (±25%):
# Request 1: retry at 2.1s, 3.8s, 8.5s, 14.2s...
# Request 2: retry at 1.7s, 4.3s, 7.1s, 17.8s...
# Request 3: retry at 2.4s, 3.5s, 8.9s, 15.1s...
# Distributed retries → smooth load

4. Retry-After Header Priority¶

When providers specify wait time, the system uses it:

# OpenAI response headers:
# retry-after: 15

# System uses 15 seconds (capped at max_delay)
# Ignores calculated exponential delay

5. Retryable vs Non-Retryable Errors¶

Retries on: - 429 (rate limit) - 500, 502, 503, 504 (server errors) - Timeout, connection, network errors

Does NOT retry: - 4xx errors (except 429) - client errors - Authentication failures - Invalid API keys - Daily quota exhaustion (Gemini) - Zero quota limit

Monitoring and Logging¶

The system logs retry attempts with detailed information:

WARNING - Rate limit/transient error for gemini on attempt 1, backing off 2.34s (quota_type: requests_per_minute): 429 RESOURCE_EXHAUSTED
WARNING - Rate limit/transient error for gemini on attempt 2, backing off 4.87s (quota_type: requests_per_minute): 429 RESOURCE_EXHAUSTED
ERROR - Giving up after 5 attempts for gemini: 429 RESOURCE_EXHAUSTED

Log Levels¶

WARNING: Retry attempts with backoff duration
ERROR: Giving up after max retries
DEBUG: Header parsing, quota info extraction

Best Practices¶

1. Start with Defaults¶

Begin with provider defaults and adjust based on observed behavior:

data_engine:
  provider: "gemini"
  # Let DeepFabric use intelligent defaults first

2. Monitor Failure Rates¶

DeepFabric tracks failures by category:

generator.print_failure_summary()

# Output:
# === Failure Analysis Summary ===
# Total Failed Samples: 5
#
# Failure Types Breakdown:
# API Errors: 5
#   1. Rate limit exceeded for gemini/gemini-2.0-flash-exp...

3. Adjust for Your Use Case¶

High Volume, Paid Tier:

rate_limit:
  max_retries: 3          # Fail faster
  base_delay: 0.5         # Quick retries
  max_delay: 10.0         # Short max wait

Free Tier, Aggressive Limits:

rate_limit:
  max_retries: 10         # More persistent
  base_delay: 5.0         # Longer initial wait
  max_delay: 300.0        # Patient max wait
  exponential_base: 2.0   # Exponential growth

Local Ollama:

rate_limit:
  max_retries: 2
  base_delay: 0.5
  max_delay: 5.0

4. Use Batch Sizes Wisely¶

Combine batch size with rate limiting:

dataset:
  creation:
    batch_size: 3         # Smaller batches for aggressive limits
    num_steps: 10

5. Enable Jitter in Production¶

Always use jitter for production workloads:

rate_limit:
  backoff_strategy: "exponential_jitter"
  jitter: true

Troubleshooting¶

Issue: Still Hitting Rate Limits¶

Solution 1: Reduce Batch Size

dataset:
  creation:
    batch_size: 1  # Process one at a time

Solution 2: Increase Base Delay

rate_limit:
  base_delay: 5.0  # Wait longer between retries

Solution 3: Check Provider Tier - OpenAI: Verify tier and limits - Anthropic: Check organization tier - Gemini: Confirm usage tier (Free/1/2/3)

Issue: Daily Quota Exhausted (Gemini)¶

The system detects this and fails fast:

ERROR - Failing fast for gemini: 429 RESOURCE_EXHAUSTED (quota_info: QuotaInfo(is_rate_limit=True, quota_type=requests_per_day, daily_quota_exhausted=True))

Solutions: - Wait until midnight Pacific time - Upgrade Gemini tier - Switch to different provider temporarily

Issue: Too Many Retries Wasting Time¶

Solution: Reduce max_retries

rate_limit:
  max_retries: 3  # Give up sooner

Issue: Requests Timing Out¶

Solution: Increase Request Timeout

data_engine:
  request_timeout: 60  # Increase from default 30s

Migration from Legacy `max_retries`¶

The old max_retries parameter is deprecated in favor of rate_limit:

Before:

data_engine:
  max_retries: 5

After:

data_engine:
  rate_limit:
    max_retries: 5
    # Plus many more options...

The old parameter still works but is ignored if rate_limit is present.

Example: Complete Rate Limiting Setup¶

dataset_system_prompt: "You are a helpful AI assistant."

topic_tree:
  topic_prompt: "Python programming"
  provider: "gemini"
  model: "gemini-2.0-flash-exp"
  temperature: 0.7
  degree: 3
  depth: 2

data_engine:
  provider: "gemini"
  model: "gemini-2.0-flash-exp"
  temperature: 0.5
  generation_system_prompt: "You are a Python instructor."

  # Comprehensive rate limiting configuration
  rate_limit:
    max_retries: 7
    base_delay: 3.0
    max_delay: 180.0
    backoff_strategy: "exponential_jitter"
    exponential_base: 2.0
    jitter: true
    respect_retry_after: true

dataset:
  creation:
    num_steps: 20
    batch_size: 2  # Conservative batch size
    sys_msg: true
  save_as: "python_dataset.jsonl"

Rate Limiting & Intelligent Retry¶

Overview¶

Quick Start¶

Using Provider Defaults¶

Custom Rate Limiting¶

Configuration Options¶

max_retries¶

base_delay¶

max_delay¶

backoff_strategy¶

exponential_base¶

jitter¶

respect_retry_after¶

Provider-Specific Behavior¶

OpenAI¶

Anthropic Claude¶

Google Gemini¶

Ollama (Local)¶

Python API¶

Programmatic Configuration¶

Using Provider Defaults¶

Advanced: LLMClient Direct Usage¶

Intelligent Features¶

1. Fail-Fast Detection¶

2. Provider-Specific Error Parsing¶

3. Exponential Backoff with Jitter¶

4. Retry-After Header Priority¶

5. Retryable vs Non-Retryable Errors¶

Monitoring and Logging¶

Log Levels¶

Best Practices¶

1. Start with Defaults¶

2. Monitor Failure Rates¶

3. Adjust for Your Use Case¶

4. Use Batch Sizes Wisely¶

5. Enable Jitter in Production¶

Troubleshooting¶

Issue: Still Hitting Rate Limits¶

Issue: Daily Quota Exhausted (Gemini)¶

Issue: Too Many Retries Wasting Time¶

Issue: Requests Timing Out¶

Migration from Legacy max_retries¶

Example: Complete Rate Limiting Setup¶

References¶

`max_retries`¶

`base_delay`¶

`max_delay`¶

`backoff_strategy`¶

`exponential_base`¶

`jitter`¶

`respect_retry_after`¶

Migration from Legacy `max_retries`¶