4 min read

Managing Crawl Budget & Rate Limiting

Establish Crawl Budget Baselines via Server Log Analysis

Parse origin access logs to isolate bot user-agents. Calculate requests-per-minute (RPM) against HTTP status distributions. Map wasted cycles to 404/5xx endpoints. Define RPM ceilings aligned with CDN cache-hit ratios. Integrate findings into the broader Automated Crawling & Pipeline Tooling architecture to standardize metric collection across environments.

Implementation Steps

  • Aggregate 30-day server logs using ELK or CloudWatch.
  • Filter traffic by known crawler user-agent regex patterns.
  • Calculate RPM per bot and correlate with response codes.
  • Identify parameterized URL clusters consuming >15% of total budget.

Code Example

#!/usr/bin/env python3
import argparse
import json
import re
from collections import defaultdict
from datetime import datetime, timedelta
from pathlib import Path

def parse_logs(log_dir, ua_regex_file, time_window_min):
    ua_patterns = [re.compile(line.strip()) for line in open(ua_regex_file)]
    rpm_tracker = defaultdict(lambda: defaultdict(int))
    status_dist = defaultdict(int)

    for log_file in Path(log_dir).glob("*.log"):
        with open(log_file) as f:
            for line in f:
                # Extract timestamp, UA, status, URL via regex
                ts_str, ua, status, url = extract_fields(line)
                ts = datetime.strptime(ts_str, "%d/%b/%Y:%H:%M:%S %z")

                if any(p.search(ua) for p in ua_patterns):
                    window_key = ts.replace(second=0, microsecond=0)
                    rpm_tracker[ua][window_key] += 1
                    status_dist[status] += 1

    # Aggregate and filter by time window
    results = {
        "rpm_by_bot": {ua: max(counts.values()) for ua, counts in rpm_tracker.items()},
        "status_distribution": dict(status_dist),
        "analysis_window_min": time_window_min
    }
    return results

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--log-dir", required=True)
    parser.add_argument("--ua-regex-file", required=True)
    parser.add_argument("--time-window-min", type=int, default=60)
    parser.add_argument("--output-json", required=True)
    args = parser.parse_args()

    data = parse_logs(args.log_dir, args.ua_regex_file, args.time_window_min)
    with open(args.output_json, "w") as f:
        json.dump(data, f, indent=2)

Common Mistakes

  • Analyzing CDN logs without cross-referencing origin logs for cache misses.
  • Setting static RPM thresholds without accounting for seasonal traffic spikes.
  • Treating all 200 responses as valid indexable content without canonical checks.

Implement Deterministic Rate Limiting & Concurrency Controls

Deploy token-bucket algorithms at the crawler level to enforce strict request pacing. Configure exponential backoff with jitter for 429/503 responses to prevent origin saturation. Adjust concurrency limits based on endpoint weight. This is critical when rendering resource-heavy pages where Configuring Headless Browsers for JS-Heavy Sites dictates higher compute overhead and stricter rate ceilings.

Implementation Steps

  • Set max concurrent connections per domain (default: 5).
  • Implement leaky-bucket queue for request dispatching.
  • Configure adaptive delay: base_delay_ms + (response_time * multiplier).
  • Enforce robots.txt Crawl-delay compliance for non-Google bots.

Code Example

# crawler_throttle_manifest.yaml
rate_limiting:
 algorithm: "token_bucket"
 max_concurrency: 5
 base_delay_ms: 200
 max_retries: 3
 backoff_multiplier: 1.5
 jitter_range_ms: 100
 respect_robots_txt: true
 retry_on_status:
 - 429
 - 502
 - 503
 - 504
 adaptive_delay:
 enabled: true
 response_time_multiplier: 0.5
 max_delay_ms: 5000

Common Mistakes

  • Hardcoding static delays instead of implementing adaptive throttling.
  • Disabling connection pooling, increasing TCP handshake latency.
  • Applying uniform rate limits to static assets and dynamic API endpoints.

Automate Budget Validation in Deployment Workflows

Embed crawl simulation gates into CI/CD pipelines to validate new route structures against budget caps before deployment. Configure pipeline jobs to fail builds if simulated RPM exceeds defined thresholds or if orphaned URL chains are detected. Reference Integrating Custom Crawlers with CI/CD Pipelines for standardized artifact generation and automated reporting workflows.

Implementation Steps

  • Add pre-merge crawl simulation targeting only modified routes.
  • Set pipeline failure conditions on budget threshold breaches.
  • Generate JSON/CSV crawl reports as build artifacts.
  • Schedule incremental off-peak crawls via cron/workflow schedulers.

Code Example

# .github/workflows/crawl-budget-gate.yml
name: Crawl Budget Validation
on:
  pull_request:
    paths:
      - 'routes/**'
      - 'sitemap.xml'

jobs:
  budget-check:
    runs-on: ubuntu-latest
    env:
      CRAWL_CONCURRENCY: 3
      MAX_RPM_THRESHOLD: 120
    steps:
      - uses: actions/checkout@v4
      - name: Run Diff-Based Crawl Simulation
        run: |
          python3 crawler_sim.py \
            --diff-only \
            --concurrency $CRAWL_CONCURRENCY \
            --output report.json
      - name: Validate Budget Thresholds
        run: |
          python3 validate_budget.py --report report.json --max-rpm $MAX_RPM_THRESHOLD
        env:
          fail_on_budget_exceeded: true
        timeout-minutes: 15

Common Mistakes

  • Executing full-site crawls on every PR instead of targeted diff-based scans.
  • Failing to cache crawl results, causing redundant budget consumption.
  • Not isolating staging rate limits from production baselines.

Filter Dynamic Parameters & Normalize URL Space

Reduce wasted crawl cycles by deduplicating parameterized URLs and prioritizing canonical paths. Implement regex-based stripping for session IDs, tracking parameters, and sort filters. Map high-value content paths to priority queues. For SPAs and infinite-scroll architectures, apply client-side routing filters as detailed in Handling Dynamic Content in Automated Crawls to prevent infinite URL generation.

Implementation Steps

  • Parse query strings to identify high-cardinality parameters.
  • Apply normalization rules: strip UTM, lowercase paths, remove trailing slashes.
  • Configure crawler to ignore parameter permutations exceeding threshold.
  • Assign priority scores to core content vs. faceted navigation paths.

Code Example

#!/usr/bin/env python3
import argparse
import json
import re
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

def normalize_urls(url_list, param_blacklist, canonicalize, priority_rules):
    normalized = []
    rules = json.load(open(priority_rules))

    for raw_url in open(url_list):
        parsed = urlparse(raw_url.strip())
        qs = parse_qs(parsed.query)

        # Strip blacklisted params
        filtered_qs = {k: v for k, v in qs.items() if k not in param_blacklist}
        new_query = urlencode(filtered_qs, doseq=True) if filtered_qs else ""

        # Canonicalize
        if canonicalize:
            new_path = parsed.path.lower().rstrip("/")
            new_query = new_query.lower()

        final_url = urlunparse((parsed.scheme, parsed.netloc, new_path, parsed.params, new_query, parsed.fragment))

        # Assign priority
        priority = next((r["score"] for r in rules if re.match(r["pattern"], final_url)), 0)
        normalized.append({"url": final_url, "priority": priority})

    return sorted(normalized, key=lambda x: x["priority"], reverse=True)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--url-list", required=True)
    parser.add_argument("--param-blacklist", nargs="+", default=["utm_source", "utm_medium", "sid"])
    parser.add_argument("--canonicalize", action="store_true")
    parser.add_argument("--priority-rules-json", required=True)
    args = parser.parse_args()

    results = normalize_urls(args.url_list, args.param_blacklist, args.canonicalize, args.priority_rules_json)
    for r in results:
        print(json.dumps(r))

Common Mistakes

  • Blocking entire parameter sets via robots.txt instead of normalizing at the crawler level.
  • Failing to account for client-side routing generating infinite scroll URLs.
  • Over-normalizing and stripping legitimate faceted navigation parameters.