15 min read

Managing Crawl Budget & Rate Limiting

Q: What is the safest default max concurrency for a production crawler?

Start at 3–5 concurrent connections per domain and measure origin response times. Raise the ceiling by 1 only after confirming average response time stays below 500 ms and error rates remain under 1%.

Q: How do I prevent crawl budget waste on parameterized e-commerce URLs?

Strip high-cardinality parameters (session IDs, sort orders, tracking tokens) at the crawler level before queuing URLs. Maintain a canonical URL set and discard normalized duplicates before they enter the request queue.

Q: When should I use a leaky-bucket vs a token-bucket algorithm?

Use a token-bucket when bursting is acceptable for short periods (e.g. CDN-cached static assets). Use a leaky-bucket to enforce a perfectly uniform outbound rate — preferred for origin-direct crawls where even brief bursts trigger 429s.

Q: How do I validate crawl budget gates without running a full site crawl on every PR?

Run diff-based crawl simulations that target only routes modified in the pull request, then fail the build if simulated RPM exceeds the per-domain threshold or orphaned URL chains are detected.

Without explicit rate controls, automated crawlers become a liability: unthrottled request bursts saturate origin thread pools, push response times into the seconds, and trigger CDN circuit breakers — all before a single audit insight has been captured. SREs and SEO engineers both pay the cost: site health degrades during the very process meant to measure it. This page is part of the broader Automated Crawling & Pipeline Tooling framework and covers every layer of control needed to keep crawlers polite and purposeful.

Prerequisites & Environment Setup

Tool / Library	Pinned Version	Purpose
Python	3.11+	Log parsing, URL normalization scripts
PyYAML	6.0.1	Throttle manifest loading
requests / httpx	httpx 0.27	Async HTTP client with connection pooling
flock (util-linux)	2.39+	Concurrency guard for cron-launched crawlers
GitHub Actions runner	ubuntu-22.04	CI budget gate execution environment

Required environment variables before running any script below:

export CRAWL_LOG_DIR="/var/log/nginx/crawl-audit"
export UA_REGEX_FILE="/etc/crawler/ua_patterns.txt"
export CRAWL_OUTPUT_DIR="/data/crawl-artifacts"
export MAX_RPM_THRESHOLD=120
export MAX_CONCURRENCY=5

Step 1 — Establish Crawl Budget Baselines via Server Log Analysis

What breaks without this: without a measured RPM baseline, any rate ceiling you pick is a guess. A ceiling set too low wastes crawl windows; one set too high pushes the origin into thread saturation and produces 503s that silently poison your audit data.

Parse origin access logs (not CDN edge logs, which hide cache misses) to isolate bot user-agents. Calculate requests-per-minute against HTTP status distributions. Map wasted cycles to 404/5xx endpoints. Define RPM ceilings aligned with CDN cache-hit ratios.

#!/usr/bin/env python3
"""
parse_crawl_logs.py — baseline RPM and status distribution per crawler UA.

Usage:
  python3 parse_crawl_logs.py \
    --log-dir /var/log/nginx/crawl-audit \
    --ua-regex-file /etc/crawler/ua_patterns.txt \
    --time-window-min 60 \
    --output-json /data/crawl-artifacts/baseline_$(date +%Y%m%d).json
"""
import argparse
import json
import re
from collections import defaultdict
from datetime import datetime
from pathlib import Path

# Combined Log Format field extraction
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?:\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \d+ "[^"]*" "(?P<ua>[^"]*)"'
)


def parse_logs(log_dir: str, ua_regex_file: str, time_window_min: int) -> dict:
    ua_patterns = [
        re.compile(line.strip())
        for line in open(ua_regex_file)
        if line.strip() and not line.startswith("#")
    ]
    rpm_tracker: dict = defaultdict(lambda: defaultdict(int))
    status_dist: dict = defaultdict(int)
    url_budget_waste: dict = defaultdict(int)

    for log_file in sorted(Path(log_dir).glob("*.log")):
        with open(log_file) as fh:
            for line in fh:
                m = LOG_RE.match(line)
                if not m:
                    continue
                ua = m.group("ua")
                if not any(p.search(ua) for p in ua_patterns):
                    continue

                ts = datetime.strptime(m.group("ts"), "%d/%b/%Y:%H:%M:%S %z")
                status = m.group("status")
                url = m.group("url")

                window_key = ts.replace(second=0, microsecond=0)
                rpm_tracker[ua][window_key] += 1
                status_dist[status] += 1

                if status.startswith(("4", "5")):
                    url_budget_waste[url] += 1

    # Top-20 waste URLs (4xx/5xx with highest hit count)
    top_waste = sorted(url_budget_waste.items(), key=lambda x: x[1], reverse=True)[:20]

    return {
        "rpm_by_bot": {ua: max(counts.values()) for ua, counts in rpm_tracker.items()},
        "status_distribution": dict(status_dist),
        "top_waste_urls": [{"url": u, "error_hits": c} for u, c in top_waste],
        "analysis_window_min": time_window_min,
    }


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--log-dir", required=True)
    parser.add_argument("--ua-regex-file", required=True)
    parser.add_argument("--time-window-min", type=int, default=60)
    parser.add_argument("--output-json", required=True)
    args = parser.parse_args()

    data = parse_logs(args.log_dir, args.ua_regex_file, args.time_window_min)
    with open(args.output_json, "w") as fh:
        json.dump(data, fh, indent=2)
    print(f"Baseline written: {args.output_json}")

Common mistakes:

Analyzing CDN edge logs instead of origin logs — cache hits are invisible there, making RPM look artificially low.
Setting static RPM thresholds without seasonal adjustment — a ceiling measured during low-traffic periods will breach origin limits at peak.
Treating all 200 responses as healthy indexable content without checking canonical headers first.

Step 2 — Core Configuration: Implement Rate Limiting & Concurrency Controls

Key parameters table:

Parameter	Type	Default	Purpose
`algorithm`	string	`token_bucket`	Request pacing algorithm (`token_bucket` or `leaky_bucket`)
`max_concurrency`	int	`5`	Maximum simultaneous open connections per domain
`base_delay_ms`	int	`200`	Minimum inter-request pause in milliseconds
`backoff_multiplier`	float	`1.5`	Exponential backoff coefficient on 429/503
`jitter_range_ms`	int	`100`	Random jitter added to each delay to avoid thundering-herd
`max_delay_ms`	int	`5000`	Hard ceiling on computed adaptive delay
`respect_robots_txt`	bool	`true`	Enforce `Crawl-delay` directives from robots.txt
`retry_on_status`	list	`[429,502,503,504]`	HTTP status codes that trigger backoff-and-retry
`adaptive_delay.enabled`	bool	`true`	Scale delay based on observed response time
`adaptive_delay.response_time_multiplier`	float	`0.5`	Fraction of last response time added to base delay

Deploy a token-bucket algorithm at the crawler level to enforce strict request pacing. Configure exponential backoff with jitter for 429/503 responses to prevent origin saturation. When rendering resource-heavy pages (as covered in Configuring Headless Browsers for JS-Heavy Sites), headless browser instances carry 5–10× the memory overhead of plain HTTP fetchers — lower max_concurrency accordingly.

# /etc/crawler/throttle_manifest.yaml
# Production rate-limiting manifest — adjust per domain after Step 1 baseline.
rate_limiting:
  algorithm: "token_bucket"
  max_concurrency: 5
  base_delay_ms: 200
  max_retries: 3
  backoff_multiplier: 1.5
  jitter_range_ms: 100
  respect_robots_txt: true
  retry_on_status:
    - 429
    - 502
    - 503
    - 504
  adaptive_delay:
    enabled: true
    response_time_multiplier: 0.5
    max_delay_ms: 5000

# Per-domain overrides — values here supersede the defaults above.
domain_overrides:
  "api.example.com":
    max_concurrency: 2
    base_delay_ms: 500
  "cdn.example.com":
    max_concurrency: 10
    base_delay_ms: 50

Common mistakes:

Hardcoding static delays instead of enabling adaptive throttling — origin response times shift with load.
Disabling connection pooling, which forces a fresh TCP handshake per request and adds ~50 ms of avoidable latency per URL.
Applying uniform rate limits to static CDN assets and dynamic origin endpoints — they have completely different capacity profiles.

Step 3 — Execution & Scheduling: Automate Budget Validation in Deployment Workflows

Embed crawl simulation gates into your CI/CD pipeline as a pre-merge check. The gate runs a diff-based crawl simulation against only the routes modified in the pull request, then fails the build if simulated RPM exceeds the per-domain ceiling or if orphaned URL chains are detected. This is the enforcement counterpart to Integrating Custom Crawlers with CI/CD Pipelines, which covers full-run artifact generation and downstream reporting.

Use flock to guard against concurrent cron-triggered crawls colliding with CI-triggered ones:

#!/usr/bin/env bash
# /usr/local/bin/run_crawl_budget_gate.sh
# Called by both cron and GitHub Actions. flock ensures only one instance runs at a time.
set -euo pipefail

LOCK_FILE="/var/run/crawl_budget_gate.lock"
CRAWL_SIM="/opt/crawler/crawler_sim.py"
VALIDATE="/opt/crawler/validate_budget.py"
REPORT="/data/crawl-artifacts/budget_report_$(date +%Y%m%d_%H%M%S).json"

exec 9>"${LOCK_FILE}"
flock --nonblock 9 || { echo "Another crawl gate is running. Exiting."; exit 0; }

python3 "${CRAWL_SIM}" \
  --diff-only \
  --concurrency "${MAX_CONCURRENCY:-3}" \
  --output "${REPORT}"

python3 "${VALIDATE}" \
  --report "${REPORT}" \
  --max-rpm "${MAX_RPM_THRESHOLD:-120}"

GitHub Actions workflow that calls this gate on every PR touching routes or sitemaps:

# .github/workflows/crawl-budget-gate.yml
name: Crawl Budget Validation
on:
  pull_request:
    paths:
      - 'routes/**'
      - 'sitemap.xml'

jobs:
  budget-check:
    runs-on: ubuntu-22.04
    env:
      MAX_CONCURRENCY: "3"
      MAX_RPM_THRESHOLD: "120"
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python 3.11
        uses: actions/setup-python@v5
        with:
          python-version: "3.11"
          cache: "pip"

      - name: Install dependencies
        run: pip install -r requirements.lock

      - name: Run diff-based crawl simulation
        run: |
          python3 /opt/crawler/crawler_sim.py \
            --diff-only \
            --concurrency "$MAX_CONCURRENCY" \
            --output report.json
        timeout-minutes: 15

      - name: Validate budget thresholds
        run: |
          python3 /opt/crawler/validate_budget.py \
            --report report.json \
            --max-rpm "$MAX_RPM_THRESHOLD"

      - name: Upload budget report artifact
        if: always()
        uses: actions/upload-artifact@v4
        with:
          name: crawl-budget-report-${{ github.sha }}
          path: report.json
          retention-days: 30

Cron schedule for off-peak incremental crawls (Sunday 02:00 UTC, with flock guard):

0 2 * * 0  crawler  /usr/local/bin/run_crawl_budget_gate.sh >> /var/log/crawler/budget_gate.log 2>&1

Common mistakes:

Running a full-site crawl on every PR — this exhausts the weekly crawl budget in minutes and blocks the team.
Not caching crawl results between runs, triggering redundant budget consumption on unchanged routes.
Using identical rate-limit settings for staging and production environments — staging origins are typically under-provisioned and will 429 at thresholds that production handles comfortably.

Step 4 — Filter Dynamic Parameters & Normalize URL Space

Before URLs enter the request queue, strip high-cardinality parameters and deduplicate canonical forms. A single faceted e-commerce category page can generate tens of thousands of parameter permutations (?sort=price&color=blue&page=3) that are functionally identical for audit purposes. For SPA and infinite-scroll architectures, apply client-side routing filters as detailed in Handling Dynamic Content in Automated Crawls to prevent the URL queue growing unbounded.

#!/usr/bin/env python3
"""
normalize_urls.py — strip blacklisted params, canonicalize paths, assign crawl priority.

Usage:
  python3 normalize_urls.py \
    --url-list /data/crawl-artifacts/raw_urls.txt \
    --canonicalize \
    --priority-rules-json /etc/crawler/priority_rules.json \
    --output-json /data/crawl-artifacts/normalized_urls.json
"""
import argparse
import json
import re
from urllib.parse import urlparse, parse_qs, urlencode, urlunparse

# Default high-cardinality params to strip before crawling.
DEFAULT_BLACKLIST = frozenset([
    "utm_source", "utm_medium", "utm_campaign", "utm_content", "utm_term",
    "sid", "sessionid", "PHPSESSID", "jsessionid",
    "sort", "order", "page", "ref", "fbclid", "gclid", "_ga",
])


def normalize_url(raw_url: str, blacklist: frozenset, canonicalize: bool) -> str:
    parsed = urlparse(raw_url.strip())
    qs = parse_qs(parsed.query, keep_blank_values=False)
    filtered_qs = {k: v for k, v in qs.items() if k not in blacklist}
    new_query = urlencode(filtered_qs, doseq=True)

    path = parsed.path.lower().rstrip("/") if canonicalize else parsed.path
    if canonicalize:
        new_query = new_query.lower()

    return urlunparse((
        parsed.scheme, parsed.netloc, path,
        parsed.params, new_query, ""  # strip fragment
    ))


def assign_priority(url: str, rules: list) -> int:
    for rule in rules:
        if re.match(rule["pattern"], url):
            return rule["score"]
    return 0


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--url-list", required=True)
    parser.add_argument("--extra-blacklist", nargs="+", default=[])
    parser.add_argument("--canonicalize", action="store_true")
    parser.add_argument("--priority-rules-json", required=True)
    parser.add_argument("--output-json", required=True)
    args = parser.parse_args()

    blacklist = DEFAULT_BLACKLIST | frozenset(args.extra_blacklist)
    rules = json.load(open(args.priority_rules_json))

    seen: set = set()
    results: list = []

    with open(args.url_list) as fh:
        for line in fh:
            url = normalize_url(line, blacklist, args.canonicalize)
            if url in seen or not url:
                continue
            seen.add(url)
            results.append({"url": url, "priority": assign_priority(url, rules)})

    results.sort(key=lambda x: x["priority"], reverse=True)

    with open(args.output_json, "w") as fh:
        json.dump(results, fh, indent=2)
    print(f"Normalized {len(results)} unique URLs → {args.output_json}")


if __name__ == "__main__":
    main()

Example priority_rules.json:

[
  {"pattern": "^https://example\\.com/products/[^?]+$",  "score": 100},
  {"pattern": "^https://example\\.com/categories/[^?]+$", "score": 80},
  {"pattern": "^https://example\\.com/blog/[^?]+$",       "score": 60},
  {"pattern": "^https://example\\.com/$",                 "score": 50},
  {"pattern": ".*",                                        "score": 10}
]

Common mistakes:

Blocking entire parameter namespaces in robots.txt rather than normalizing at the crawler level — Googlebot may still discover and index those URLs through external links.
Failing to strip fragment identifiers (#section-id) from queued URLs — they produce false duplicate counts in the deduplication set.
Over-normalizing by stripping legitimate faceted navigation parameters that represent genuinely distinct content (e.g. /products?material=leather where material is a canonical dimension, not a sort order).

Artifact Capture & Storage

After each crawl gate run, persist the normalized URL list, status distribution, and RPM summary as JSON artifacts. Pair this with the versioned storage practices in Storing & Versioning Crawl Artifacts in Cloud Storage so historical budget baselines are available for trend analysis.

Recommended artifact layout:

/data/crawl-artifacts/
  YYYY-MM-DD/
    baseline_HHMMSS.json        # Step 1 log analysis output
    normalized_urls_HHMMSS.json # Step 4 URL normalizer output
    budget_report_HHMMSS.json   # CI gate validation output
    crawler.log                 # Raw crawler stdout/stderr

Retain 90 days of daily artifacts, 12 months of weekly snapshots. Checksum each file with sha256sum and store the digest alongside the artifact for integrity verification.

Verification Checklist

Run these steps after deploying rate-limit controls to confirm the workflow is operating correctly:

Baseline RPM matches expectations. Run parse_crawl_logs.py against the latest 24-hour origin log window. Confirm rpm_by_bot values fall within 80–90% of the configured MAX_RPM_THRESHOLD.
No 429s in the crawl log. grep -c "429" /var/log/crawler/budget_gate.log should return 0 for a correctly throttled run.
Deduplication ratio is healthy. The ratio of normalized unique URLs to raw input URLs should be 0.5–0.8 for a typical e-commerce site. A ratio above 0.95 suggests the URL blacklist is not catching high-cardinality params.
CI gate fires on threshold breach. In a test branch, raise CRAWL_CONCURRENCY above MAX_RPM_THRESHOLD; confirm the GitHub Actions job exits non-zero.
Artifacts are present and checksummed. Verify today's artifact directory contains baseline_*.json, normalized_urls_*.json, and budget_report_*.json. Confirm sha256sum -c checksums.txt passes.
flock guard prevents double-runs. Launch run_crawl_budget_gate.sh twice in parallel. Confirm the second invocation logs "Another crawl gate is running." and exits cleanly.

Troubleshooting

1. Origin returns 429s mid-crawl despite throttle manifest

Root cause: Adaptive delay is reading response times from cached requests (< 10 ms), so the computed delay collapses to near zero, effectively removing the throttle.

Fix: Exclude CDN-served responses from the adaptive delay input by checking for X-Cache: HIT headers before feeding response time into the delay multiplier.

delay_input_ms = response_time_ms if "x-cache" not in response.headers else base_delay_ms

2. `flock: /var/run/crawl_budget_gate.lock: Permission denied`

Root cause: The lock file was created by a different user (e.g. root during a manual run) and the cron user lacks write access.

Fix:

sudo rm -f /var/run/crawl_budget_gate.lock
sudo chown crawler:crawler /var/run/crawl_budget_gate.lock
# Or move lock to the crawler user's home:
export LOCK_FILE="/home/crawler/run/crawl_budget_gate.lock"
mkdir -p "$(dirname "$LOCK_FILE")"

3. Normalized URL count grows unbounded after deploying SPA

Root cause: Client-side routing generates unique hash fragments or ?modal= parameters that bypass the static blacklist.

Fix: Add a normalization step that strips all URL fragments and adds SPA-specific parameters to the blacklist. For infinite-scroll detection, apply the intercept pattern described in Handling Dynamic Content in Automated Crawls.

DEFAULT_BLACKLIST |= frozenset(["modal", "overlay", "drawer", "tab", "panel"])

4. CI budget gate passes but origin shows post-deploy 503 burst

Root cause: The diff-based simulation uses MAX_CONCURRENCY=3 while the production deployment triggers a full warm-up crawl at the default MAX_CONCURRENCY=5 against an under-provisioned staging origin used as the simulation target.

Fix: Set CRAWL_CONCURRENCY in the CI workflow to match the production crawler config, and point the simulation at the staging environment with the same resource allocation as production.

5. `parse_crawl_logs.py` produces empty `rpm_by_bot`

Root cause: The ua_patterns.txt regex patterns do not match the actual User-Agent strings in the log file (common when using compressed or rotated log archives with different filename patterns).

Fix:

# Verify the log glob pattern finds files:
ls /var/log/nginx/crawl-audit/*.log

# Test UA pattern matching against a sample line:
python3 -c "
import re
patterns = [re.compile(l.strip()) for l in open('/etc/crawler/ua_patterns.txt') if l.strip()]
sample_ua = 'Mozilla/5.0 (compatible; AuditBot/1.0)'
print(any(p.search(sample_ua) for p in patterns))
"

6. GitHub Actions upload step fails with `retention-days` error

Root cause: Repository is on the free plan where the maximum artifact retention is 90 days; 90+ days causes a schema validation error in upload-artifact@v4.

Fix: Set retention-days: 30 or reduce to match the plan limit. The budget report JSON should be mirrored to cloud storage (S3/R2) for longer-term retention regardless.

FAQ

What is the safest default max_concurrency for a production crawler?

Start at 3–5 concurrent connections per domain and measure origin response times at that level. Raise the ceiling by 1 only after confirming average response time stays below 500 ms and error rates remain under 1%. Never increase concurrency after a 429 — wait for the adaptive delay to stabilize first.

How do I prevent crawl budget waste on parameterized e-commerce URLs?

Strip high-cardinality parameters (session IDs, sort orders, tracking tokens) at the crawler level before queuing URLs. Maintain a canonical URL set using the normalize_urls.py script above, and discard any URL whose normalized form is already in the seen set.

When should I use a leaky-bucket vs a token-bucket algorithm?

Use a token-bucket when short bursts against CDN-cached content are acceptable — the bucket refills at a fixed rate and allows accumulated tokens to be spent quickly. Use a leaky-bucket to enforce a perfectly uniform outbound rate; preferred for origin-direct crawls where even brief bursts trigger 429s or breach rate-limit headers.

How do I validate crawl budget gates without running a full site crawl on every PR?

Use the --diff-only flag in the simulation step, which resolves only the URL paths touched by files changed in the pull request's diff. This keeps CI job time under 5 minutes for typical PRs and avoids exhausting the weekly crawl allocation on test runs.

Automated Crawling & Pipeline Tooling — parent section covering the full crawler architecture
Configuring Headless Browsers for JS-Heavy Sites — concurrency and memory implications of browser-based crawling
Integrating Custom Crawlers with CI/CD Pipelines — standardized artifact generation and reporting workflows
Storing & Versioning Crawl Artifacts in Cloud Storage — retention policies and versioning for budget baseline data
Handling Dynamic Content in Automated Crawls — SPA routing and infinite-scroll detection
Throttling Crawls to Respect Server Rate Limits — per-host token buckets and Retry-After handling

Managing Crawl Budget & Rate Limiting #

Prerequisites & Environment Setup #

Step 1 — Establish Crawl Budget Baselines via Server Log Analysis #

Step 2 — Core Configuration: Implement Rate Limiting & Concurrency Controls #

Step 3 — Execution & Scheduling: Automate Budget Validation in Deployment Workflows #

Step 4 — Filter Dynamic Parameters & Normalize URL Space #

Artifact Capture & Storage #

Verification Checklist #

Troubleshooting #

1. Origin returns 429s mid-crawl despite throttle manifest #

2. flock: /var/run/crawl_budget_gate.lock: Permission denied #

3. Normalized URL count grows unbounded after deploying SPA #

4. CI budget gate passes but origin shows post-deploy 503 burst #

5. parse_crawl_logs.py produces empty rpm_by_bot #

6. GitHub Actions upload step fails with retention-days error #

Related #

Managing Crawl Budget & Rate Limiting

Prerequisites & Environment Setup

Step 1 — Establish Crawl Budget Baselines via Server Log Analysis

Step 2 — Core Configuration: Implement Rate Limiting & Concurrency Controls

Step 3 — Execution & Scheduling: Automate Budget Validation in Deployment Workflows

Step 4 — Filter Dynamic Parameters & Normalize URL Space

Artifact Capture & Storage

Verification Checklist

Troubleshooting

1. Origin returns 429s mid-crawl despite throttle manifest

2. `flock: /var/run/crawl_budget_gate.lock: Permission denied`

3. Normalized URL count grows unbounded after deploying SPA

4. CI budget gate passes but origin shows post-deploy 503 burst

5. `parse_crawl_logs.py` produces empty `rpm_by_bot`

6. GitHub Actions upload step fails with `retention-days` error

Related