13 min read

Defining Crawl Depth & Scope for Enterprise Sites

Q: How do I prevent parameter-driven URL bloat from inflating crawl scope?

Define regex exclude_patterns for known parameter prefixes (utm_, sort=, ref=, page=). Combine with canonical normalization so the crawler treats parameter variants as a single canonical URL.

Q: Should I crawl staging and production in the same pipeline run?

No. Use separate allowed_domains lists and separate config files. Mixing environments inflates crawl scope and produces misleading baseline metrics.

Q: How do I handle subdomains with different content architectures?

Assign per-subdomain budget_cap and max_depth overrides in your pipeline config. A documentation subdomain (docs.example.com) may require depth 6, while a marketing site stays at depth 3.

Without a deterministic scope definition, enterprise crawlers over-collect parameterized junk URLs, re-crawl canonicalized duplicates, and exhaust crawl budget on staging paths that should never be indexed. The result is noisy audit data, misleading coverage metrics, and production latency spikes from unbounded concurrent requests. SREs, SEO engineers, and technical leads responsible for automated audit pipelines are the primary audience for this workflow — see Technical Audit Fundamentals & Scope Mapping for the full context in which scope definition sits.

Pipeline stages overview

Prerequisites & environment setup

The following versions and environment variables are pinned to ensure reproducibility across CI runs:

Dependency	Minimum version	Install
Python	3.11	`pyenv install 3.11`
Scrapy	2.11.2	`pip install scrapy==2.11.2`
Playwright (headless)	1.44.0	`pip install playwright==1.44.0 && playwright install chromium`
jq	1.7	system package
flock	util-linux 2.38	system package

Required environment variables — export these before executing any pipeline step:

export CRAWL_ROOT="/var/lib/site-audit"
export CRAWL_DOMAIN="example.com"
export CRAWL_ENV="production"          # production | staging
export CRAWL_MAX_DEPTH=4
export CRAWL_CONCURRENCY=6
export CRAWL_DELAY_MS=250
export CRAWL_BUDGET_WWW=50000
export CRAWL_BUDGET_BLOG=20000
export CRAWL_OUTPUT_DIR="${CRAWL_ROOT}/crawls/$(date -u +%Y-%m-%dT%H:%M:%SZ)"

Lock your Python dependencies:

pip freeze > "${CRAWL_ROOT}/requirements.lock"

Step 1 — Initialization: boundary definition

Establish crawl boundaries by ingesting all XML sitemaps and parsing robots.txt directives. Resolving canonical chains at this stage — before the crawler fires — prevents scope inflation from 301/302 redirect loops and self-referential canonicals. Use how to map URL hierarchies before running a crawl to calculate directory depth and identify parent-child routing patterns before setting max_depth.

#!/usr/bin/env python3
# /var/lib/site-audit/scripts/init_scope.py
"""
Parse sitemaps, strip parameter noise, resolve canonicals,
and emit a deduplicated scope manifest for the crawler.
"""
import re
import hashlib
import xml.etree.ElementTree as ET
from urllib.parse import urlparse, urljoin
from pathlib import Path
import os

DOMAIN        = os.environ["CRAWL_DOMAIN"]
OUTPUT_DIR    = Path(os.environ["CRAWL_OUTPUT_DIR"])
MAX_DEPTH     = int(os.environ.get("CRAWL_MAX_DEPTH", 4))
SITEMAP_URLS  = [
    f"https://{DOMAIN}/sitemap.xml",
    f"https://blog.{DOMAIN}/sitemap.xml",
]
EXCLUDE_PATTERNS = [
    r"^/staging/",
    r"^/dev/",
    r"\?utm_",
    r"\?sort=",
    r"\?ref=",
    r"\?page=\d{3,}",   # pagination beyond 3 digits
]

OUTPUT_DIR.mkdir(parents=True, exist_ok=True)


def calculate_url_depth(url: str) -> int:
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments)


def apply_scope_filters(url: str) -> bool:
    for pattern in EXCLUDE_PATTERNS:
        if re.search(pattern, url):
            return False
    depth = calculate_url_depth(url)
    return depth <= MAX_DEPTH


def url_fingerprint(url: str) -> str:
    return hashlib.sha256(url.encode()).hexdigest()


def parse_sitemap(raw_xml: str, base: str) -> list[str]:
    ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
    root = ET.fromstring(raw_xml)
    urls: list[str] = []
    for loc in root.findall(".//sm:loc", ns):
        if loc.text:
            urls.append(urljoin(base, loc.text.strip()))
    return urls


def build_scope_manifest() -> None:
    seen: set[str] = set()
    accepted: list[str] = []

    # In production, fetch each sitemap URL over HTTP
    # Here we illustrate the parsing logic with a placeholder loader
    for sitemap_url in SITEMAP_URLS:
        # raw = requests.get(sitemap_url, timeout=10).text
        raw = "<urlset xmlns='http://www.sitemaps.org/schemas/sitemap/0.9'></urlset>"
        for url in parse_sitemap(raw, sitemap_url):
            fp = url_fingerprint(url)
            if fp in seen:
                continue
            seen.add(fp)
            if apply_scope_filters(url):
                accepted.append(url)

    manifest_path = OUTPUT_DIR / "scope_manifest.txt"
    manifest_path.write_text("\n".join(sorted(accepted)))
    print(f"Scope manifest written: {len(accepted)} URLs → {manifest_path}")


if __name__ == "__main__":
    build_scope_manifest()

Key metrics from this stage:

Metric	Meaning
`sitemap_coverage_ratio`	Accepted URLs ÷ total sitemap URLs
`robots_disallow_count`	Paths blocked by `robots.txt`
`max_depth_distribution`	Count of URLs per depth tier
`canonical_resolution_rate`	Canonicals resolved without error

Step 2 — Core configuration: parameters & production values

This is the central configuration file committed to version control. All values are overridden by environment variables at runtime — CI injects secrets; the YAML file holds only structural defaults. Before deploying the crawler, configure rate limiting to align concurrency with CDN edge caching rules and prevent origin degradation.

Parameter	Type	Default	Purpose
`max_depth`	int	`4`	Maximum URL depth traversed. Above 7 causes exponential explosion.
`concurrent_requests`	int	`6`	Parallel HTTP requests. Scale with server headroom.
`delay_ms`	int	`250`	Fixed inter-request delay in milliseconds. Overridden by `Crawl-Delay` in `robots.txt`.
`allowed_domains`	list[str]	`[]`	Domain allowlist. Any URL outside this list is discarded.
`exclude_patterns`	list[str]	`[]`	Regex patterns for URL exclusion applied before queueing.
`budget_cap_per_subdomain`	dict[str, int]	`{}`	Hard URL cap per subdomain. Prevents one section exhausting the full budget.
`backoff_initial_ms`	int	`1000`	Starting delay for exponential backoff on 429 / 5xx responses.
`backoff_max_ms`	int	`30000`	Maximum delay cap for exponential backoff.
`session_persistence`	bool	`false`	Maintain session cookies for authenticated routes.

# /var/lib/site-audit/config/crawler_config.yaml
# All secrets injected via environment variables at runtime.
pipeline:
  max_depth: ${CRAWL_MAX_DEPTH:-4}
  concurrent_requests: ${CRAWL_CONCURRENCY:-6}
  delay_ms: ${CRAWL_DELAY_MS:-250}
  backoff_initial_ms: 1000
  backoff_max_ms: 30000
  session_persistence: false

  allowed_domains:
    - "${CRAWL_DOMAIN}"
    - "cdn.${CRAWL_DOMAIN}"
    - "blog.${CRAWL_DOMAIN}"

  exclude_patterns:
    - "^/staging/"
    - "^/dev/"
    - "^/.*\\?utm_"
    - "^/.*\\?sort="
    - "^/.*\\?ref="
    - "^/.*\\?page=[0-9]{3,}"

  budget_cap_per_subdomain:
    "www.${CRAWL_DOMAIN}": ${CRAWL_BUDGET_WWW:-50000}
    "blog.${CRAWL_DOMAIN}": ${CRAWL_BUDGET_BLOG:-20000}

output:
  dir: "${CRAWL_OUTPUT_DIR}"
  format: "parquet"          # parquet | json
  compress: true
  retention_days: 90

Step 3 — Execution & scheduling

Run the crawler in staging mode first. Production execution is triggered via cron with flock to prevent overlapping runs. Timezone is explicitly set to UTC — any TTFB or timestamp drift from local timezone offsets will corrupt trend comparisons when tracking metric trends across release cycles.

#!/usr/bin/env bash
# /var/lib/site-audit/scripts/run_crawl.sh
set -euo pipefail

LOCK_FILE="/var/lock/site-audit-crawl.lock"
SCRIPT_DIR="/var/lib/site-audit/scripts"
CONFIG="/var/lib/site-audit/config/crawler_config.yaml"
LOG_DIR="${CRAWL_OUTPUT_DIR}/logs"

mkdir -p "${LOG_DIR}"

# Prevent concurrent runs; exit 0 if already locked (cron-safe)
exec 9>"${LOCK_FILE}"
if ! flock -n 9; then
  echo "[$(date -u +%FT%TZ)] Crawl already in progress, exiting." | tee -a "${LOG_DIR}/run.log"
  exit 0
fi

echo "[$(date -u +%FT%TZ)] Starting crawl: ${CRAWL_DOMAIN} (${CRAWL_ENV})" | tee -a "${LOG_DIR}/run.log"

# Step 1: build scope manifest
TZ=UTC python3 "${SCRIPT_DIR}/init_scope.py" 2>&1 | tee -a "${LOG_DIR}/init.log"

# Step 2: execute Scrapy crawler against manifest
TZ=UTC scrapy crawl enterprise_crawler \
  --set CRAWL_CONFIG="${CONFIG}" \
  --set SCOPE_MANIFEST="${CRAWL_OUTPUT_DIR}/scope_manifest.txt" \
  --logfile "${LOG_DIR}/scrapy.log" \
  --loglevel INFO

# Step 3: run headless browser validation for JS-dependent routes
TZ=UTC python3 "${SCRIPT_DIR}/headless_validate.py" \
  --input "${CRAWL_OUTPUT_DIR}/scope_manifest.txt" \
  --output "${CRAWL_OUTPUT_DIR}/js_render_results.json" \
  2>&1 | tee -a "${LOG_DIR}/headless.log"

echo "[$(date -u +%FT%TZ)] Crawl complete." | tee -a "${LOG_DIR}/run.log"

Cron schedule — fire at 02:00 UTC every Sunday:

0 2 * * 0 TZ=UTC /var/lib/site-audit/scripts/run_crawl.sh >> /var/log/site-audit-crawl.log 2>&1

Step 4 — Artifact capture & storage

Normalize HTTP status codes, render-time metrics, and redirect chains into a standardized Parquet schema before downstream ETL consumes them. Cross-reference initial outputs with establishing baseline health metrics for new domains to calibrate alert thresholds and detect scope drift before scaling to production. Storing and versioning crawl artifacts in object storage completes the retention chain.

#!/usr/bin/env python3
# /var/lib/site-audit/scripts/normalize_artifacts.py
"""
Read raw Scrapy JSON output, normalize status codes,
compute depth-tier TTFB percentiles, write to Parquet.
"""
import json
import os
import statistics
from pathlib import Path
from datetime import timezone, datetime

OUTPUT_DIR = Path(os.environ["CRAWL_OUTPUT_DIR"])
RAW_LOG    = OUTPUT_DIR / "raw_crawl.jsonl"
OUT_FILE   = OUTPUT_DIR / "normalized_crawl.parquet.json"  # JSON proxy


def normalize_status(code: int) -> str:
    if 300 <= code <= 399:
        return "redirect"
    if code in (404, 410):
        return "removed"
    if code >= 500:
        return "error"
    return "success"


def calculate_url_depth(path: str) -> int:
    return len([s for s in path.split("/") if s])


def process_logs() -> None:
    records: list[dict] = []
    depth_ttfb: dict[int, list[float]] = {}

    for line in RAW_LOG.read_text().splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        depth = calculate_url_depth(entry.get("url_path", "/"))
        ttfb  = entry.get("ttfb_ms", 0.0)
        norm  = {
            "url_path":          entry.get("url_path"),
            "depth_tier":        depth,
            "normalized_status": normalize_status(entry.get("status_code", 200)),
            "ttfb_ms":           ttfb,
            "crawled_at":        datetime.now(timezone.utc).isoformat(),
        }
        records.append(norm)
        depth_ttfb.setdefault(depth, []).append(ttfb)

    # Annotate with depth-tier p95 TTFB
    p95_by_depth = {
        d: statistics.quantiles(times, n=20)[18]  # 95th percentile
        for d, times in depth_ttfb.items()
        if len(times) >= 20
    }
    for rec in records:
        rec["p95_ttfb_depth_tier"] = p95_by_depth.get(rec["depth_tier"])

    OUT_FILE.write_text(json.dumps(records, indent=2))
    print(f"Normalized {len(records)} records → {OUT_FILE}")


if __name__ == "__main__":
    process_logs()

Artifact retention and versioning strategy:

Format: Parquet (compressed Snappy) for columnar analytics; JSON sidecar for human inspection.
Path convention: ${CRAWL_ROOT}/crawls/<ISO-8601-timestamp>/ — one directory per run, never overwritten.
Retention: 90-day rolling window. A nightly cleanup job purges directories older than 90 days.
Checksums: SHA-256 checksum file written alongside each artifact at completion.

# Compute and store artifact checksum
sha256sum "${CRAWL_OUTPUT_DIR}/normalized_crawl.parquet.json" \
  > "${CRAWL_OUTPUT_DIR}/normalized_crawl.parquet.json.sha256"

Verification checklist

Run these checks after each crawl to confirm the workflow executed correctly:

Scope manifest size is non-zero.

wc -l "${CRAWL_OUTPUT_DIR}/scope_manifest.txt"
# Expected: > 0 lines; alert if < previous run's count by > 20%

Log file contains a clean completion line.

grep "Crawl complete" "${CRAWL_OUTPUT_DIR}/logs/run.log"
# Expected: one timestamped line; absence means the run was interrupted

HTTP error rate is within tolerance.

jq '[.[] | select(.normalized_status == "error")] | length' \
  "${CRAWL_OUTPUT_DIR}/normalized_crawl.parquet.json"
# Expected: < 1% of total records; investigate any 5xx cluster

Artifact checksum matches.

sha256sum --check "${CRAWL_OUTPUT_DIR}/normalized_crawl.parquet.json.sha256"
# Expected: OK

No out-of-scope URLs present.

grep -vF "${CRAWL_DOMAIN}" "${CRAWL_OUTPUT_DIR}/scope_manifest.txt" | wc -l
# Expected: 0 lines; any non-zero count indicates allowlist leak

Crawl budget utilization does not exceed 95% on any subdomain.

jq 'group_by(.url_path | split("/")[2]) | map({subdomain: .[0].url_path | split("/")[2], count: length})' \
  "${CRAWL_OUTPUT_DIR}/normalized_crawl.parquet.json"

Troubleshooting

Failure 1: Exponential URL explosion — crawl runs indefinitely

Root cause: max_depth set too high (> 7) on a faceted navigation tree. Parameterized facets multiply at each depth level.

Diagnostic:

awk -F'/' '{print NF-1}' "${CRAWL_OUTPUT_DIR}/scope_manifest.txt" | sort -n | uniq -c
# Look for disproportionate counts at depth >= 5

Fix:

export CRAWL_MAX_DEPTH=4
# Add facet patterns to exclude_patterns in crawler_config.yaml
echo '    - "^/.*\\?color="' >> /var/lib/site-audit/config/crawler_config.yaml

Failure 2: IP ban or origin throttling (HTTP 429 flood)

Root cause: Crawl-Delay directive in robots.txt was ignored, or concurrent_requests too high for shared hosting.

Diagnostic:

grep "429" "${CRAWL_OUTPUT_DIR}/logs/scrapy.log" | wc -l

Fix:

# Read the actual Crawl-Delay from robots.txt before re-running
curl -s "https://${CRAWL_DOMAIN}/robots.txt" | grep -i "crawl-delay"
# Update config
export CRAWL_DELAY_MS=1000
export CRAWL_CONCURRENCY=2

Failure 3: Canonical chain inflation skews coverage metrics

Root cause: Crawled URLs included canonical-redirect targets that themselves had differing canonicals — creating chains of depth 3+ that the crawler followed as distinct URLs.

Diagnostic:

jq '[.[] | select(.normalized_status == "redirect")] | length' \
  "${CRAWL_OUTPUT_DIR}/normalized_crawl.parquet.json"

Fix:

# In init_scope.py: resolve the canonical chain before adding to manifest
def resolve_canonical(url: str, headers: dict) -> str:
    """Follow canonical rel links up to 3 hops; return terminal URL."""
    canonical = headers.get("Link", "")
    # parse rel=canonical from Link header or HTML <link> tag
    # return the resolved terminal canonical
    return url  # fallback

Failure 4: Staging URLs appear in production crawl output

Root cause: allowed_domains list included a staging CNAME that resolves to the production origin, or environment variable substitution failed in YAML.

Diagnostic:

grep "staging\|dev\." "${CRAWL_OUTPUT_DIR}/scope_manifest.txt" | head -20

Fix:

# Verify env substitution rendered correctly in the active config
grep "allowed_domains" -A 10 "${CRAWL_OUTPUT_DIR}/logs/scrapy.log"
# Rebuild manifest after correcting allowed_domains
export CRAWL_ENV="production"
python3 /var/lib/site-audit/scripts/init_scope.py

Failure 5: Flock guard not available in containerized environment

Root cause: util-linux not installed in the base Docker image.

Diagnostic:

which flock || echo "flock not found"

Fix:

# In Dockerfile
RUN apt-get update && apt-get install -y --no-install-recommends util-linux && rm -rf /var/lib/apt/lists/*

Failure 6: p95 TTFB annotation missing for shallow URL tiers

Root cause: Fewer than 20 URLs at depth 1 or 2 — statistics.quantiles(n=20) requires at least 20 data points.

Fix:

# Replace the p95 calculation with a fallback for small samples
p95 = statistics.quantiles(times, n=20)[18] if len(times) >= 20 \
      else max(times)  # use max as conservative proxy for tiny sets

Post-crawl scope auditing

Audit the final crawl output against the approved scope manifest. The risk scoring frameworks for technical debt provide a principled method for prioritizing the remediation backlog produced by this diff. Map orphaned nodes, infinite pagination traps, and parameter-driven bloat to severity tiers before generating tickets.

#!/usr/bin/env bash
# /var/lib/site-audit/scripts/audit_scope.py (shell wrapper)
set -euo pipefail

MANIFEST="${CRAWL_OUTPUT_DIR}/scope_manifest.txt"
CRAWLED="${CRAWL_OUTPUT_DIR}/crawled_urls.txt"
DIFF_OUT="${CRAWL_OUTPUT_DIR}/scope_diff.txt"

# Extract crawled URLs from normalized output
jq -r '.[].url_path' "${CRAWL_OUTPUT_DIR}/normalized_crawl.parquet.json" \
  | sort -u > "${CRAWLED}"

# Out-of-bounds: crawled but not in approved manifest
comm -23 <(sort "${CRAWLED}") <(sort "${MANIFEST}") > "${DIFF_OUT}"

echo "Out-of-scope URLs found: $(wc -l < "${DIFF_OUT}")"
if [[ $(wc -l < "${DIFF_OUT}") -gt 0 ]]; then
  head -20 "${DIFF_OUT}"
  exit 1
fi

What max_depth value should I set for a large enterprise catalog?

Start at 4 for most enterprise catalogs. Override per subdomain if content architecture differs — blog sections rarely exceed depth 3 while product facets can reach 6. Setting depth above 7 causes exponential URL explosion that exhausts the crawl budget before high-priority content trees are covered.

How do I prevent parameter-driven URL bloat from inflating crawl scope?

Define exclude_patterns in your config for known parameter prefixes (utm_, sort=, ref=, page= beyond 2 digits). Combine with canonical normalization so the crawler treats parameter variants as a single canonical URL. Audit scope_manifest.txt after initialization to confirm the filter is working before committing a full run.

Should I crawl staging and production in the same pipeline run?

No. Use separate allowed_domains lists and separate config files. Mixing environments inflates crawl scope, produces misleading baseline metrics, and risks accidentally indexing staging content if the crawler output feeds a downstream indexing pipeline.

How do I handle subdomains with different content architectures?

Assign per-subdomain budget_cap and max_depth overrides in your pipeline config. A documentation subdomain (docs.example.com) may require depth 6, while a marketing site stays at depth 3. Document the rationale for each override in a comment adjacent to the value so future maintainers understand the constraint.

Technical Audit Fundamentals & Scope Mapping — parent section covering the full audit lifecycle
How to Map URL Hierarchies Before Running a Crawl — companion guide for depth calculation and directory routing patterns
Establishing Baseline Health Metrics for New Domains — calibrate alert thresholds against normalized crawl outputs
Managing Crawl Budget & Rate Limiting — concurrency controls and backoff configuration
Risk Scoring Frameworks for Technical Debt — prioritize the remediation backlog identified by scope auditing

Defining Crawl Depth & Scope for Enterprise Sites #

Pipeline stages overview #

Prerequisites & environment setup #

Step 1 — Initialization: boundary definition #

Step 2 — Core configuration: parameters & production values #

Step 3 — Execution & scheduling #

Step 4 — Artifact capture & storage #

Verification checklist #

Troubleshooting #

Failure 1: Exponential URL explosion — crawl runs indefinitely #

Failure 2: IP ban or origin throttling (HTTP 429 flood) #

Failure 3: Canonical chain inflation skews coverage metrics #

Failure 4: Staging URLs appear in production crawl output #

Failure 5: Flock guard not available in containerized environment #

Failure 6: p95 TTFB annotation missing for shallow URL tiers #

Post-crawl scope auditing #

Related #

Defining Crawl Depth & Scope for Enterprise Sites

Pipeline stages overview

Prerequisites & environment setup

Step 1 — Initialization: boundary definition

Step 2 — Core configuration: parameters & production values

Step 3 — Execution & scheduling

Step 4 — Artifact capture & storage

Verification checklist

Troubleshooting

Failure 1: Exponential URL explosion — crawl runs indefinitely

Failure 2: IP ban or origin throttling (HTTP 429 flood)

Failure 3: Canonical chain inflation skews coverage metrics

Failure 4: Staging URLs appear in production crawl output

Failure 5: Flock guard not available in containerized environment

Failure 6: p95 TTFB annotation missing for shallow URL tiers

Post-crawl scope auditing

Related