Automated Crawling & Pipeline Tooling

Automated crawl pipelines replace one-off manual audits with deterministic, repeatable workflows that SREs, SEO engineers, and agency teams run continuously against staging and production. Every execution produces an immutable telemetry record: response codes, render timings, canonical chains, and Core Web Vitals — all versioned, diffable, and ready to feed alert routing and health score dashboards. This guide describes a production-ready four-stage architecture from crawler initialisation through to remediation routing.

Pipeline Architecture at a Glance

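The pipeline runs in four stages, each covered by a section below: (1) infrastructure provisioning and crawler initialisation, (2) orchestration and concurrency controls, (3) data ingestion and artifact persistence, and (4) health scoring, alerting, and remediation routing. Each transition is a structured artifact hand-off: no stage starts until the previous one has written a validated output.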

Cross-Cutting Concerns Before You Start

Three properties must be baked into every stage — not bolted on at the end:

Idempotency. Every run gets a UUID SESSION_ID. Re-running the same session must produce the same outputs, not duplicate artifacts. Guard all upload and scoring operations with a session-ID existence check.
Environment parity. Staging and production crawls must use identical Docker images, browser binary versions, and robots.txt overrides. Divergent runtimes are the most common source of false negatives.
Data retention. Define lifecycle policies at the storage layer on day one. Raw NDJSON typically ages out after 90 days and the bulkier HAR files after 30; scored summary Parquet files are worth keeping for 12 months to support trend analysis and SLA reporting.

Phase 1 — Infrastructure Provisioning & Crawler Initialisation

Establish a deterministic, containerised environment that guarantees identical crawl baselines across every run. Dependency pinning (Docker image SHA, Node version, browser binary) and isolated network namespaces prevent cross-contamination between concurrent sessions.

When targeting modern SPAs or hydration-heavy architectures, run route discovery through a headless browser so the crawler captures client-side routing states, lazy-loaded resources, and dynamic DOM mutations before baseline indexing begins.

Infrastructure requirements

Concern	Production recommendation
Base image	`ghcr.io/puppeteer/puppeteer:24.9.0` (pinned SHA in CI)
Node version	22 LTS, specified via `.nvmrc`
Network	Isolated Docker bridge; proxy routing optional
`robots.txt`	Parse and honour before seed URL ingestion
Browser binaries	`PUPPETEER_SKIP_DOWNLOAD=true`, binary bundled in image

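# Dockerfile.crawler: pin every dependency; never use :latest
# glibc-based builder so native modules built here load in the Puppeteer runtime image
# (alpine uses musl, which can break prebuilt or compiled add-ons)
FROM node:22-bookworm-slim AS builder
WORKDIR /app
# Must be set before `npm ci`: skips the browser download at install time,
# because the runtime image already bundles its own binary
ENV PUPPETEER_SKIP_DOWNLOAD=true
COPY package*.json ./
RUN npm ci --omit=dev

FROM ghcr.io/puppeteer/puppeteer:24.9.0
WORKDIR /app
COPY --from=builder /app/node_modules ./node_modules
COPY src/ ./src/
COPY config/ ./config/

ENV NODE_ENV=production
# SESSION_ID injected at runtime by CI; not baked into the image
EXPOSE 8080

# ENTRYPOINT rather than CMD, so flags passed to `docker compose run crawler ...`
# are appended to the command instead of replacing it
ENTRYPOINT ["node", "src/crawler.js"]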

Common initialisation mistakes

Using :latest tags in the FROM line means crawls diverge silently across environments.
Bundling devDependencies into the image bloats it by 200–400 MB and introduces non-determinism.
Forgetting to parse robots.txt before the first request causes rule violations and potential IP bans on shared hosting.
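The robots.txt point above can be illustrated with a minimal sketch, here in Python using the standard library's `urllib.robotparser`. The function name `filter_seeds`, the user agent string `audit-crawler`, and the sample rules are assumptions made for illustration; a real crawler would fetch the origin's robots.txt instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser


def filter_seeds(robots_txt: str, seeds: list, user_agent: str = "audit-crawler") -> list:
    """Return only the seed URLs that robots.txt allows for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in seeds if parser.can_fetch(user_agent, url)]


# Hypothetical robots.txt content, inlined so the example needs no network access
ROBOTS = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""

seeds = [
    "https://example.com/",
    "https://example.com/admin/users",
    "https://example.com/blog/post-1",
]
allowed = filter_seeds(ROBOTS, seeds)

# The declared crawl delay can seed the rate limiter's initial ceiling
delay_parser = RobotFileParser()
delay_parser.parse(ROBOTS.splitlines())
crawl_delay = delay_parser.crawl_delay("audit-crawler")

print(allowed)
print(crawl_delay)
```

Filtering the seed list before the first request means disallowed paths never enter the URL frontier, rather than being rejected one request at a time.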

Rollback. If the container refuses to start, the most common cause is a Chrome/Node ABI mismatch. Pin the Puppeteer image SHA in .github/workflows/ and re-run; do not manually install Chrome inside a generic Node image.

Phase 2 — Pipeline Orchestration & Concurrency Controls

Orchestration enforces strict concurrency limits, exponential backoff, and idempotent execution triggers. Integrating custom crawlers with CI/CD pipelines standardises audit scheduling, environment variable injection, secrets management, and automated teardown across GitHub Actions and GitLab CI.

In parallel, crawl-budget management and rate limiting prevent origin server degradation, enforce polite crawl delays, and align discovery depth with server capacity. The two concerns are tightly coupled: the crawler should read the origin's X-RateLimit-Remaining header from each response and feed it to the rate limiter before the next batch.

Deterministic orchestration rules

Matrix builds across [staging, production] must run sequentially against the same origin — never in parallel.
Retry logic must include jitter (± 500 ms) to avoid thundering-herd reconnects on transient failures.
Circuit breakers should open after three consecutive 5xx responses and stay open for five minutes before a half-open probe.
All environment secrets (SEED_URL, S3_BUCKET, WEBHOOK_URL) must be injected at runtime — never baked into images or committed to VCS.
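The retry and circuit-breaker rules above could be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual implementation: the names `backoff_delay` and `CircuitBreaker`, the 1-second base delay, and the 60-second cap are hypothetical; the three-failure threshold, five-minute cooldown, and ±500 ms jitter come from the rules themselves.

```python
import random
import time
from typing import Callable, Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  jitter: float = 0.5) -> float:
    """Exponential backoff (base * 2**attempt, capped) plus or minus `jitter` seconds."""
    delay = min(cap, base * (2 ** attempt))
    return max(0.0, delay + random.uniform(-jitter, jitter))


class CircuitBreaker:
    """Opens after `threshold` consecutive 5xx responses; once `cooldown`
    seconds have passed, requests are allowed again as half-open probes."""

    def __init__(self, threshold: int = 3, cooldown: float = 300.0,
                 clock: Callable[[], float] = time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock                     # injectable for testing
        self.failures = 0
        self.opened_at: Optional[float] = None

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True                        # closed: traffic flows normally
        # open: block until the cooldown has elapsed, then permit a probe
        return self.clock() - self.opened_at >= self.cooldown

    def record(self, status: int) -> None:
        if status >= 500:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed probe
        else:
            self.failures = 0
            self.opened_at = None              # any non-5xx response closes the breaker
```

The crawler would call `allow_request()` before each fetch and `record(status)` after it. The clock is a constructor argument so the five-minute cooldown can be exercised in unit tests without sleeping.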

# .github/workflows/crawl-pipeline.yml
name: Automated Crawl Pipeline
on:
  push:
    branches: [main]
  schedule:
    - cron: '0 2 * * 1'   # Weekly Monday 02:00 UTC

jobs:
  crawl:
    runs-on: ubuntu-latest
    strategy:
      max-parallel: 1          # never hit the same origin concurrently
      matrix:
        env: [staging, production]
    env:
      # Job-level so every step sees both values. The env suffix gives each
      # matrix leg its own session, so the upload guard does not skip the second leg.
      TARGET_ENV: ${{ matrix.env }}
      SESSION_ID: ${{ github.run_id }}-${{ github.run_attempt }}-${{ matrix.env }}
    steps:
      - uses: actions/checkout@v4

      - name: Run crawler
        env:
          SEED_URL: ${{ secrets.SEED_URL }}
        run: |
          docker compose run --rm crawler \
            --env "$TARGET_ENV" \
            --seed "$SEED_URL" \
            --session-id "$SESSION_ID"

      - name: Upload telemetry artifacts
        env:
          S3_BUCKET: ${{ secrets.S3_BUCKET }}
        run: ./scripts/upload_artifacts.sh "$TARGET_ENV" "$SESSION_ID"

Token-bucket rate limiting decouples the crawler's request cadence from the upstream server's capacity. The implementation below refills tokens at a configurable rate RPS and caps outstanding tokens at capacity:

# rate_limiter.py — token-bucket with adaptive ceiling
import asyncio
import time

class TokenBucketLimiter:
    def __init__(self, rate: float, capacity: int):
        self.rate = rate          # tokens per second
        self.capacity = capacity  # maximum burst size
        self.tokens = float(capacity)
        self.last_refill = time.monotonic()

    async def acquire(self) -> None:
        while True:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            await asyncio.sleep(0.05)   # 50 ms poll; small enough to stay responsive

    def adapt(self, remaining: int, limit: int) -> None:
        """Lower rate when server signals low headroom via X-RateLimit headers."""
        ratio = remaining / max(limit, 1)
        if ratio < 0.2:
            self.rate = max(0.5, self.rate * 0.75)
        elif ratio > 0.8:
            self.rate = min(5.0, self.rate * 1.1)
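The `adapt()` method above expects integer `remaining` and `limit` values. One way to extract them from a response is sketched below. The helper name `parse_rate_headers` is hypothetical, and the `X-RateLimit-Limit` header is an assumption: it is a common convention alongside `X-RateLimit-Remaining`, but not every origin sends it.

```python
from typing import Mapping, Optional, Tuple


def parse_rate_headers(headers: Mapping[str, str]) -> Optional[Tuple[int, int]]:
    """Return (remaining, limit), or None when the headers are absent or malformed."""
    # HTTP header names are case-insensitive, so normalise before lookup
    lowered = {name.lower(): value for name, value in headers.items()}
    try:
        remaining = int(lowered["x-ratelimit-remaining"])
        limit = int(lowered["x-ratelimit-limit"])
    except (KeyError, ValueError):
        return None
    if limit <= 0 or remaining < 0:
        return None
    return remaining, limit


# Usage with the limiter above (not executed here):
#   parsed = parse_rate_headers(response_headers)
#   if parsed is not None:
#       limiter.adapt(*parsed)
```

Returning `None` for missing or malformed headers leaves the current rate unchanged, so an origin that sends no rate-limit headers does not alter the limiter.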

Verification steps

Confirm the CI workflow run log shows SESSION_ID printed at job start; the idempotency guards key on this value, so a missing or empty ID means they cannot work.
Tail the crawler container logs; confirm 429 responses trigger the backoff path, not an immediate retry.
Check the CI artifact store for the rate_stats.json file — it should show adapt() calls if the server issued any rate-limit headers.

Phase 3 — Data Ingestion & Artifact Persistence

Execution pipelines normalise raw HTTP responses into structured telemetry streams. All raw logs, HAR files, response headers, and extracted metadata must be stored and versioned in cloud storage as immutable objects to enable forensic diffing, regression tracking, and SLA compliance auditing.

Environment variable injection

Every upload script reads configuration from the environment — never from hardcoded values:

Variable	Purpose	Example value
`S3_BUCKET`	Destination bucket URI	`s3://crawl-artifacts-prod`
`SESSION_ID`	Run correlation key	`12345678-1`
`TARGET_ENV`	Target environment tag (passed to the upload script as its first argument)	`production`
`STORAGE_CLASS`	AWS storage tier	`INTELLIGENT_TIERING`

Idempotency guard. Before uploading, check whether $S3_BUCKET/$SESSION_ID/manifest.json already exists. If it does, the session has already been persisted — skip the upload and exit 0. This prevents duplicate artifacts when CI retries a failed step.

#!/usr/bin/env bash
# scripts/upload_artifacts.sh
set -euo pipefail

TARGET_ENV="${1:?TARGET_ENV required}"
SESSION_ID="${2:?SESSION_ID required}"
BUCKET="${S3_BUCKET:?S3_BUCKET env var not set}"
OUTPUT_DIR="./output"
MANIFEST="$BUCKET/$SESSION_ID/manifest.json"

# Idempotency guard — skip if this session was already uploaded
if aws s3 ls "$MANIFEST" &>/dev/null; then
  echo "Session $SESSION_ID already uploaded. Skipping." && exit 0
fi

find "$OUTPUT_DIR" -type f \( -name "*.json" -o -name "*.ndjson" -o -name "*.har" \) \
| while IFS= read -r file; do
    SHA=$(sha256sum "$file" | awk '{print $1}')
    KEY="$SESSION_ID/$(basename "$file")"
    aws s3 cp "$file" "$BUCKET/$KEY" \
      --metadata "sha256=$SHA,session=$SESSION_ID,env=$TARGET_ENV" \
      --storage-class "${STORAGE_CLASS:-INTELLIGENT_TIERING}"
    echo "  uploaded: $KEY (sha256=$SHA)"
  done

# Write manifest last — its presence is the idempotency sentinel
echo "{\"session\":\"$SESSION_ID\",\"env\":\"$TARGET_ENV\",\"uploaded_at\":\"$(date -u +%FT%TZ)\"}" \
  | aws s3 cp - "$MANIFEST"

echo "Artifacts uploaded. Session: $SESSION_ID"

Output format guidance. NDJSON (one JSON object per line) is the best default for raw crawl records: it streams without buffering, is trivially compressed with gzip, and parses without a full-file read. Convert NDJSON to Parquet in a post-processing step when you need columnar queries, for example to normalise performance data across device types or to run trend analysis.
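Because each NDJSON line is an independent record, two sessions can be compared URL by URL. The sketch below hashes each field's canonical JSON encoding with SHA-256 and reports which fields changed. The function names are hypothetical, the sample records are invented, and the field names (`url`, `status`, `lcp_ms`) follow the record shape that scorer.py reads.

```python
import hashlib
import io
import json
from typing import Dict, Iterable, List


def field_checksums(record: dict) -> Dict[str, str]:
    """SHA-256 of each field's canonical JSON encoding; url is the key, so it is skipped."""
    return {
        key: hashlib.sha256(
            json.dumps(value, sort_keys=True, separators=(",", ":")).encode("utf-8")
        ).hexdigest()
        for key, value in record.items()
        if key != "url"
    }


def index_session(stream: Iterable[str]) -> Dict[str, Dict[str, str]]:
    """Read NDJSON one line at a time and map url -> per-field checksums."""
    index: Dict[str, Dict[str, str]] = {}
    for line in stream:
        if line.strip():
            record = json.loads(line)
            index[record["url"]] = field_checksums(record)
    return index


def diff_sessions(old: Dict[str, Dict[str, str]],
                  new: Dict[str, Dict[str, str]]) -> Dict[str, List[str]]:
    """Return {url: sorted changed field names} for URLs present in both sessions."""
    changes: Dict[str, List[str]] = {}
    for url in sorted(old.keys() & new.keys()):
        fields = sorted(
            key for key in old[url].keys() | new[url].keys()
            if old[url].get(key) != new[url].get(key)
        )
        if fields:
            changes[url] = fields
    return changes


# Invented sample sessions; real input would be the NDJSON files from two crawl runs
BASELINE = (
    '{"url":"https://example.com/","status":200,"lcp_ms":1800}\n'
    '{"url":"https://example.com/a","status":200,"lcp_ms":2100}\n'
)
CURRENT = (
    '{"url":"https://example.com/","status":200,"lcp_ms":1800}\n'
    '{"url":"https://example.com/a","status":503,"lcp_ms":null}\n'
)

old = index_session(io.StringIO(BASELINE))
new = index_session(io.StringIO(CURRENT))
print(diff_sessions(old, new))
```

Sorting keys and fixing the separators before hashing keeps the checksums stable regardless of key order or whitespace in the source file, which is what makes the diff deterministic.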

Phase 4 — Health Scoring, Alerting & Remediation Routing

Post-execution validation enforces schema compliance, data integrity checks, and weighted health scoring against historical baselines. Threshold violations trigger automated routing to incident management. Structured violation reports feed directly into remediation playbooks, closing the audit lifecycle from raw telemetry to actioned fix.

Scoring Architecture

The scoring stage consumes the NDJSON output from Phase 3 and applies configurable weights to each signal class:

# scorer.py: weighted health score for a single crawl session
from __future__ import annotations
import json
import sys
from dataclasses import dataclass
from pathlib import Path

WEIGHTS: dict[str, float] = {
    "status_4xx":       0.30,
    "status_5xx":       0.40,
    "render_blocking":  0.15,
    "canonical_conflict": 0.10,
    "lcp_over_2500ms":  0.05,
}

@dataclass
class CrawlRecord:
    url: str
    status: int
    lcp_ms: int | None
    render_blocking_count: int
    canonical_mismatch: bool

    def penalty(self) -> float:
        score = 0.0
        if 400 <= self.status < 500:
            score += WEIGHTS["status_4xx"]
        if self.status >= 500:
            score += WEIGHTS["status_5xx"]
        if self.render_blocking_count > 2:
            score += WEIGHTS["render_blocking"]
        if self.canonical_mismatch:
            score += WEIGHTS["canonical_conflict"]
        if self.lcp_ms and self.lcp_ms > 2500:
            score += WEIGHTS["lcp_over_2500ms"]
        return score

def load_session(path: Path) -> list[CrawlRecord]:
    records = []
    with path.open() as fh:
        for line in fh:
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            obj = json.loads(line)
            records.append(CrawlRecord(
                url=obj["url"],
                status=obj["status"],
                lcp_ms=obj.get("lcp_ms"),
                render_blocking_count=obj.get("render_blocking_count", 0),
                canonical_mismatch=obj.get("canonical_mismatch", False),
            ))
    return records

if __name__ == "__main__":
    session_file = Path(sys.argv[1])
    records = load_session(session_file)
    total_penalty = sum(r.penalty() for r in records)
    avg_score = 1.0 - (total_penalty / max(len(records), 1))
    print(json.dumps({"session": session_file.stem, "health_score": round(avg_score, 4), "pages": len(records)}))

Alerting & Threshold Configuration

Route violations at the right severity level from day one, then tighten the thresholds progressively as you calibrate them for different site sections.

Violation type	Warning threshold	Critical threshold	Default routing
Health score drop	−3 points	−8 points	Slack `#seo-alerts`
5xx rate	> 0.5 %	> 2 %	PagerDuty (P2)
Render-blocking resources	> 3 per page	> 6 per page	Jira backlog
Canonical conflicts	> 1 % of URLs	> 5 % of URLs	Slack `#tech-seo`
LCP regression	> 10 % increase	> 25 % increase	Jira sprint

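#!/usr/bin/env bash
# scripts/alert_router.sh: send webhook on threshold breach
set -euo pipefail

: "${S3_BUCKET:?S3_BUCKET env var not set}"   # bucket URI, e.g. s3://crawl-artifacts-prod
: "${WEBHOOK_URL:?WEBHOOK_URL env var not set}"
: "${SESSION_ID:?SESSION_ID env var not set}"

SCORE=$(jq -r '.health_score' ./output/score.json)
# No previous baseline (first run): fall back to the current score, giving a delta of 0
PREV_SCORE=$(aws s3 cp "$S3_BUCKET/latest/score.json" - 2>/dev/null | jq -r '.health_score' || echo "$SCORE")

# health_score is on a 0-1 scale; express the change in points out of 100
DELTA=$(python3 -c 'import sys; print(round((float(sys.argv[1]) - float(sys.argv[2])) * 100, 2))' "$SCORE" "$PREV_SCORE")

below() {  # succeeds when DELTA is strictly less than the threshold in $1
  python3 -c 'import sys; sys.exit(0 if float(sys.argv[1]) < float(sys.argv[2]) else 1)' "$DELTA" "$1"
}

if below -8; then
  curl -s -X POST "$WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{\"text\":\"CRITICAL: Health score changed by ${DELTA} points (session: $SESSION_ID)\"}"
  exit 1   # fail the CI job on a critical drop
elif below -3; then
  curl -s -X POST "$WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "{\"text\":\"WARNING: Health score changed by ${DELTA} points (session: $SESSION_ID)\"}"
fi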
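The threshold comparison in the script above can also be written as a pure function, which is easier to unit-test than shell. The sketch below is illustrative: the function name `classify_delta` is hypothetical, and the default thresholds mirror the health-score row of the table (warning at a 3-point drop, critical at an 8-point drop, both strict).

```python
def classify_delta(previous: float, current: float,
                   warning: float = -3.0, critical: float = -8.0) -> str:
    """Map a health-score change (scores on a 0-1 scale) to a severity label."""
    # Convert to points out of 100, rounded the same way as the shell script
    delta_points = round((current - previous) * 100, 2)
    if delta_points < critical:
        return "critical"
    if delta_points < warning:
        return "warning"
    return "ok"


print(classify_delta(0.95, 0.86))  # 9-point drop
print(classify_delta(0.95, 0.91))  # 4-point drop
print(classify_delta(0.95, 0.93))  # 2-point drop
```

Rounding before comparing avoids floating-point noise pushing a drop of exactly 8 points over the critical threshold.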

Cross-Cutting Concerns

Data Retention

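Define bucket lifecycle policies before the first crawl, not after you've accumulated weeks of unbounded data. The rules below match on key prefixes (raw/, har/, scores/), so they only take effect if artifacts are written under those prefixes; the upload script shown earlier writes everything under $SESSION_ID/ and would need its key layout adjusted to match.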

{
  "Rules": [
    {
      "ID": "expire-raw-ndjson",
      "Status": "Enabled",
      "Filter": { "Prefix": "raw/" },
      "Expiration": { "Days": 90 }
    },
    {
      "ID": "expire-har-files",
      "Status": "Enabled",
      "Filter": { "Prefix": "har/" },
      "Expiration": { "Days": 30 }
    },
    {
      "ID": "retain-score-summaries",
      "Status": "Enabled",
      "Filter": { "Prefix": "scores/" },
      "Expiration": { "Days": 365 }
    }
  ]
}

Version Control for Pipeline Configuration

Keep all YAML, Dockerfile, and scoring configuration in the same repository as the application code. Use Git tags to correlate pipeline-config versions with the crawl sessions they produced — the SESSION_ID should include the short Git SHA when run from CI:

SESSION_ID="${GITHUB_RUN_ID}-$(git rev-parse --short HEAD)"

This makes it trivial to reproduce any historical crawl by checking out the tagged commit and re-running against the same versioned Docker image.

Environment Parity Checklist

Before promoting a pipeline change from staging to production:

Docker image SHA is identical in both environments.
robots.txt override rules are consistent (or absent).
Rate limit ceilings reflect the production origin's capacity, not the staging CDN's.
Scoring weights and alert thresholds are sourced from the same config file — not duplicated.
Cloud storage lifecycle policies are applied to the production bucket.

Failure Modes & Rollback

Failure pattern	Symptom	Recovery command
Chrome/Node ABI mismatch	Container exits with `Error: libnss3.so not found`	`docker pull ghcr.io/puppeteer/puppeteer:24.9.0@sha256:<pinned>` and re-run
Duplicate session artifacts	Score delta shows impossible swings between runs	`aws s3 rm "$S3_BUCKET/$SESSION_ID/" --recursive` (`S3_BUCKET` already includes the `s3://` scheme), then re-run with idempotency guard enabled
Rate limiter bypassed	Origin returns sustained 429s; crawl stalls	Lower `rate` to 0.5 RPS: `export CRAWL_RATE=0.5` then restart the crawler container
CI workflow secrets not injected	Crawler exits with `SEED_URL env var not set`	Verify the repository secret is set under Settings → Secrets → Actions; confirm the env block references the correct secret name
Headless browser fails on SPA	JS-rendered routes missing from output NDJSON	Wait for the network to settle before extracting links: pass `waitUntil: 'networkidle2'` in the Puppeteer `page.goto()` options
Artifact upload fails mid-session	Partial session in S3; score calculated on incomplete data	Delete the partial session prefix, fix the upload script, re-run the upload step with the same `SESSION_ID`
Health score schema mismatch	`scorer.py` raises `KeyError` on new field names	Pin the NDJSON schema version in the scoring config; add backward-compatible defaults for new fields before the next deploy

FAQ

How often should automated crawls run?

A weekly full crawl with daily incremental checks against high-priority URL sets is the baseline. Sites with frequent deployments should also trigger a crawl on every merge to the main branch via the CI/CD integration.

What is a safe request rate for most production origins?

Start at 1–2 RPS with polite delays and raise the ceiling only after confirming the origin's X-RateLimit-Remaining headers indicate headroom. Adaptive token-bucket limiters are more reliable than static sleep calls because they respond to real-time server feedback.

Which artifact formats work best for regression diffing?

NDJSON with per-field SHA-256 checksums enables deterministic diffs across crawl sessions. HAR files are useful for render-side forensics but are too large for bulk retention without prefix filtering.

Related

Configuring Headless Browsers for JS-Heavy Sites — handle SPAs, hydration boundaries, and lazy-loaded resources
Integrating Custom Crawlers with CI/CD Pipelines — GitHub Actions and GitLab CI workflow patterns
Managing Crawl Budget & Rate Limiting — token-bucket strategies and adaptive concurrency
Storing & Versioning Crawl Artifacts in Cloud Storage — immutable artifact persistence and lifecycle policies
Orchestrating Distributed Crawls Across Workers — shared URL frontier, consistent-hash sharding, and distributed dedup
Metric Scoring & Data Normalisation — converting raw telemetry into actionable health scores
