Configuring Headless Browsers for JS-Heavy Sites

Without a JavaScript execution environment, traditional HTTP crawlers return the bare HTML shell of a React, Vue, or Angular application — missing all rendered content, injected meta tags, and lazy-loaded links. SREs and SEO engineers who rely on those empty shells produce audits that miss real user-facing defects: broken client-side routes, deferred LCP images that never appear in the snapshot, and WCAG violations injected by hydration. This page is part of the Automated Crawling & Pipeline Tooling section and covers the full workflow for deploying containerized headless browsers against JS-heavy sites, from engine selection through artifact storage.

Pipeline Overview

The diagram below shows the five stages this workflow moves through from raw URL to stored artifact.

Prerequisites & Environment Setup

Before launching any browser automation, pin every dependency so CI builds are reproducible. Floating versions cause phantom metric variance when Chromium receives a silent upstream update.

| Dependency | Pinned version | Purpose |
| --- | --- | --- |
| `@playwright/test` | `1.52.0` | Browser automation API + test runner |
| `node` | `22.16.0-alpine` | Runtime; alpine reduces image size by ~200 MB |
| `playwright` Docker image | `mcr.microsoft.com/playwright:v1.52.0-noble` | Pre-installed Chromium + system deps |
| `polars` | `0.20.31` | Metric normalization and DataFrame serialization |
| `python` | `3.12` | Orchestration scripts and artifact post-processing |

Required environment variables:

export CRAWLER_BASE_URL="https://target-site.com"
export CRAWL_CONCURRENCY=3           # parallel browser contexts
export CRAWL_TIMEOUT_MS=15000        # per-page navigation ceiling
export ARTIFACTS_DIR="/var/crawl-artifacts"
export CRAWL_RUN_ID="$(date -u +%Y%m%dT%H%M%SZ)-$$"  # unique per-run ID

Lock these into a .env.crawler file committed to the repository (with secrets injected at runtime from your secrets manager, not baked into the file).
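
Failing fast on a missing or malformed variable is cheaper than discovering it halfway through a crawl. The sketch below is one way to validate the variables listed above at startup; the `CrawlConfig` shape and the `parseCrawlConfig` helper are hypothetical names introduced for illustration, not part of Playwright or of this workflow's codebase.

```typescript
// Hypothetical config loader; field names mirror the environment variables above.
interface CrawlConfig {
  baseUrl: string;
  concurrency: number;
  timeoutMs: number;
  artifactsDir: string;
  runId: string;
}

type Env = Record<string, string | undefined>;

// Return the trimmed value of a variable, or throw if it is unset or blank.
function requireVar(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value.trim();
}

// Parse a variable that must hold a positive whole number.
function requirePositiveInt(env: Env, name: string): number {
  const raw = requireVar(env, name);
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return parsed;
}

function parseCrawlConfig(env: Env): CrawlConfig {
  const baseUrl = requireVar(env, 'CRAWLER_BASE_URL');
  new URL(baseUrl); // throws on a malformed URL
  return {
    baseUrl,
    concurrency: requirePositiveInt(env, 'CRAWL_CONCURRENCY'),
    timeoutMs: requirePositiveInt(env, 'CRAWL_TIMEOUT_MS'),
    artifactsDir: requireVar(env, 'ARTIFACTS_DIR'),
    runId: requireVar(env, 'CRAWL_RUN_ID'),
  };
}
```

In a Node entry point this would typically be called once as `parseCrawlConfig(process.env)`; taking the environment as a parameter keeps the function testable with a plain object.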

Step 1 — Container Provisioning & Browser Initialization

Build a minimal, non-root container that isolates the browser runtime from the host. The official Playwright image bundles all required system libraries (libnss3, libdrm2, fonts-liberation), so start from it rather than installing dependencies ad hoc.

# syntax=docker/dockerfile:1
FROM mcr.microsoft.com/playwright:v1.52.0-noble AS base

# Harden: drop root privileges before any browser process starts
RUN groupadd -r crawler && useradd -r -g crawler --create-home crawler

WORKDIR /app
COPY --chown=crawler:crawler package*.json ./
RUN npm ci --omit=dev --ignore-scripts

# Copy application source after dep install to maximize layer caching
COPY --chown=crawler:crawler . .

USER crawler
CMD ["node", "crawl.js"]

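Pass --no-sandbox only when the container already provides OS-level process isolation. The flag turns off Chromium's own sandbox, so on a bare-metal host a compromised renderer runs with the full privileges of the crawler user, which is a real privilege-escalation vector. --disable-gpu and --disable-dev-shm-usage are stability flags for containers without a GPU or with a small /dev/shm; they are grouped here only because they are likewise container-specific.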

// crawl.ts — browser factory with safe flag handling
import { chromium, Browser, BrowserContext } from 'playwright';

const IS_CONTAINER = process.env.CONTAINER === '1';

export async function launchBrowser(): Promise<Browser> {
  return chromium.launch({
    args: IS_CONTAINER
      ? ['--no-sandbox', '--disable-gpu', '--disable-dev-shm-usage']
      : [],
    timeout: 30_000,
  });
}

export async function newAuditContext(browser: Browser): Promise<BrowserContext> {
  return browser.newContext({
    userAgent: 'SiteHealthAuditBot/1.0 (+https://site-health-audit.com/bot)',
    viewport: { width: 1280, height: 800 },
    ignoreHTTPSErrors: false,   // always validate TLS in production
    javaScriptEnabled: true,
  });
}

Step 2 — Core Configuration: Network Interception & Resource Filtering

Third-party scripts — analytics beacons, ad tags, consent management overlays — add hundreds of milliseconds to page load and skew every LCP reading you collect. Intercept them at the page.route() layer before navigation begins. This aligns with managing crawl budget and rate limiting best practices: fewer bytes in flight means faster renders and lower origin load.

Network filter parameters

| Parameter | Type | Default | Purpose |
| --- | --- | --- | --- |
| `blockRegex` | `RegExp` | `/(analytics\|ads\|tracking\|cdn\.media)/i` | Abort matching third-party URLs |
| `stubResourceTypes` | `string[]` | `['image', 'font']` | Fulfill with 204 to skip download |
| `allowList` | `string[]` | `[]` | Domain prefixes that bypass all filters |
| `maxResponseBytes` | `number` | `5_242_880` (5 MB) | Abort responses exceeding this size |

// network-filter.ts
import { Page } from 'playwright';

const BLOCK_REGEX = /(analytics|ads|tracking|cdn\.media|doubleclick|facebook\.net)/i;
const STUB_TYPES = new Set(['image', 'font', 'media']);
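// ALLOW_DOMAINS entries are matched against the start of the full request URL,
// so each one must include the scheme, e.g. "https://cdn.example.com"
const ALLOW_LIST = (process.env.ALLOW_DOMAINS ?? '').split(',').map(d => d.trim()).filter(Boolean);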

export async function applyNetworkFilter(page: Page): Promise<void> {
  await page.route('**/*', async (route, request) => {
    const url = request.url();

    // Always pass requests on the explicit allow list
    if (ALLOW_LIST.some(d => url.startsWith(d))) {
      return route.continue();
    }

    // Abort tracking / ad traffic entirely
    if (BLOCK_REGEX.test(url)) {
      return route.abort('blockedbyclient');
    }

    // Stub heavy resource types with an empty 204 to unblock hydration
    if (STUB_TYPES.has(request.resourceType())) {
      return route.fulfill({ status: 204, body: '' });
    }

    return route.continue();
  });
}
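
The ordering of the three rules (allow list first, then block, then stub) is easier to verify when the decision is separated from the Playwright route handler. The following is a minimal sketch of that idea, not the filter used above: `classifyRequest`, `FilterOptions`, and the hostname-based `allowHosts` list are assumptions made for illustration.

```typescript
// Hypothetical pure decision function; a route handler would map the result
// to route.continue(), route.abort(), or route.fulfill().
type FilterAction = 'continue' | 'abort' | 'stub';

interface FilterOptions {
  blockRegex: RegExp;                      // must not use the "g" flag (test() would become stateful)
  stubResourceTypes: ReadonlySet<string>;
  allowHosts: readonly string[];           // bare hostnames, e.g. "cdn.example.com" (assumed format)
}

function classifyRequest(url: string, resourceType: string, opts: FilterOptions): FilterAction {
  let host = '';
  try {
    host = new URL(url).hostname;
  } catch {
    // Unparseable URL: leave host empty so it can never match the allow list
  }

  // 1. Allow list wins over every other rule (exact host or any subdomain)
  if (opts.allowHosts.some(h => host === h || host.endsWith(`.${h}`))) {
    return 'continue';
  }
  // 2. Tracking and ad traffic is aborted outright
  if (opts.blockRegex.test(url)) {
    return 'abort';
  }
  // 3. Heavy resource types are stubbed instead of downloaded
  if (opts.stubResourceTypes.has(resourceType)) {
    return 'stub';
  }
  return 'continue';
}
```

A pure function like this also makes over-broad patterns visible: with the block pattern shown earlier, `https://target-site.com/downloads/app.js` is classified as `abort`, because "downloads" contains "ads". Anchoring such alternatives to hostnames or path segments avoids that kind of accidental match.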

Step 3 — Execution & Scheduling: DOM Hydration Monitoring

The most common source of invalid snapshots in SPA auditing is capturing the DOM before client-side rendering is complete. waitUntil: 'networkidle' is a reasonable first gate but insufficient on its own: a SPA can still be showing a loading spinner after all network activity has settled. Layer three explicit checks:

networkidle — no in-flight requests for 500 ms
document.readyState === 'complete'
A target-specific sentinel element is present (e.g. #app-root, [data-app-loaded])

Intercept history.pushState and history.replaceState to detect client-side route changes and re-trigger the wait sequence after each navigation — critical for auditing multi-page SPAs without missing pages.

// hydration-guard.ts
import { Page } from 'playwright';

export async function waitForHydration(
  page: Page,
  sentinelSelector: string = '#app-root',
  timeoutMs: number = 15_000
): Promise<void> {
  // Gate 1: network idle
  await page.waitForLoadState('networkidle', { timeout: timeoutMs });

  // Gate 2 + 3: readyState and sentinel element
  await page.waitForFunction(
    (selector) => {
      return (
        document.readyState === 'complete' &&
        document.querySelector(selector) !== null
      );
    },
    sentinelSelector,
    { timeout: timeoutMs }
  );
}

export async function interceptSPANavigation(page: Page): Promise<string[]> {
  const visitedRoutes: string[] = [];

  await page.exposeFunction('__auditTrackRoute', (url: string) => {
    visitedRoutes.push(url);
  });

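  await page.addInitScript(() => {
    // Patch both history methods so every client-side route change is recorded
    for (const method of ['pushState', 'replaceState'] as const) {
      const orig = history[method].bind(history);
      history[method] = function (...args: Parameters<History['pushState']>) {
        orig(...args);
        (window as any).__auditTrackRoute(location.href);
      };
    }
  });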

  return visitedRoutes;
}

Schedule crawl runs using cron with a flock guard to prevent overlapping executions on the same host:

#!/usr/bin/env bash
# /opt/crawlers/run-headless.sh
set -euo pipefail

LOCK_FILE="/var/lock/headless-crawler.lock"
LOG_DIR="/var/log/crawlers"
ARTIFACTS_DIR="/var/crawl-artifacts"
RUN_ID="$(date -u +%Y%m%dT%H%M%SZ)-$$"

mkdir -p "${LOG_DIR}" "${ARTIFACTS_DIR}/${RUN_ID}"

exec 9>"${LOCK_FILE}"
flock -n 9 || { echo "[$(date -u +%FT%TZ)] Already running, exiting." >&2; exit 0; }

docker run --rm \
  --env CONTAINER=1 \
  --env CRAWL_RUN_ID="${RUN_ID}" \
  --env CRAWLER_BASE_URL="${CRAWLER_BASE_URL}" \
  --env CRAWL_CONCURRENCY=3 \
  --memory=2g \
  --cpus=1.5 \
  --security-opt no-new-privileges \
  -v "${ARTIFACTS_DIR}/${RUN_ID}:/app/artifacts" \
  ghcr.io/team/headless-crawler:1.52.0 \
  node crawl.js --output /app/artifacts \
  2>&1 | tee -a "${LOG_DIR}/crawler-${RUN_ID}.log"

Add to crontab (02:30 daily; cron evaluates the schedule in the host's time zone, so this is 02:30 UTC only on a host set to UTC):

30 2 * * * /opt/crawlers/run-headless.sh >> /var/log/crawlers/cron.log 2>&1

For CI/CD pipeline integration, use a matrix strategy to run desktop, mobile, and tablet viewport profiles in parallel:

# .github/workflows/headless-crawl.yml
name: Headless Crawl Pipeline
on:
  schedule:
    - cron: '30 2 * * *'
  workflow_dispatch:
    inputs:
      base_url:
        description: "Target URL override"
        required: false

env:
  CRAWLER_BASE_URL: ${{ inputs.base_url || 'https://target-site.com' }}
  CRAWL_TIMEOUT_MS: "15000"

jobs:
  crawl:
    runs-on: ubuntu-latest
    container:
      image: mcr.microsoft.com/playwright:v1.52.0-noble
      options: --user 1001   # runner UID; the stock Playwright image has no "crawler" account
    strategy:
      matrix:
        viewport: [desktop, mobile, tablet]
      fail-fast: false
    steps:
      - uses: actions/checkout@v4
      - run: npm ci --omit=dev
      - run: |
          node crawl.js \
            --viewport=${{ matrix.viewport }} \
            --output=artifacts/${{ matrix.viewport }} \
            --run-id=${{ github.run_id }}-${{ matrix.viewport }}
      - uses: actions/upload-artifact@v4
        with:
          name: crawl-${{ matrix.viewport }}-${{ github.run_id }}
          path: artifacts/${{ matrix.viewport }}/
          retention-days: 30
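
The workflow passes `--viewport=desktop|mobile|tablet` to the crawler, which then has to translate the name into context options. A minimal sketch of that lookup follows. The `viewport`, `isMobile`, and `deviceScaleFactor` fields correspond to Playwright's `browser.newContext()` options, but the `VIEWPORT_PROFILES` table, the `resolveViewport` helper, and the tablet and mobile dimensions are illustrative assumptions; only the 1280×800 desktop size comes from the browser factory shown earlier.

```typescript
// Hypothetical profile table; dimensions other than desktop are placeholders.
interface ViewportProfile {
  viewport: { width: number; height: number };
  isMobile: boolean;
  deviceScaleFactor: number;
}

const VIEWPORT_PROFILES: Record<string, ViewportProfile> = {
  desktop: { viewport: { width: 1280, height: 800 }, isMobile: false, deviceScaleFactor: 1 },
  tablet: { viewport: { width: 768, height: 1024 }, isMobile: true, deviceScaleFactor: 2 },
  mobile: { viewport: { width: 390, height: 844 }, isMobile: true, deviceScaleFactor: 3 },
};

// Resolve a profile by name and reject anything not explicitly defined,
// so a typo in the CI matrix fails the job instead of silently using a default.
function resolveViewport(name: string): ViewportProfile {
  if (!Object.prototype.hasOwnProperty.call(VIEWPORT_PROFILES, name)) {
    const known = Object.keys(VIEWPORT_PROFILES).join(', ');
    throw new Error(`Unknown viewport profile "${name}" (expected one of: ${known})`);
  }
  return VIEWPORT_PROFILES[name];
}
```

The resolved profile can be spread into the context options, for example `browser.newContext({ ...resolveViewport('mobile'), userAgent })`. Throwing on unknown names is a deliberate choice here: with `fail-fast: false`, a misconfigured matrix entry then shows up as one failed job rather than as a mislabeled artifact.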

Step 4 — Artifact Capture & Metric Normalization

After hydration is confirmed, extract Web Vitals via PerformanceObserver, serialize the rendered DOM, and package everything into a normalized JSON schema before storing versioned crawl artifacts in cloud storage.

// metrics-extractor.ts
import { Page } from 'playwright';
import * as crypto from 'crypto';

export interface CrawlMetrics {
  url: string;
  lcp_ms: number;
  cls: number;
  inp_ms: number;
  ttfb_ms: number;
  dom_hash: string;
  captured_at: string;
}

export async function extractMetrics(page: Page, url: string): Promise<CrawlMetrics> {
  const vitals = await page.evaluate((): Promise<{lcp: number; cls: number; inp: number; ttfb: number}> => {
    return new Promise((resolve) => {
      const result = { lcp: 0, cls: 0, inp: 0, ttfb: 0 };

      const nav = performance.getEntriesByType('navigation')[0] as PerformanceNavigationTiming;
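      // TTFB is conventionally measured from navigation start, so it includes redirect, DNS, and connect time
      result.ttfb = nav ? nav.responseStart - nav.startTime : 0;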

      new PerformanceObserver((list) => {
        for (const entry of list.getEntries()) {
          result.lcp = entry.startTime;
        }
      }).observe({ type: 'largest-contentful-paint', buffered: true });

      new PerformanceObserver((list) => {
        for (const entry of list.getEntries() as any[]) {
          // Shifts that immediately follow user input do not count toward CLS
          if (!entry.hadRecentInput) result.cls += entry.value ?? 0;
        }
      }).observe({ type: 'layout-shift', buffered: true });

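      new PerformanceObserver((list) => {
        for (const entry of list.getEntries() as any[]) {
          // INP is based on the full event duration (input delay + processing + next paint),
          // not on input delay alone; it stays 0 when the crawl triggers no interactions
          if (entry.interactionId) result.inp = Math.max(result.inp, entry.duration);
        }
      }).observe({ type: 'event', buffered: true, durationThreshold: 16 } as PerformanceObserverInit);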

      // Resolve after a brief settle window
      setTimeout(() => resolve(result), 500);
    });
  });

  const domSnapshot = await page.evaluate(
    () => new XMLSerializer().serializeToString(document.documentElement)
  );
  const domHash = crypto.createHash('sha256').update(domSnapshot).digest('hex');

  return {
    url,
    lcp_ms: Math.round(vitals.lcp),
    cls: parseFloat(vitals.cls.toFixed(4)),
    inp_ms: Math.round(vitals.inp),
    ttfb_ms: Math.round(vitals.ttfb),
    dom_hash: domHash,
    captured_at: new Date().toISOString(),
  };
}

Normalize a batch of payloads into a Parquet-compatible DataFrame for downstream analytics. Cross-reference rendered DOM hashes using Automating Screaming Frog with Python Scripts to detect structural drift between runs.

#!/usr/bin/env python3
# normalize_metrics.py — convert raw JSON payloads to Parquet
import json
import sys
from pathlib import Path
import polars as pl

SCHEMA = {
    "url": pl.Utf8,
    "lcp_ms": pl.Int32,
    "cls": pl.Float64,
    "inp_ms": pl.Int32,
    "ttfb_ms": pl.Int32,
    "dom_hash": pl.Utf8,
    "captured_at": pl.Utf8,
}

def normalize(artifacts_dir: Path, run_id: str) -> None:
    records = []
    for f in artifacts_dir.glob("*.json"):
        with f.open() as fh:
            records.append(json.load(fh))

    if not records:
        print(f"No JSON artifacts found in {artifacts_dir}", file=sys.stderr)
        sys.exit(1)

    df = pl.DataFrame(records, schema=SCHEMA)
    out = artifacts_dir / f"metrics-{run_id}.parquet"
    df.write_parquet(out, compression="zstd")
    print(f"Wrote {len(df)} rows to {out}")

if __name__ == "__main__":
    normalize(Path(sys.argv[1]), sys.argv[2])
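
The structural-drift check mentioned above comes down to comparing `dom_hash` per URL between two runs. The sketch below shows that comparison in isolation; `detectDrift`, `HashRecord`, and `DriftReport` are hypothetical names, and the input is assumed to be the `url` and `dom_hash` fields of the JSON payloads from a previous and a current run.

```typescript
// Hypothetical drift comparison between two crawl runs.
interface HashRecord {
  url: string;
  dom_hash: string;
}

interface DriftReport {
  changed: string[]; // present in both runs, hash differs
  added: string[];   // only in the current run
  removed: string[]; // only in the previous run
}

function detectDrift(previous: HashRecord[], current: HashRecord[]): DriftReport {
  const prev = new Map<string, string>();
  previous.forEach(r => prev.set(r.url, r.dom_hash));
  const curr = new Map<string, string>();
  current.forEach(r => curr.set(r.url, r.dom_hash));

  const report: DriftReport = { changed: [], added: [], removed: [] };

  curr.forEach((hash, url) => {
    const before = prev.get(url);
    if (before === undefined) {
      report.added.push(url);
    } else if (before !== hash) {
      report.changed.push(url);
    }
  });

  prev.forEach((_hash, url) => {
    if (!curr.has(url)) report.removed.push(url);
  });

  return report;
}
```

Because the hash covers the entire serialized document, any per-request value in the markup (timestamps, CSP nonces, session tokens) will register as a change. If that produces too much noise, stripping such attributes before hashing is a reasonable refinement.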

Verification Checklist

Run these checks after every crawl execution to confirm the workflow produced valid output:

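Container exited cleanly — the wrapper script exits with status 0 (check echo $? immediately after it runs). The container is started with --rm, so it no longer exists once it stops and docker inspect cannot read its exit code; set -o pipefail in the script is what carries the docker run status through the tee pipeline.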
Artifact files exist — ls -lh "${ARTIFACTS_DIR}/${RUN_ID}/" shows at least one .json and one .parquet file.
No zero-byte artifacts — find "${ARTIFACTS_DIR}/${RUN_ID}" -empty -type f returns nothing.
DOM hash is not null — jq '.dom_hash | length' artifacts/*.json outputs 64 (SHA-256 hex length) for every record.
LCP readings are non-zero — jq 'select(.lcp_ms == 0)' artifacts/*.json should print nothing; a zero value means metrics were collected before any largest-contentful-paint entry was recorded.
Log file contains no ERR_CONNECTION_REFUSED — grep -c ERR_CONNECTION_REFUSED "${LOG_DIR}/crawler-${RUN_ID}.log" prints 0.
Parquet row count matches JSON count — run python3 -c "import polars as pl; df = pl.read_parquet('metrics-*.parquet'); print(len(df))" and verify it equals the number of .json files.
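
The per-record checks in this list (hash length, non-zero LCP) can also be applied in code before artifacts are written, so bad records are caught at the source. The following sketch assumes the record shape produced by the metrics extractor; `validateRecord` and `CrawlRecord` are hypothetical names, and the extra checks on `cls` and `captured_at` are additions beyond the list above.

```typescript
// Hypothetical pre-write validation mirroring the checklist above.
interface CrawlRecord {
  url: string;
  lcp_ms: number;
  cls: number;
  dom_hash: string;
  captured_at: string;
}

// Returns a list of human-readable problems; an empty list means the record passed.
function validateRecord(r: CrawlRecord): string[] {
  const problems: string[] = [];

  // SHA-256 hex digests are exactly 64 lowercase hex characters
  if (!/^[0-9a-f]{64}$/.test(r.dom_hash)) {
    problems.push('dom_hash is not a 64-character hex digest');
  }
  // A zero LCP means no largest-contentful-paint entry was captured
  if (!(r.lcp_ms > 0)) {
    problems.push('lcp_ms is zero or negative');
  }
  // Assumption beyond the checklist: CLS must be a finite, non-negative number
  if (!Number.isFinite(r.cls) || r.cls < 0) {
    problems.push('cls is not a finite non-negative number');
  }
  // Assumption beyond the checklist: the timestamp must be parseable
  if (Number.isNaN(Date.parse(r.captured_at))) {
    problems.push('captured_at is not a parseable timestamp');
  }
  return problems;
}
```

Returning a list rather than throwing lets the crawler log every problem for a URL in one pass and decide separately whether to skip the record or fail the run.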

Troubleshooting

Failure: `TimeoutError` — page never reaches `networkidle`

Root cause: The target SPA makes long-polling or WebSocket connections that keep the network perpetually active, preventing networkidle from resolving.

Fix: Switch the hydration guard to poll document.readyState + sentinel only, bypassing networkidle:

await page.waitForFunction(
  (sel) => document.readyState === 'complete' && !!document.querySelector(sel),
  '#app-root',
  { timeout: 20_000 }
);

Failure: `ERR_CERT_AUTHORITY_INVALID` in CI

Root cause: The CI runner cannot validate the site's TLS certificate (self-signed staging cert or corporate proxy intercepting HTTPS).

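Fix: Mount the corporate CA bundle into the container so the certificate chain validates. NODE_EXTRA_CA_CERTS is read by Node.js; if Chromium itself still rejects the certificate, the CA may also need to be added to the browser's trust store in the image. For a disposable staging target with a self-signed certificate, creating the context with ignoreHTTPSErrors: true is an alternative, but keep it false in production.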

docker run --rm \
  -e NODE_EXTRA_CA_CERTS=/etc/ssl/certs/corp-ca.pem \
  -v /etc/ssl/certs/corp-ca.pem:/etc/ssl/certs/corp-ca.pem:ro \
  ghcr.io/team/headless-crawler:1.52.0 node crawl.js

Failure: `Out of memory` — container OOM-killed

Root cause: Opening many concurrent browser contexts without setting a memory ceiling. Each Chromium renderer process consumes 100–300 MB.

Fix: Reduce CRAWL_CONCURRENCY and set an explicit memory limit. Reuse browser contexts across pages instead of opening a new one per URL:

// Reuse one context for all pages in the batch
const context = await browser.newContext(/* options */);
for (const url of urlBatch) {
  const page = await context.newPage();
  await page.goto(url);
  // ... extract metrics
  await page.close();  // close page, not context
}
await context.close();
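
A rough ceiling for CRAWL_CONCURRENCY can be derived from the container memory limit and the per-renderer figure quoted above. This is a back-of-the-envelope sketch, not a measured model: `maxContextsForMemory` is a hypothetical helper, it assumes roughly one renderer per concurrently open page, and the `baselineMb` reserve for the browser process and Node runtime is an assumed value that should be replaced with a measured one.

```typescript
// Hypothetical sizing helper; all inputs are in megabytes.
function maxContextsForMemory(
  containerLimitMb: number,
  perRendererMb: number,
  baselineMb: number // assumed fixed overhead: browser process, Node runtime, buffers
): number {
  if (perRendererMb <= 0) {
    throw new Error('perRendererMb must be positive');
  }
  const available = containerLimitMb - baselineMb;
  // Always allow at least one context so the crawl can make progress
  return Math.max(1, Math.floor(available / perRendererMb));
}
```

With the `--memory=2g` limit from the run script, the pessimistic 300 MB per renderer, and an assumed 400 MB baseline, `maxContextsForMemory(2048, 300, 400)` gives 5, which leaves the recommended setting of 3 with some headroom.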

Failure: DOM snapshots are identical across all pages

Root cause: The SPA router is not re-rendering on page.goto() calls because all pages share the same hash-based route. The initial HTML shell is returned identically, and Playwright captures it before React/Vue mounts.

Fix: Force a fresh navigation by disabling the browser cache and waiting for the client-side router to settle:

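// Merge into the existing headers; passing a single header would replace all the others
await context.route('**/*', route =>
  route.continue({ headers: { ...route.request().headers(), 'cache-control': 'no-store' } })
);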
await page.goto(url, { waitUntil: 'commit' });
await waitForHydration(page, '#app-root');

Failure: LCP is 0 in every record

Root cause: PerformanceObserver registered after the LCP event already fired. The buffered: true flag should replay past entries, but in some Chromium versions this does not work for LCP when navigation completes before the observer is registered.

Fix: Register the observer before page.goto() using page.addInitScript():

await page.addInitScript(() => {
  (window as any).__lcpValue = 0;
  new PerformanceObserver((list) => {
    for (const e of list.getEntries()) {
      (window as any).__lcpValue = e.startTime;
    }
  }).observe({ type: 'largest-contentful-paint', buffered: true });
});
await page.goto(url, { waitUntil: 'networkidle' });
const lcp = await page.evaluate(() => (window as any).__lcpValue);

Failure: `--no-sandbox` causes pipeline failure on GitHub Actions

Root cause: The GitHub Actions runner uses a user-namespace sandbox by default. The Playwright Docker image handles this automatically, but bare-runner (non-container) jobs do not.

Fix: Use the official container image or install the required kernel capabilities:

jobs:
  crawl:
    runs-on: ubuntu-latest
    container:
      image: mcr.microsoft.com/playwright:v1.52.0-noble
      options: --ipc=host   # required for Chrome's shared memory IPC

FAQ

Should I use Puppeteer or Playwright for SPA crawling?

Playwright is preferred for production SPA auditing: it supports Chromium, Firefox, and WebKit in a single API, ships official Docker images with all system dependencies pre-installed, and provides page.route() for network interception at the context level. Puppeteer is narrower (Chromium-only via CDP) but is a viable choice if you need raw DevTools Protocol access or are maintaining an existing Puppeteer codebase.

How do I detect when a SPA has finished rendering?

Combine waitUntil: 'networkidle' with a waitForFunction call that checks both document.readyState === 'complete' and the presence of a sentinel element specific to the app's mounted state. Register your PerformanceObserver before navigation using addInitScript() so you do not miss early events. Avoid fixed setTimeout delays — they produce false positives on slow connections and inflated crawl times on fast ones.

How do I prevent the crawler from triggering origin rate limits?

Cap concurrency at 2–4 parallel browser contexts, add a 500–1000 ms polite delay between page navigations, identify the crawler with a descriptive User-Agent, and follow the managing crawl budget and rate limiting workflow to configure token-bucket controls at the orchestration layer.
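
For readers who have not seen a token bucket before, here is a minimal sketch of the idea. It is not the implementation from the referenced rate-limiting workflow: the `TokenBucket` class, its method names, and the injectable clock are assumptions made for illustration.

```typescript
// Hypothetical token bucket: allows bursts up to `capacity`, then throttles
// to `refillPerSecond` requests per second.
class TokenBucket {
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, now: () => number = () => Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity; // start full so the first requests go out immediately
    this.lastRefillMs = now();
  }

  // Add tokens for the time elapsed since the last call, capped at capacity.
  private refill(): void {
    const nowMs = this.now();
    const elapsedSec = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSecond);
    this.lastRefillMs = nowMs;
  }

  // Consume one token if available; returns false when the caller should wait.
  tryTake(): boolean {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }

  // Milliseconds until at least one token will be available.
  msUntilNextToken(): number {
    this.refill();
    if (this.tokens >= 1) return 0;
    return Math.ceil(((1 - this.tokens) / this.refillPerSecond) * 1000);
  }
}
```

A crawler loop would call `tryTake()` before each `page.goto()` and sleep for `msUntilNextToken()` when it returns false. A bucket with capacity 1 and a refill rate of 1 to 2 tokens per second approximates the 500–1000 ms polite delay described above.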

Why do LCP readings differ between headless and real Chrome?

Headless Chrome disables GPU compositing, which affects how paint timing is measured. Images fulfilled via route.fulfill() mock responses do not register the same way as real network loads in the paint timeline. Normalize by fixing the viewport dimensions, disabling HTTP cache, applying consistent CPU and network throttling via CDP, and always running the same Chromium version. Compare headless readings against real-browser Lighthouse runs periodically to calibrate any systematic offset.
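
One simple way to express the systematic offset is the median of the per-URL differences between real-browser and headless readings. The sketch below assumes paired LCP values for the same URLs are already available; `lcpOffsetMs`, `median`, and the `LcpPair` shape are hypothetical, and the median is just one reasonable estimator among several.

```typescript
// Hypothetical calibration helper for paired LCP readings of the same URLs.
interface LcpPair {
  headlessMs: number;
  realMs: number;
}

function median(values: number[]): number {
  if (values.length === 0) {
    throw new Error('median of an empty list is undefined');
  }
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Even count: average the two middle values
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Positive result: real-browser LCP is typically slower than headless by that many ms.
function lcpOffsetMs(pairs: LcpPair[]): number {
  return median(pairs.map(p => p.realMs - p.headlessMs));
}
```

The median is used instead of the mean so that a single outlier page (for example, one where the hero image was stubbed in the headless run) does not dominate the estimate.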

Related

Automated Crawling & Pipeline Tooling — parent section covering the full audit pipeline from crawler setup to artifact storage
Managing Crawl Budget & Rate Limiting — token-bucket rate limiters and concurrency controls that prevent origin degradation
Integrating Custom Crawlers with CI/CD Pipelines — wire headless crawl jobs into GitHub Actions and GitLab CI with environment variable injection
Storing & Versioning Crawl Artifacts in Cloud Storage — persist and version the HAR files, DOM snapshots, and Parquet metrics this workflow produces
Automating Screaming Frog with Python Scripts — complement headless browser output with structural link analysis from Screaming Frog
Scrapy vs Playwright for JS-Heavy Audits — choose the right engine per URL class and run a hybrid pipeline
