How often should the risk matrix be recalibrated?

Recalibrate threshold weights whenever a major site migration occurs, after a significant traffic pattern change (seasonality, campaign launch), or whenever post-fix validation reveals systematic misclassification — typically once per quarter under normal conditions.

What happens when CDN edge caching hides origin 5xx errors?

Filter nginx logs by `upstream_status` rather than the outer status code, or query your CDN's origin-shield metrics directly. A 200 at the edge with a 502 upstream is a hidden P0 — standard status-code tallies will miss it entirely.

11 min read

Prioritizing Critical vs Non-Critical Site Errors

Q: Can the same script handle both static and JS-rendered pages?

The Bash log-parsing step handles static HTTP status codes directly from nginx or CDN logs. For JS-rendered pages where a 200 response masks a hydration failure, the Python Playwright step is required. Run both in sequence; merge results on the URL key before scoring.

Enterprise technical audits routinely surface thousands of HTTP anomalies per crawl. Without a deterministic triage layer, engineering teams waste sprint capacity chasing low-impact warnings while revenue-blocking 5xx faults age in the backlog. This page shows how to apply a composite risk matrix — part of the broader risk scoring framework for technical debt — to classify every error by business impact and route it to the correct remediation queue automatically.

Error severity decision flow

The diagram below shows the full triage path from raw HTTP anomaly to ticket priority tier.

Environment isolation and dependency declaration

Run the triage script from a dedicated virtual environment so tool versions remain consistent across CI runs and local execution.

#!/usr/bin/env bash
set -euo pipefail

# Absolute paths — adjust SITE_ROOT to your deployment
export SITE_ROOT="/var/www/site-health-audit"
export LOG_DIR="/var/log/nginx"
export TRIAGE_VENV="/opt/triage-venv"
export REPORT_DIR="/tmp/triage-$(date +%Y%m%d)"

# Pin tool versions
PYTHON_BIN="${TRIAGE_VENV}/bin/python3"   # Python 3.12
HTTPX_VERSION="0.27.0"
PLAYWRIGHT_VERSION="1.44.0"

mkdir -p "${REPORT_DIR}"

# Verify dependencies
"${PYTHON_BIN}" -c "import httpx, playwright" || {
  echo "ERROR: missing dependencies — run:"
  echo "  ${TRIAGE_VENV}/bin/pip install httpx==${HTTPX_VERSION} playwright==${PLAYWRIGHT_VERSION}"
  exit 1
}

The REPORT_DIR is timestamped so successive runs never overwrite each other — a requirement when storing versioned crawl artifacts.

Implementation

The script below performs the full triage pipeline: parse nginx logs for HTTP status codes, detect hydration failures on JS-rendered pages, score each error using the composite matrix, and emit a YAML routing manifest.

#!/usr/bin/env bash
# /opt/triage-venv/bin/triage-errors.sh
# Composite error triage — outputs ${REPORT_DIR}/routing_manifest.yaml
set -euo pipefail

export SITE_ROOT="/var/www/site-health-audit"
export LOG_DIR="/var/log/nginx"
export TRIAGE_VENV="/opt/triage-venv"
export REPORT_DIR="/tmp/triage-$(date +%Y%m%d)"
PYTHON_BIN="${TRIAGE_VENV}/bin/python3"

mkdir -p "${REPORT_DIR}"

# ── Step 1: Aggregate HTTP status codes from nginx access log ──────────────
# Column 9 = HTTP status; column 7 = request URI
awk '{print $9, $7}' "${LOG_DIR}/access.log" \
  | grep -E '^[45][0-9]{2}' \
  | sort \
  | uniq -c \
  | sort -nr \
  > "${REPORT_DIR}/raw_errors.txt"

echo "[triage] Raw error counts written to ${REPORT_DIR}/raw_errors.txt"

# ── Step 2: Detect JS hydration failures (200 OK with minimal DOM) ─────────
"${PYTHON_BIN}" - <<'PYEOF'
import asyncio, json, os
from pathlib import Path
from playwright.async_api import async_playwright

REPORT_DIR = os.environ["REPORT_DIR"]
SITE_ROOT   = os.environ["SITE_ROOT"]

# Sample the top 20 high-traffic URLs from sitemap
SAMPLE_URLS = [
    f"https://site-health-audit.com{path}"
    for path in ["/", "/technical-audit-fundamentals-scope-mapping/",
                 "/metric-scoring-data-normalization/",
                 "/automated-crawling-pipeline-tooling/"]
]

async def check_hydration(url: str) -> dict:
    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page    = await browser.new_page()
        await page.goto(url, wait_until="networkidle", timeout=30_000)
        body_text = await page.inner_text("body")
        await browser.close()
        # Pages with fewer than 400 chars of rendered text are likely hydration failures
        return {"url": url, "hydration_ok": len(body_text.strip()) > 400,
                "body_chars": len(body_text.strip())}

results = asyncio.run(
    asyncio.gather(*[check_hydration(u) for u in SAMPLE_URLS])
)
out = Path(REPORT_DIR) / "hydration_check.json"
out.write_text(json.dumps(results, indent=2))
print(f"[triage] Hydration check written to {out}")
PYEOF

# ── Step 3: Score and route using composite matrix ─────────────────────────
"${PYTHON_BIN}" - <<'PYEOF'
import json, re, os
from pathlib import Path
import yaml  # PyYAML 6.0

REPORT_DIR = os.environ["REPORT_DIR"]

# Revenue-critical path prefixes (adjust per site)
REVENUE_PATHS = ["/checkout", "/purchase", "/subscribe", "/lead", "/contact"]

def revenue_critical(path: str) -> bool:
    return any(path.startswith(p) for p in REVENUE_PATHS)

def parse_raw_errors(filepath: str) -> list[dict]:
    entries = []
    pattern = re.compile(r"^\s*(\d+)\s+(\d{3})\s+(.+)$")
    for line in Path(filepath).read_text().splitlines():
        m = pattern.match(line)
        if m:
            entries.append({
                "count":  int(m.group(1)),
                "status": int(m.group(2)),
                "path":   m.group(3).strip()
            })
    return entries

def score_error(entry: dict, total_requests: int) -> dict:
    """
    Composite score: revenue_weight (0–4) + frequency_weight (0–3)
    + crawlability_weight (0–2) + severity_weight (0–1) = 0–10
    """
    rv = 4 if revenue_critical(entry["path"]) else 0
    fq = 3 if (entry["count"] / max(total_requests, 1)) > 0.01 else \
         2 if (entry["count"] / max(total_requests, 1)) > 0.001 else 1
    cw = 2 if entry["status"] in (500, 502, 503, 504) else \
         1 if entry["status"] == 404 else 0
    sv = 1 if entry["status"] >= 500 else 0
    total = rv + fq + cw + sv

    if total >= 7:   tier, sla = "P0", "15m"
    elif total >= 5: tier, sla = "P1", "4h"
    elif total >= 3: tier, sla = "P2", "2d"
    else:            tier, sla = "P3", "7d"

    return {**entry, "score": total, "tier": tier, "sla": sla}

raw      = parse_raw_errors(f"{REPORT_DIR}/raw_errors.txt")
total_rq = sum(e["count"] for e in raw)
scored   = sorted([score_error(e, total_rq) for e in raw],
                   key=lambda x: x["score"], reverse=True)

# Emit routing manifest
manifest = {
    "generated": "$(date -u +%Y-%m-%dT%H:%M:%SZ)",
    "total_requests_sampled": total_rq,
    "routes": scored
}
out = Path(REPORT_DIR) / "routing_manifest.yaml"
out.write_text(yaml.dump(manifest, default_flow_style=False, sort_keys=False))
print(f"[triage] Routing manifest written to {out}")
PYEOF

echo "[triage] Done. Review ${REPORT_DIR}/routing_manifest.yaml before dispatching tickets."

Key decisions in the implementation:

revenue_critical() checks the URL prefix against a configurable list — adjust REVENUE_PATHS to match your checkout and lead-gen routes before running.
score_error() applies four additive axes. Revenue weight dominates (0–4) so any error on /checkout/* immediately reaches P0 regardless of frequency.
The frequency axis uses request share rather than raw count, so a low-traffic site does not over-promote sparse errors.
upstream_status logging should be enabled in nginx ($upstream_status variable) to catch CDN-masked origin failures before they reach $status in column 9.

The routing manifest YAML is the input format consumed by the ticket-routing step of the risk scoring framework.

Verification and smoke test

After running the triage script, confirm each stage produced expected output.

#!/usr/bin/env bash
set -euo pipefail

REPORT_DIR="/tmp/triage-$(date +%Y%m%d)"

# 1. Confirm raw error file is non-empty
[[ -s "${REPORT_DIR}/raw_errors.txt" ]] \
  && echo "PASS: raw_errors.txt populated" \
  || { echo "FAIL: raw_errors.txt is empty — check nginx log path"; exit 1; }

# 2. Confirm hydration JSON parsed correctly
python3 -c "
import json, sys
data = json.load(open('${REPORT_DIR}/hydration_check.json'))
failures = [r for r in data if not r['hydration_ok']]
print(f'PASS: {len(data)} pages checked, {len(failures)} hydration failure(s)')
"

# 3. Confirm at least one P0 or P1 entry in routing manifest (warns, not fails)
python3 -c "
import yaml
manifest = yaml.safe_load(open('${REPORT_DIR}/routing_manifest.yaml'))
tiers = [r['tier'] for r in manifest.get('routes', [])]
p0 = tiers.count('P0'); p1 = tiers.count('P1')
print(f'INFO: P0={p0}, P1={p1}, P2={tiers.count(\"P2\")}, P3={tiers.count(\"P3\")}')
"

# 4. Validate YAML structure
python3 -c "
import yaml, sys
m = yaml.safe_load(open('${REPORT_DIR}/routing_manifest.yaml'))
assert 'routes' in m, 'missing routes key'
assert 'total_requests_sampled' in m, 'missing total_requests_sampled'
print('PASS: routing_manifest.yaml structure valid')
"

Expected output on a healthy run:

PASS: raw_errors.txt populated
PASS: 4 pages checked, 0 hydration failure(s)
INFO: P0=2, P1=5, P2=18, P3=47
PASS: routing_manifest.yaml structure valid

Failure signal: if raw_errors.txt is empty, the nginx access.log path is wrong or the log rotation window has already archived today's file. Check /var/log/nginx/ for date-stamped rotated logs and update LOG_DIR accordingly. After any fix, re-run the full script — do not patch the manifest by hand.

Failure modes

1. CDN edge cache masks origin 5xx errors

Symptom: raw_errors.txt shows no 5xx entries but users report intermittent server errors.

Diagnostic:

# Requires $upstream_status in nginx log_format
grep ' 50[0-9] ' /var/log/nginx/access.log | awk '{print $10}' | sort | uniq -c
# Column 10 = upstream_status when using combined_upstream log format

Fix: Add $upstream_status to the nginx log_format directive and reload nginx without a full restart:

nginx -t && nginx -s reload

Then flush the CDN cache for the affected paths so real origin responses propagate to users before the next crawl.

2. YAML import fails — `PyYAML` not installed

Symptom: ModuleNotFoundError: No module named 'yaml'

Diagnostic:

/opt/triage-venv/bin/pip show pyyaml

Fix:

/opt/triage-venv/bin/pip install "PyYAML==6.0.1"

Pin the version in your requirements.txt so the dependency lock file keeps crawl artifact storage reproducible across CI and local environments.

3. Playwright browser launch fails in headless CI

Symptom: playwright._impl._api_types.Error: Executable doesn't exist or sandbox permission errors.

Diagnostic:

/opt/triage-venv/bin/python -m playwright install-deps chromium 2>&1 | tail -20

Fix: Install the system dependencies and pin the browser binary:

/opt/triage-venv/bin/playwright install chromium
/opt/triage-venv/bin/playwright install-deps chromium

In Docker-based CI environments, add --no-sandbox to the Chromium launch args and ensure the container image includes libnss3 and libatk-bridge2.0-0. This is particularly relevant when integrating crawlers with CI/CD pipelines.

FAQ

Should all 404 errors be treated as critical?

No. A 404 on a path that has received zero organic sessions in the past 90 days and carries no inbound links scores low on both the frequency and revenue-weight axes — it will land at P3 and be scheduled for the backlog. Only 404s on revenue-critical paths, high-traffic editorial content, or pages with significant backlink equity warrant P1 or higher treatment. The aligning audit goals with business KPIs cluster covers how to pull GSC and GA4 session data to populate that traffic-weight axis automatically.

How often should the risk matrix weights be recalibrated?

Recalibrate the REVENUE_PATHS list and score thresholds whenever a major site migration occurs, after a significant traffic pattern shift (campaign launch, seasonality), or whenever post-fix validation shows systematic misclassification — typically once per quarter under normal operating conditions. When calibrating error thresholds for different site sections, export the scoring weights as environment variables so they can be updated per deployment without editing the script.

Can the same script handle both static and JS-rendered pages?

Yes, in sequence. The Bash awk step handles static HTTP status codes directly from nginx or CDN logs. For JS-rendered pages where a 200 response masks a hydration failure, the Python Playwright step is required. Merge both result sets on the URL key before scoring — the score_error() function processes hydration failures as synthetic 520 entries so they receive the same matrix treatment as real HTTP errors.

What if the routing manifest needs to feed directly into Jira?

Parse routing_manifest.yaml in a separate step and POST to the Jira REST API using the tier-to-priority mapping (P0→Highest, P1→High, P2→Medium, P3→Low) and the SLA field as the duedate. Keep the manifest write and the Jira dispatch as separate idempotent steps so a Jira API outage does not block the triage run — the manifest is the source of truth, Jira is a consumer.

Risk Scoring Frameworks for Technical Debt — parent cluster covering composite severity scoring, threshold configuration, and alert routing
Calibrating Error Thresholds for Different Site Sections — per-section threshold tuning to reduce false positives in the priority matrix
Integrating Custom Crawlers with CI/CD Pipelines — wiring triage scripts into pre-merge validation gates

Prioritizing Critical vs Non-Critical Site Errors #

Error severity decision flow #

Environment isolation and dependency declaration #

Implementation #

Verification and smoke test #

Failure modes #

1. CDN edge cache masks origin 5xx errors #

2. YAML import fails — PyYAML not installed #

3. Playwright browser launch fails in headless CI #

FAQ #

Related #

Prioritizing Critical vs Non-Critical Site Errors

Error severity decision flow

Environment isolation and dependency declaration

Implementation

Verification and smoke test

Failure modes

1. CDN edge cache masks origin 5xx errors

2. YAML import fails — `PyYAML` not installed

3. Playwright browser launch fails in headless CI

FAQ

Related