
Designing Custom Health Score Algorithms

Problem Framing

Without a reproducible algorithm translating raw metrics into a single auditable number, site health degrades silently: performance regressions hide behind dashboard averages, crawl errors accumulate in low-priority sections, and cross-team accountability collapses because "health" means something different to every stakeholder. SREs end up firefighting incidents that a well-calibrated composite score would have flagged a week earlier. SEO engineers lose confidence in trend data when the scoring formula changes without a paper trail.

This page walks through the full pipeline — ingestion, normalization, scheduling, artifact storage, and verification — to produce a composite health score that is deterministic, auditable, and safe to compare across releases. The parent context for all scoring work lives in Metric Scoring & Data Normalization, which covers the broader taxonomy of normalization approaches.

Prerequisites & Environment Setup

The pipeline depends on Python 3.11, a time-series store for raw telemetry, and object storage for scored artifacts. Pin every dependency to a lockfile and export all configuration as environment variables so the pipeline runs identically in local dev and CI.

Pinned tool versions

Tool	Pinned version	Purpose
Python	3.11.9	Scoring pipeline runtime
pandas	2.2.2	Metric aggregation and normalization
pydantic	2.7.1	Payload schema validation
numpy	1.26.4	Geometric mean computation
pyarrow	16.0.0	Parquet artifact serialization
flock (util-linux)	2.38	Concurrency guard on scoring runs

Required environment variables

# /etc/scoring-pipeline/env  — sourced by the cron wrapper
export SCORING_DB_URL="postgresql://scorer:${SCORING_DB_PASS}@db.internal:5432/audit_metrics"
export TELEMETRY_BUCKET="gs://site-audit-telemetry-prod"
export ARTIFACT_BUCKET="gs://site-audit-artifacts-prod"
export WEIGHTS_CONFIG="/opt/scoring/weights_matrix.json"
export SCORING_ENV="production"
export TZ="UTC"

Lockfile pattern — commit requirements.txt generated with pip-compile (pip-tools):

pip-compile \
  --generate-hashes \
  --output-file /opt/scoring/requirements.txt \
  /opt/scoring/requirements.in

Step 1 — Initialization: Ingestion Layer

Establish a deterministic ingestion layer before any algorithmic processing begins. Validate raw crawl exports, server log streams, and synthetic monitoring payloads against strict schemas. Execute the ETL workflow sequentially: extract → validate → normalize → score. Reject malformed records at the gateway; never let a bad payload silently corrupt downstream scores.

# /opt/scoring/ingest.py
# Requires: pydantic==2.7.1, pandas==2.2.2
from __future__ import annotations
import json
import sys
from pathlib import Path
from typing import Optional

import pandas as pd
from pydantic import BaseModel, ValidationError, field_validator


class TelemetryPayload(BaseModel):
    url: str
    timestamp: str               # ISO-8601, UTC
    lcp_ms: Optional[float]      # Largest Contentful Paint in milliseconds
    cls_score: Optional[float]   # Cumulative Layout Shift (unitless)
    inp_ms: Optional[float]      # Interaction to Next Paint in milliseconds
    wcag_violations: int         # Count of WCAG 2.1 AA violations

    @field_validator("lcp_ms", "inp_ms")
    @classmethod
    def cap_latency(cls, v: Optional[float]) -> Optional[float]:
        """Clamp latency values to a credible range; drop sensor noise."""
        return max(0.0, min(v, 30_000.0)) if v is not None else v

    @field_validator("cls_score")
    @classmethod
    def cap_cls(cls, v: Optional[float]) -> Optional[float]:
        return max(0.0, min(v, 1.0)) if v is not None else v


def validate_telemetry_payload(raw_json: dict) -> TelemetryPayload:
    try:
        return TelemetryPayload(**raw_json)
    except ValidationError as exc:
        raise ValueError(f"Schema violation: {exc}") from exc


def load_and_validate(ndjson_path: Path) -> pd.DataFrame:
    """Parse an NDJSON telemetry export and return a clean DataFrame."""
    records: list[dict] = []
    rejected = 0
    with ndjson_path.open() as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                payload = validate_telemetry_payload(json.loads(line))
                records.append(payload.model_dump())
            except (ValueError, json.JSONDecodeError) as exc:
                print(f"REJECTED: {exc}", file=sys.stderr)
                rejected += 1
    print(f"Loaded {len(records)} records, rejected {rejected}", file=sys.stderr)
    return pd.DataFrame(records)

Implement idempotent ingestion handlers: hash each payload on (url, timestamp) and skip records already present in the database. This prevents duplicate scoring during pipeline retries caused by flaky network writes or partial CI reruns.
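The idempotency key can be sketched as follows. This is a minimal illustration, assuming an in-memory set stands in for the database lookup; `payload_key` and `filter_new` are hypothetical helper names that do not appear elsewhere in the pipeline.

```python
import hashlib


def payload_key(url: str, timestamp: str) -> str:
    """Stable SHA-256 key derived from the (url, timestamp) pair."""
    # A newline separator cannot appear in a URL, so distinct pairs stay distinct.
    return hashlib.sha256(f"{url}\n{timestamp}".encode("utf-8")).hexdigest()


def filter_new(records: list[dict], seen_keys: set[str]) -> list[dict]:
    """Return only records whose key is not yet in seen_keys, updating seen_keys."""
    fresh = []
    for rec in records:
        key = payload_key(rec["url"], rec["timestamp"])
        if key in seen_keys:
            continue  # already ingested on a previous attempt
        seen_keys.add(key)
        fresh.append(rec)
    return fresh
```

In a real deployment the set would likely be replaced by a unique index on the key column, so the database itself rejects duplicates on retry.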

Step 2 — Core Configuration: Dynamic Metric Weighting

Raw metrics operate on incompatible scales: milliseconds, unitless ratios, integer counts. Before combining them, cap outliers with the IQR rule, then apply min-max scaling to bound every metric to [0, 1]. A separate z-score step is unnecessary here, because min-max scaling gives the same result on z-scored values as on the raw values. Only then apply the weight matrix.

Key parameters table

Parameter	Type	Default	Purpose
`indexability_weight`	float	0.35	Fraction of composite score attributed to crawl/indexability signals
`performance_weight`	float	0.35	Fraction attributed to Core Web Vitals (LCP, CLS, INP)
`accessibility_weight`	float	0.30	Fraction attributed to WCAG 2.1 AA compliance
`decay_halflife_days`	int	30	Days until an observation's temporal weight halves
`outlier_iqr_multiplier`	float	1.5	IQR multiplier for pre-normalization outlier capping
`geo_mean_epsilon`	float	1e-9	Additive constant preventing log(0) in geometric mean

Store these values in a versioned JSON file tracked in Git, not hardcoded in the scoring script:

// /opt/scoring/weights_matrix.json  (version-controlled)
{
  "version": "2.3.0",
  "weights": {
    "indexability_score": 0.35,
    "performance_score": 0.35,
    "accessibility_score": 0.30
  },
  "normalization": {
    "method": "minmax",
    "outlier_iqr_multiplier": 1.5
  },
  "temporal": {
    "decay_halflife_days": 30,
    "window_days": 90
  }
}
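Because the weights are described as fractions of the composite score, it may be worth checking the file before a run. The sketch below is one possible check, not part of the original pipeline; `validate_weights_config` is a hypothetical helper, and the rule that weights must be positive and sum to 1.0 is an assumption drawn from the parameters table.

```python
import json
import math


def validate_weights_config(config: dict, tolerance: float = 1e-9) -> dict:
    """Raise ValueError if the weights are malformed; return them otherwise."""
    weights = config.get("weights")
    if not isinstance(weights, dict) or not weights:
        raise ValueError("config must contain a non-empty 'weights' object")
    for name, value in weights.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or value <= 0:
            raise ValueError(f"weight {name!r} must be a positive number, got {value!r}")
    total = sum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"weights sum to {total}, expected 1.0")
    return weights


example = json.loads("""
{"version": "2.3.0",
 "weights": {"indexability_score": 0.35,
             "performance_score": 0.35,
             "accessibility_score": 0.30}}
""")
checked = validate_weights_config(example)
```

The scoring code renormalizes the weight vector anyway, so a check like this mainly catches typos before they silently shift the balance between components.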

# /opt/scoring/score.py
# Requires: numpy==1.26.4, pandas==2.2.2
from __future__ import annotations
import json
import os
from pathlib import Path

import numpy as np
import pandas as pd


def load_weights(config_path: str | None = None) -> dict:
    path = Path(config_path or os.environ["WEIGHTS_CONFIG"])
    with path.open() as fh:
        return json.load(fh)


def cap_outliers(series: pd.Series, iqr_multiplier: float = 1.5) -> pd.Series:
    """Winsorise extreme values before normalization."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - iqr_multiplier * iqr, upper=q3 + iqr_multiplier * iqr)


def apply_min_max_scaling(series: pd.Series) -> pd.Series:
    lo, hi = series.min(), series.max()
    if hi == lo:
        return pd.Series(1.0, index=series.index)
    return (series - lo) / (hi - lo)


def calculate_weighted_geometric_mean(
    df: pd.DataFrame,
    weights: dict[str, float],
    epsilon: float = 1e-9,
) -> pd.Series:
    """
    Weighted geometric mean across metric columns.
    Penalises severe outliers more aggressively than arithmetic averaging.
    """
    cols = list(weights.keys())
    scaled = df[cols].apply(apply_min_max_scaling)
    log_scaled = np.log(scaled + epsilon)
    weight_vector = np.array([weights[c] for c in cols], dtype=float)
    weight_vector /= weight_vector.sum()          # normalise to sum=1
    return np.exp((log_scaled * weight_vector).sum(axis=1))


def apply_time_decay(df: pd.DataFrame, halflife_days: int) -> pd.Series:
    """Exponential time-decay weight: recent observations count more."""
    # Timestamp.utcnow() is deprecated as of pandas 2.2; request a UTC-aware "now" explicitly
    age_days = (pd.Timestamp.now(tz="UTC") - pd.to_datetime(df["timestamp"], utc=True)).dt.days
    return np.exp(-np.log(2) * age_days / halflife_days)
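The source does not show how the decay weights are combined with the scores. One plausible reading is a decay-weighted mean of a URL's observations, sketched below with the standard library only; `decay_weight` and `decay_weighted_score` are hypothetical names, and the weighted-mean combination itself is an assumption.

```python
import math


def decay_weight(age_days: float, halflife_days: float) -> float:
    """Weight halves every halflife_days: equivalent to 2 ** (-age / halflife)."""
    return math.exp(-math.log(2) * age_days / halflife_days)


def decay_weighted_score(
    observations: list[tuple[float, float]], halflife_days: float
) -> float:
    """observations holds (age_days, score) pairs; returns their decay-weighted mean."""
    weights = [decay_weight(age, halflife_days) for age, _ in observations]
    total = sum(weights)
    if total == 0:
        raise ValueError("no usable observations")
    weighted = sum(w * score for w, (_, score) in zip(weights, observations))
    return weighted / total


# With a 30-day half-life, a 30-day-old observation counts half as much as today's.
history = [(0, 0.60), (30, 0.90)]
blended = decay_weighted_score(history, halflife_days=30)
```

Here the blended score lands at about 0.70: today's 0.60 carries weight 1.0 and the month-old 0.90 carries weight 0.5.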

See How to Weight Core Web Vitals in Custom Dashboards for a worked example of tuning the performance_weight split across LCP, CLS, and INP sub-components.

Scoring Pipeline Architecture

The diagram below shows how raw telemetry moves through the pipeline stages — validation, normalization, weighted scoring, and artifact storage — before feeding the alert routing layer.

Step 3 — Execution & Scheduling

Run daily incremental updates (new records since last run) and full weekly recalculations (reprocess the 90-day window to pick up weight matrix changes). Use flock to prevent concurrent scoring runs from writing conflicting artifacts.

#!/usr/bin/env bash
# /opt/scoring/run_scoring.sh
set -euo pipefail

LOCKFILE="/var/lock/site-health-scoring.lock"
LOG="/var/log/site-health-scoring/$(date -u +%Y%m%d-%H%M%S).log"
SCRIPT_DIR="/opt/scoring"

exec 200>"${LOCKFILE}"
flock -n 200 || { echo "Scoring already running — exiting." >&2; exit 1; }

mkdir -p "$(dirname "${LOG}")"

{
  echo "=== Scoring run started at $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
  echo "SCORING_ENV=${SCORING_ENV}"
  echo "WEIGHTS_CONFIG=${WEIGHTS_CONFIG}"

  # Tag this run with the current Git SHA and config hash
  export SCORING_VERSION
  SCORING_VERSION=$(git -C "${SCRIPT_DIR}" rev-parse --short HEAD)
  export CONFIG_HASH
  CONFIG_HASH=$(sha256sum "${WEIGHTS_CONFIG}" | awk '{print $1}')
  echo "SCORING_VERSION=${SCORING_VERSION} CONFIG_HASH=${CONFIG_HASH}"

  # Pass through the mode supplied by cron ("incremental" or "full"); default to incremental
  python3 "${SCRIPT_DIR}/ingest.py" --mode "${1:-incremental}"
  python3 "${SCRIPT_DIR}/score.py" \
    --weights "${WEIGHTS_CONFIG}" \
    --output-bucket "${ARTIFACT_BUCKET}" \
    --scoring-version "${SCORING_VERSION}" \
    --config-hash "${CONFIG_HASH}"

  echo "=== Scoring run completed at $(date -u +%Y-%m-%dT%H:%M:%SZ) ==="
} 2>&1 | tee "${LOG}"

Cron entries — add to /etc/cron.d/site-health-scoring:

# Daily incremental scoring at 03:15 UTC
15 3 * * * scoring-user bash /opt/scoring/run_scoring.sh incremental

# Full weekly recalculation on Sunday at 04:00 UTC (reprocesses 90-day window)
0 4 * * 0   scoring-user bash /opt/scoring/run_scoring.sh full

When running inside a CI pipeline instead of cron, pair the flock guard with a workflow-level concurrency key. flock only serializes runs on a single host, so parallel PR runs on separate runners could otherwise race on the same artifact prefix.

For strategies on integrating this scoring pipeline with a CI/CD system, see the CI/CD integration guide, which covers environment-variable injection and artifact promotion across staging and production.

Step 4 — Artifact Capture & Storage

Write each scoring run's output as a date-partitioned Parquet file. Tag every artifact with the Git commit SHA and configuration hash so any score in history can be reproduced by checking out the corresponding commit and replaying the pipeline against the archived telemetry.

# /opt/scoring/artifact.py
# Requires: pyarrow==16.0.0, pandas==2.2.2
from __future__ import annotations
import hashlib
import json
import os
from datetime import datetime, timezone
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


RETENTION_DAYS = 90   # delete artifacts older than this on the next purge run


def write_scored_artifact(
    scored_df: pd.DataFrame,
    scoring_version: str,
    config_hash: str,
    bucket: str,
    run_date: datetime | None = None,
) -> str:
    """
    Serialise a scored DataFrame to Parquet and write to object storage.
    Returns the artifact path for downstream consumers.
    """
    run_date = run_date or datetime.now(tz=timezone.utc)
    partition = run_date.strftime("%Y/%m/%d")

    # Embed provenance metadata into the Parquet schema
    metadata = {
        b"scoring_version": scoring_version.encode(),
        b"config_hash": config_hash.encode(),
        b"run_timestamp": run_date.isoformat().encode(),
        b"retention_days": str(RETENTION_DAYS).encode(),
    }

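    table = pa.Table.from_pandas(scored_df)
    # Merge with the existing schema metadata so the pandas index/dtype info survives
    table = table.replace_schema_metadata({**(table.schema.metadata or {}), **metadata})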
    artifact_path = (
        f"{bucket}/scored/{partition}/health_scores_{scoring_version}.parquet"
    )
    pq.write_table(table, artifact_path, compression="snappy")
    print(f"Artifact written: {artifact_path}")
    return artifact_path

Retention policy — run a weekly purge job that deletes Parquet files older than RETENTION_DAYS. For storing and versioning crawl artifacts in cloud storage, the artifact versioning guide covers lifecycle rules that automate this in GCS and S3 without a separate script.
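If a script is used rather than bucket lifecycle rules, the selection step of the purge can be sketched as below. It only decides which paths are expired and leaves listing and deletion to the storage client; `expired_artifacts` is a hypothetical helper, and the path layout is assumed to match the `scored/YYYY/MM/DD/` partitions written above.

```python
import re
from datetime import date, timedelta

PARTITION_RE = re.compile(r"/scored/(\d{4})/(\d{2})/(\d{2})/")


def expired_artifacts(
    paths: list[str], today: date, retention_days: int = 90
) -> list[str]:
    """Return the paths whose date partition is older than retention_days."""
    cutoff = today - timedelta(days=retention_days)
    expired = []
    for path in paths:
        match = PARTITION_RE.search(path)
        if match is None:
            continue  # not a date-partitioned artifact; leave it alone
        try:
            partition_date = date(*map(int, match.groups()))
        except ValueError:
            continue  # digits that do not form a real calendar date
        if partition_date < cutoff:
            expired.append(path)
    return expired
```

Skipping paths that do not match the partition pattern is a deliberate choice: a purge job that ignores unknown files is safer than one that guesses.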

Threshold Calibration & Alert Routing

Establish dynamic alerting boundaries adapted to site architecture. Commercial landing pages require stricter boundaries than low-priority utility endpoints. The full calibration methodology is in Calibrating Error Thresholds for Different Site Sections.

Configure adaptive thresholds using rolling percentile baselines. Calculate P95 and P99 values over a 30-day sliding window. Map severity tiers to automated routing rules. Enforce cooldown periods to suppress alert fatigue during active deployment windows.
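The rolling percentile baseline can be sketched with the standard library. This is an illustration under stated assumptions: it uses the nearest-rank percentile definition (other definitions interpolate and give slightly different values), and `percentile` and `rolling_baseline` are hypothetical helper names.

```python
import math
from datetime import date, timedelta


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of an empty sample is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def rolling_baseline(
    samples: list[tuple[date, float]], as_of: date, window_days: int = 30
) -> dict[str, float]:
    """P95 and P99 of the samples observed in the window_days ending at as_of."""
    start = as_of - timedelta(days=window_days)
    window = [value for day, value in samples if start < day <= as_of]
    return {"p95": percentile(window, 95), "p99": percentile(window, 99)}
```

Recomputing the baseline on each run means the thresholds follow the section's own recent behaviour instead of a fixed number chosen once.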

# /opt/scoring/alert_routing_config.yaml
thresholds:
  commercial:
    lcp_p95: 2500         # milliseconds — Good threshold per CrUX
    cls_p99: 0.15
    inp_p95: 200
  utility:
    lcp_p95: 4000
    cls_p99: 0.25
    inp_p95: 500
  wcag_compliance:
    critical_violations: 0
    warning_violations: 5

routing:
  critical:
    channel: pagerduty-sre
    cooldown_minutes: 30
  warning:
    channel: slack-seo-team
    cooldown_minutes: 60
  info:
    channel: slack-monitoring
    cooldown_minutes: 1440
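How the cooldown values might be enforced is sketched below. The source does not describe the routing implementation, so `AlertGate` is a hypothetical class, and keeping state in memory is an assumption made for brevity; a real router would persist the last-sent times.

```python
from datetime import datetime, timedelta, timezone

# Mirrors the routing block above; values are cooldowns in minutes.
COOLDOWN_MINUTES = {"critical": 30, "warning": 60, "info": 1440}


class AlertGate:
    """Suppress repeat alerts for the same (severity, key) inside the cooldown window."""

    def __init__(self, cooldowns: dict[str, int]):
        self._cooldowns = cooldowns
        self._last_sent: dict[tuple[str, str], datetime] = {}

    def should_send(self, severity: str, key: str, now: datetime) -> bool:
        cooldown = timedelta(minutes=self._cooldowns[severity])
        last = self._last_sent.get((severity, key))
        if last is not None and now - last < cooldown:
            return False  # still cooling down; drop the duplicate
        self._last_sent[(severity, key)] = now
        return True
```

Keying on both severity and alert key means a critical alert is never suppressed by an earlier warning for the same metric.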

Verification Checklist

After each scoring run, confirm correctness before treating results as authoritative:

Check the scoring log for the line === Scoring run completed at — absence means the pipeline exited early.
Verify the Parquet artifact exists at the expected partition prefix and is non-empty: gsutil ls -l "gs://site-audit-artifacts-prod/scored/$(date -u +%Y/%m/%d)/".
Assert the scoring_version metadata field in the artifact matches the Git HEAD: python3 -c "import pyarrow.parquet as pq; m=pq.read_metadata('path/to/file.parquet').metadata; print(m[b'scoring_version'])".
Run a score-distribution sanity check — composite scores outside [0.05, 0.99] on a healthy site indicate a normalization defect or a data ingestion failure.
Diff the P50 composite score against the previous run: a delta greater than 0.10 without a corresponding deployment event warrants investigation. Use the tracking metric trends across release cycles approach to correlate score shifts with CI events.
Confirm the dead-letter queue count did not increase by more than 2% of total records — a spike signals upstream schema drift in the telemetry source.
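The score-distribution and P50-delta checks from the list above can be automated along these lines. This is a sketch, assuming composite scores arrive as a plain list of floats; `check_score_run` is a hypothetical helper and its defaults simply restate the bounds given in the checklist.

```python
import statistics


def check_score_run(
    scores: list[float],
    previous_p50: float,
    low: float = 0.05,
    high: float = 0.99,
    max_p50_delta: float = 0.10,
) -> list[str]:
    """Return human-readable warnings; an empty list means the run looks sane."""
    if not scores:
        raise ValueError("no scores to check")
    warnings = []
    out_of_band = [s for s in scores if not low <= s <= high]
    if out_of_band:
        warnings.append(
            f"{len(out_of_band)} of {len(scores)} scores outside [{low}, {high}]"
        )
    p50 = statistics.median(scores)
    if abs(p50 - previous_p50) > max_p50_delta:
        warnings.append(f"P50 moved from {previous_p50:.2f} to {p50:.2f}")
    return warnings
```

Returning warnings rather than raising lets the caller decide whether a finding should block publication or only annotate the run.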

Troubleshooting

Composite scores are flat (all pages score ~0.95)

Root cause: the min-max scaler collapses when the input distribution has very low variance — often caused by a data ingestion gap that left only one day of records in the scoring window.

# Check record counts per day in the scoring window
python3 -c "
import pandas as pd
df = pd.read_parquet('/tmp/latest_telemetry.parquet')
print(df.groupby(df['timestamp'].str[:10]).size())
"

Fix: verify the ingestion cron ran for the past 7 days and re-run the full recalculation after confirming data completeness.

Schema validation rejects >5% of records after a telemetry source update

Root cause: the upstream source changed a field name or unit (e.g. lcp_ms renamed to lcp_milliseconds).

# Inspect the first rejected record in the dead-letter queue
gsutil cat "gs://site-audit-telemetry-prod/dead-letter/latest.ndjson" | head -1 | python3 -m json.tool

Fix: update the TelemetryPayload model in ingest.py to accept both the old and new field names using a model_validator, then coordinate with the telemetry producer to agree on a cutover date.

Scoring run blocked — lockfile not released

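Root cause: a process from a previous run is still alive and holds the lock's file descriptor, typically a hung Python step or a child process that inherited descriptor 200. A crash alone does not cause this, because the kernel releases a flock lock once every descriptor on it is closed.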

# Find the PID holding the lockfile
flock --nonblock /var/lock/site-health-scoring.lock echo "lock is free" || \
  lsof /var/lock/site-health-scoring.lock

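Fix: confirm no artifact write is in progress, then kill the stale process. The lock is released as soon as that process exits. Leave the lockfile in place: deleting it while another run still holds it open lets a second run lock a fresh file at the same path.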

Parquet artifact checksum mismatch between staging and production

Root cause: the weights_matrix.json config used in staging differs from production — usually a local edit that was not committed.

# Compare config hashes across environments
ssh staging-host "sha256sum /opt/scoring/weights_matrix.json"
ssh prod-host    "sha256sum /opt/scoring/weights_matrix.json"
git -C /opt/scoring show HEAD:weights_matrix.json | sha256sum

Fix: commit the production-validated weights_matrix.json and deploy it through the standard release pipeline.

Time-decay weights make recent score history unusable

Root cause: decay_halflife_days is set too aggressively (e.g. 7 days), so a two-week-old data point carries only 25% of the weight of a fresh one and a four-week-old point about 6%. The composite is then dominated by the last few days of data, which inflates its variance.

# Inspect effective decay weights for your current window
python3 -c "
import numpy as np, pandas as pd
halflife = 7   # current setting
ages = range(0, 91)
weights = [np.exp(-np.log(2)*d/halflife) for d in ages]
for d, w in zip(ages, weights):
    print(f'day -{d:3d}: weight={w:.4f}')
" | head -30

Fix: increase decay_halflife_days to 30 (the recommended default) in weights_matrix.json and commit the change.

Algorithm version mismatch causes trend discontinuity

Root cause: scoring_version in artifact metadata does not match the Git SHA used to score the historical baseline, making longitudinal comparison invalid.

# Audit provenance of the last 10 artifacts
gsutil ls -r "gs://site-audit-artifacts-prod/scored/**/*.parquet" | tail -10 | while IFS= read -r path; do
  python3 -c "
import pyarrow.parquet as pq, sys
m = pq.read_metadata('${path}').metadata
print('${path}', m.get(b'scoring_version', b'unknown').decode())
"
done

Fix: if a batch of artifacts was scored with an untagged version, re-run the full recalculation after pinning the correct SCORING_VERSION in the environment.

Common Mistakes

Hardcoding static weights that ignore seasonal traffic variance — device mix and crawl frequency shift around product launches and holiday periods.
Mixing incompatible scales without normalization, such as combining milliseconds with boolean flags in the same geometric mean.
Skipping device stratification before normalization, which masks mobile performance degradation behind desktop averages. See normalizing performance data across device types for the stratification pattern.
Omitting algorithm versioning, which destroys historical trend analysis when the weight matrix changes.
Deploying static thresholds that trigger over-alerting across complex page architectures — use the rolling percentile baseline approach instead.
Setting decay_halflife_days below 14, which causes composite scores to oscillate with normal weekly traffic cycles rather than reflecting genuine health changes.

Why use a weighted geometric mean instead of an arithmetic average for health scores?

A weighted geometric mean penalizes severe outliers more aggressively. A single catastrophic LCP regression pulls the composite score down proportionally rather than being diluted by well-performing metrics, which better reflects real user impact. With an arithmetic mean, a site with one 8,000 ms LCP page and nineteen 1,200 ms pages would score only slightly below a site where every page loads in under 2,000 ms.
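A small worked comparison makes the difference concrete. The page below is hypothetical: its component scores are assumed to be already normalized to [0, 1], and the weights are the defaults from the parameters table.

```python
import math

weights = {
    "indexability_score": 0.35,
    "performance_score": 0.35,
    "accessibility_score": 0.30,
}
# Hypothetical page: healthy indexability and accessibility, collapsed performance.
page = {
    "indexability_score": 0.90,
    "performance_score": 0.10,
    "accessibility_score": 0.90,
}

arithmetic = sum(weights[k] * page[k] for k in weights)
geometric = math.exp(sum(weights[k] * math.log(page[k]) for k in weights))

print(f"arithmetic={arithmetic:.3f} geometric={geometric:.3f}")
```

The arithmetic mean comes out at about 0.62, while the geometric mean drops to about 0.42, so the collapsed performance component is much harder to hide behind the two healthy ones.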

How often should weight matrices be recalibrated?

Recalibrate after any major product release, a significant traffic mix shift (e.g. a new device category crossing 20% share), or when a metric's distribution drifts beyond two standard deviations from its 90-day rolling baseline. Treat recalibration as a release: commit the new weights_matrix.json, tag the commit, and document the business rationale in the commit message.
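The two-standard-deviation trigger can be checked with a few lines of standard-library code. This sketch interprets "drift" as the current mean moving more than two baseline standard deviations away from the baseline mean, which is one reasonable reading rather than the only one; `has_drifted` is a hypothetical helper.

```python
import statistics


def has_drifted(
    baseline: list[float], current: list[float], max_sigma: float = 2.0
) -> bool:
    """True when the current mean is more than max_sigma baseline std devs from the baseline mean."""
    if len(baseline) < 2 or not current:
        raise ValueError("need at least two baseline values and one current value")
    mean = statistics.fmean(baseline)
    sigma = statistics.stdev(baseline)
    current_mean = statistics.fmean(current)
    if sigma == 0:
        return current_mean != mean  # a constant baseline drifts on any change
    return abs(current_mean - mean) / sigma > max_sigma
```

The caller is expected to pass the 90-day rolling window as `baseline` and the most recent observations as `current`.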

Can I score pages with missing metric values?

Yes. Impute null values with the P75 of the same URL's 30-day rolling window. If no history exists, fall back to the site-wide median. Document the imputation policy in weights_matrix.json under an "imputation" key so reviewers understand score lineage and auditors can reproduce historical scores.
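The fallback order described above can be sketched as a single function. It assumes the caller has already restricted `url_history` to the 30-day window and uses the nearest-rank P75; `impute_metric` is a hypothetical name.

```python
import math
import statistics
from typing import Optional


def impute_metric(
    value: Optional[float], url_history: list[float], site_values: list[float]
) -> float:
    """Fill a missing metric: URL-level P75 first, then the site-wide median."""
    if value is not None:
        return value
    if url_history:
        ordered = sorted(url_history)
        # Nearest-rank P75 over the URL's own recent observations.
        rank = max(1, math.ceil(75 * len(ordered) / 100))
        return ordered[rank - 1]
    if site_values:
        return statistics.median(site_values)
    raise ValueError("no history available to impute from")
```

Raising when neither source has data keeps a page with no evidence at all from receiving a silently invented score.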

What storage backend is best for scoring artifacts?

Parquet on object storage (S3, GCS, or R2) with date-partitioned prefixes is the standard choice. It supports columnar queries, integrates with BigQuery and ClickHouse, and keeps file sizes manageable for daily incremental writes. Cache frequently accessed weight matrices in Redis to avoid repeated object-storage reads during high-frequency scoring windows.

Related

Metric Scoring & Data Normalization — parent section covering the full normalization and scoring taxonomy
How to Weight Core Web Vitals in Custom Dashboards — worked example for splitting the performance weight across LCP, CLS, and INP
Calibrating Error Thresholds for Different Site Sections — adaptive threshold configuration for commercial vs utility pages
Normalizing Performance Data Across Device Types — device stratification before aggregation
Tracking Metric Trends Across Release Cycles — correlating score shifts with CI/CD deployment events
