5 min read

Setting Up a Cron Job for Weekly Site Crawls

Ad-hoc crawl execution creates inconsistent data snapshots. Teams miss regression windows. Crawl budget consumption goes untracked. You must implement deterministic cron scheduling. Isolate the execution environment. Resolve paths explicitly. Structure logs for downstream analysis.

1. Environment Isolation & Dependency Resolution

Cron strips standard shell variables. Missing $PATH values break virtual environments. Define a strict execution context. Align base dependencies with the broader Automated Crawling & Pipeline Tooling architecture. This guarantees consistent runtime behavior across dev, stage, and prod. Use absolute paths for every binary, config file, and lock directory.

Wrap the crawler in a deterministic bash script. Activate the virtual environment explicitly. Export required paths before invoking the binary.

#!/usr/bin/env bash
set -euo pipefail

# Absolute path declarations
CRAWLER_BIN="/opt/crawler/bin/crawl_engine"
CONFIG_PATH="/etc/crawler/weekly.yaml"
VENV_PATH="/opt/crawler/venv"
LOG_DIR="/var/log/crawler"

# Environment isolation
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
source "${VENV_PATH}/bin/activate"

# Execute crawler
exec "${CRAWLER_BIN}" --config "${CONFIG_PATH}"

Validate the wrapper before scheduling. Run the script manually with env -i to simulate the stripped cron environment. Verify that which python3 resolves to the correct virtualenv binary. Confirm dependency lockfile checksums match the production baseline.

Roll back immediately if dependencies fail. Comment out the wrapper execution line in crontab. Restore the previous virtualenv snapshot from backup. Revert to system-wide package management if the isolated environment remains unstable.

2. Cron Expression & Timezone Configuration

Replace the ambiguous @weekly macro. Use explicit 5-field cron syntax to enforce deterministic execution windows. Configure CRON_TZ to prevent UTC and local drift during daylight saving transitions. Implement flock-based concurrency guards. This prevents overlapping weekly executions from exhausting memory.

Deploy the scheduler with explicit timezone guards and file locking. Redirect all output to a centralized log stream.

# /etc/crontab or user crontab
CRON_TZ=UTC
0 2 * * 0 /usr/bin/flock -n /tmp/crawl.lock /opt/scripts/weekly_crawl.sh >> /var/log/crawl.log 2>&1

Systemd offers a robust alternative for complex scheduling. Map the timer to the same execution window.

# /etc/systemd/system/crawl-weekly.timer
[Unit]
Description=Weekly Site Crawl Timer

[Timer]
OnCalendar=Sun *-*-* 02:00:00
Timezone=UTC
Persistent=true

[Install]
WantedBy=timers.target

Validate the schedule before activation. Parse the cron expression using croniter or systemd-analyze calendar. Verify lock file creation and release during a dry-run. Confirm log rotation does not truncate active write streams. Implement the following rotation policy to prevent disk exhaustion.

# /etc/logrotate.d/crawler
/var/log/crawl.log {
 weekly
 rotate 12
 compress
 delaycompress
 missingok
 notifempty
 create 0640 crawler crawler
}

Roll back if scheduling conflicts arise. Remove the crontab line via crontab -r or comment it out. Clear stale /tmp/crawl.lock files if the process terminates unexpectedly. Revert to manual execution until the scheduler configuration stabilizes.

3. Execution Wrapper & Artifact Pipeline

Structure the execution pipeline to emit structured logs. Capture explicit exit codes. Route crawl artifacts to designated storage. Map artifact handoff patterns to Integrating Custom Crawlers with CI/CD Pipelines for downstream QA and alerting workflows. Enforce strict timeout boundaries. Runaway processes consume crawl budget and block subsequent jobs.

Wrap the binary with a hard timeout. Format logs as JSON. Upload artifacts with cryptographic checksums.

#!/usr/bin/env bash
set -euo pipefail

TIMEOUT_SECONDS=7200
OUTPUT_DIR="/opt/crawler/artifacts/$(date +%Y%m%d)"
LOG_FILE="/var/log/crawler/execution.log"

mkdir -p "${OUTPUT_DIR}"

# Enforce timeout and redirect structured logs
timeout "${TIMEOUT_SECONDS}" /opt/crawler/bin/run \
 --config /etc/crawler/weekly.yaml \
 --output-dir "${OUTPUT_DIR}" \
 --log-level json 2>&1 | tee -a "${LOG_FILE}"

EXIT_CODE=$?
echo "{\"exit_code\": ${EXIT_CODE}, \"timestamp\": \"$(date -u +%Y-%m-%dT%H:%M:%SZ)\"}" >> "${LOG_FILE}"

Validate pipeline integrity after execution. Confirm exit code 0 on success. Expect exit code 124 on timeout. Validate artifact file size exceeds zero. Verify JSON schema compliance. Check cloud storage upload completion via API response.

Roll back if artifacts corrupt. Kill active crawler processes via pkill -f crawler. Delete incomplete artifacts from the staging bucket. Revert to the previous crawler binary version in /opt/crawler/bin/.

4. Health Monitoring & Alerting Integration

Deploy post-execution validation scripts. Detect silent failures, empty result sets, and rate-limit violations. Integrate with your observability stack. Push custom metrics to Prometheus, Datadog, or CloudWatch. Establish threshold-based alerting for crawl duration, error rates, and artifact size anomalies. Correlate crawl outputs with LCP, CLS, INP, and WCAG compliance baselines to track technical debt.

Validate artifact structure and push metrics to your monitoring endpoint.

# validate_and_push_metrics.py
import json
import os
import sys
import requests

ARTIFACT_PATH = sys.argv[1]
METRIC_ENDPOINT = os.getenv("METRIC_ENDPOINT")

def validate_schema(path):
 size = os.path.getsize(path)
 if size == 0:
 raise ValueError("Empty artifact generated")
 with open(path, 'r') as f:
 data = json.load(f)
 required_keys = {"urls_crawled", "status_codes", "lcp_scores", "wcag_violations"}
 if not required_keys.issubset(data.keys()):
 raise ValueError("Schema mismatch")
 return size

try:
 size = validate_schema(ARTIFACT_PATH)
 requests.post(METRIC_ENDPOINT, json={
 "metric": "crawl_artifact_size_bytes",
 "value": size,
 "status": "success"
 })
except Exception as e:
 requests.post(METRIC_ENDPOINT, json={
 "metric": "crawl_failure_count",
 "value": 1,
 "status": "error",
 "reason": str(e)
 })
 sys.exit(1)

Configure routing rules to trigger alerts on threshold breaches.

# alertmanager.yml snippet
route:
  receiver: 'pagerduty-critical'
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        alertname: CrawlArtifactSizeAnomaly
      receiver: 'slack-crawl-alerts'

Validate monitoring integration. Trigger a synthetic failure to verify alert routing. Confirm metric ingestion latency remains under 60 seconds. Validate dashboard refresh rates match the weekly cadence.

Roll back during alert storms. Disable the alerting rule to prevent pager fatigue. Suppress the metric ingestion endpoint temporarily. Restore the previous monitoring configuration from version control.

Common Mistakes

Avoid these configuration errors. They cause silent failures and wasted compute.

  • Relying on @weekly without explicit timezone declaration. Crawls execute during peak traffic hours due to server timezone mismatch. This triggers rate-limit blocks. Use explicit 0 2 * * 0 syntax with CRON_TZ=UTC or systemd OnCalendar= directives.
  • Missing stderr redirection in crontab. Silent failures go undetected. Logs route to root@localhost instead of centralized logging. Always append >> /var/log/crawl.log 2>&1 or pipe output to logger.
  • No concurrency lock mechanism. Overlapping executions exhaust memory. Duplicate jobs consume crawl budget. Wrap execution with flock -n /tmp/crawl.lock or implement PID file checks.
  • Unbounded timeout configuration. Crawlers hang indefinitely on JS-heavy endpoints. Subsequent weekly jobs block. Enforce timeout command wrappers with hard limits (e.g., 7200s).