10 min read

Setting Up a Cron Job for Weekly Site Crawls

Q: What if flock is not available on my system?

Use a PID-file guard instead: write $$ to /var/run/crawl.pid at startup and check its existence before execution. On Alpine Linux, install util-linux to get flock, or use a Python-based file-lock library if your crawler is Python-native.

Q: How do I change the crawl window timezone?

Set CRON_TZ=America/New_York (or your target zone) directly above the crontab line. For systemd timers, change OnCalendar=Sun *-*-* 02:00:00 America/New_York. Always prefer UTC for servers shared across regions to avoid daylight-saving drift.

Q: Why does my cron job succeed manually but fail when scheduled?

Cron strips the interactive shell environment, removing $PATH, $HOME, and any virtualenv activations. Test with env -i /bin/bash --noprofile --norc /opt/scripts/weekly_crawl.sh to reproduce the exact cron context. Use absolute paths for every binary and config file inside the wrapper.

Q: How do I alert on a missed crawl window?

Use a dead-man's switch: push a heartbeat metric to your monitoring stack at the end of each successful run. Configure an absence alert to fire if no heartbeat arrives within the expected window plus a 30-minute grace period. Prometheus Alertmanager, Datadog monitors, and PagerDuty all support this pattern.

Ad-hoc crawl execution produces inconsistent data snapshots, obscures regression windows, and makes crawl budget consumption impossible to track. This page solves the exact operational problem: turning a manually-triggered crawler into a deterministic weekly job that runs unattended, emits structured logs, archives artifacts, and alerts on failure. It belongs to the Integrating Custom Crawlers with CI/CD Pipelines workflow, which covers the broader pipeline context — environment variable injection, runner images, and downstream QA stages.

Environment Isolation & Dependency Declaration

Cron strips the interactive shell environment. $PATH, $HOME, and virtualenv activations are all absent. The first line of defence is a wrapper script that builds its own execution context from absolute paths, then explicitly activates the crawler's virtual environment before handing off to the binary.

#!/usr/bin/env bash
set -euo pipefail

# ── Absolute path declarations ────────────────────────────────────────────────
CRAWLER_BIN="/opt/crawler/bin/crawl_engine"
CONFIG_PATH="/etc/crawler/weekly.yaml"
VENV_PATH="/opt/crawler/venv"
LOG_DIR="/var/log/crawler"
LOCK_FILE="/var/run/crawl-weekly.lock"

# ── Minimal deterministic PATH (no shell rc files loaded by cron) ─────────────
export PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
export HOME="/home/crawler"

# ── Activate the pinned virtual environment ───────────────────────────────────
# shellcheck source=/opt/crawler/venv/bin/activate
source "${VENV_PATH}/bin/activate"

# ── Confirm correct Python resolves before execution ─────────────────────────
RESOLVED_PY="$(which python3)"
if [[ "${RESOLVED_PY}" != "${VENV_PATH}/bin/python3" ]]; then
  echo "ERR: unexpected python3 at ${RESOLVED_PY}" >&2
  exit 1
fi

exec "${CRAWLER_BIN}" --config "${CONFIG_PATH}" --log-dir "${LOG_DIR}"

Pin the crawler and all its dependencies in requirements.txt with exact versions (==), not ranges. Keep the lockfile in version control alongside the wrapper script so any environment drift is caught at review time.

Implementation

Cron expression with timezone guard and concurrency lock

Replace the ambiguous @weekly macro with an explicit 5-field expression. @weekly silently inherits the system timezone, which shifts the window by hours during daylight-saving transitions and can collide with peak traffic periods.

# /etc/cron.d/crawl-weekly
# ┌─ minute  ┌─ hour  ┌─ day-of-month  ┌─ month  ┌─ day-of-week (0=Sun)
CRON_TZ=UTC
0 2 * * 0 crawler /usr/bin/flock -n /var/run/crawl-weekly.lock /opt/scripts/weekly_crawl.sh >> /var/log/crawler/cron.log 2>&1

flock -n exits immediately (non-zero) if the lock is held, so a previous run that overran its window does not spawn a parallel job. Log this non-zero exit separately if you want visibility into skipped windows.

Systemd alternative — use this when the host already manages services with systemd, or when you need Persistent=true to catch up on missed windows after a reboot:

# /etc/systemd/system/crawl-weekly.timer
[Unit]
Description=Weekly Site Crawl — UTC Sunday 02:00

[Timer]
OnCalendar=Sun *-*-* 02:00:00 UTC
Persistent=true
RandomizedDelaySec=120

[Install]
WantedBy=timers.target

# /etc/systemd/system/crawl-weekly.service
[Unit]
Description=Weekly Site Crawl
After=network-online.target

[Service]
Type=oneshot
User=crawler
ExecStartPre=/usr/bin/flock -n /var/run/crawl-weekly.lock /bin/true
ExecStart=/opt/scripts/weekly_crawl.sh
TimeoutStartSec=7200
StandardOutput=journal
StandardError=journal

Execution wrapper with timeout, structured logs, and artifact pipeline

A hard timeout prevents a JS-heavy endpoint from hanging the job indefinitely and blocking the following week's window. The wrapper emits a JSON summary record on every exit — success or failure — so your monitoring stack can ingest it without parsing unstructured text.

When storing crawl artifacts in cloud storage, the dated OUTPUT_DIR below maps directly to the storage key prefix, giving you a one-to-one correspondence between job execution and stored artifact without extra naming logic.

#!/usr/bin/env bash
set -euo pipefail

TIMEOUT_SECONDS=7200
RUN_DATE="$(date -u +%Y%m%d)"
OUTPUT_DIR="/opt/crawler/artifacts/${RUN_DATE}"
LOG_FILE="/var/log/crawler/execution.log"
METRIC_ENDPOINT="${METRIC_ENDPOINT:-http://localhost:9091/metrics/job/crawl_weekly}"

mkdir -p "${OUTPUT_DIR}"

START_TS="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Run crawler with hard timeout; capture exit code without triggering set -e
set +e
timeout "${TIMEOUT_SECONDS}" /opt/crawler/bin/crawl_engine \
  --config /etc/crawler/weekly.yaml \
  --output-dir "${OUTPUT_DIR}" \
  --log-level json \
  2>&1 | tee -a "${LOG_FILE}"
EXIT_CODE=${PIPESTATUS[0]}
set -e

END_TS="$(date -u +%Y-%m-%dT%H:%M:%SZ)"

# Emit a structured summary record — ingested by Prometheus pushgateway / Datadog
printf '{"event":"crawl_complete","exit_code":%d,"run_date":"%s","start":"%s","end":"%s","output_dir":"%s"}\n' \
  "${EXIT_CODE}" "${RUN_DATE}" "${START_TS}" "${END_TS}" "${OUTPUT_DIR}" \
  >> "${LOG_FILE}"

# Push exit-code metric to pushgateway if available
curl -sf --max-time 5 \
  --data-binary "crawl_exit_code{job=\"crawl_weekly\",run_date=\"${RUN_DATE}\"} ${EXIT_CODE}" \
  "${METRIC_ENDPOINT}" || true   # metric push failure must not abort the job record

exit "${EXIT_CODE}"

Exit code 0 = success. Exit code 124 = the timeout command's signal — the crawl exceeded TIMEOUT_SECONDS. Any other non-zero code originates from the crawler binary itself.

Log rotation

Prevent disk exhaustion without truncating an active write stream:

# /etc/logrotate.d/crawl-weekly
/var/log/crawler/*.log {
    weekly
    rotate 12
    compress
    delaycompress
    missingok
    notifempty
    create 0640 crawler crawler
}

delaycompress defers gzip by one rotation cycle, so the file being written during the current week is never simultaneously compressed.

Verification & Smoke Test

Run these commands after deploying the cron entry or enabling the systemd timer to confirm the full pipeline executes correctly before the first scheduled window fires.

# 1. Simulate the stripped cron execution environment
env -i HOME=/home/crawler PATH=/usr/local/bin:/usr/bin:/bin \
  /bin/bash /opt/scripts/weekly_crawl.sh

# 2. Confirm the correct Python binary resolves inside the wrapper
env -i HOME=/home/crawler PATH=/usr/local/bin:/usr/bin:/bin \
  /bin/bash -c 'source /opt/crawler/venv/bin/activate && which python3'
# Expected: /opt/crawler/venv/bin/python3

# 3. Verify artifact output exists and is non-empty
ARTIFACT_DIR="/opt/crawler/artifacts/$(date -u +%Y%m%d)"
[[ -d "${ARTIFACT_DIR}" ]] && find "${ARTIFACT_DIR}" -name "*.json" -size +0c | head -5

# 4. Parse the cron expression independently
python3 -c "
from croniter import croniter
import datetime
c = croniter('0 2 * * 0', datetime.datetime.utcnow())
print('Next 3 runs (UTC):', [c.get_next(datetime.datetime) for _ in range(3)])
"

# 5. For systemd: preview the next fire times before enabling
systemd-analyze calendar 'Sun *-*-* 02:00:00 UTC'

# 6. Check the structured log record for the last run
tail -20 /var/log/crawler/execution.log | python3 -m json.tool

Successful output from step 3 is at least one .json file larger than zero bytes in ARTIFACT_DIR. The structured log record in step 6 should show "exit_code": 0.

Failure Modes

flock: /var/run/crawl-weekly.lock: Permission denied

The lock file was created by a previous user or process. Diagnose with ls -la /var/run/crawl-weekly.lock and lsof /var/run/crawl-weekly.lock. If no process holds it, remove it with rm /var/run/crawl-weekly.lock and verify the cron job runs as the crawler user. Fix the crontab user field if needed.

# Identify holder
lsof /var/run/crawl-weekly.lock

# Remove stale lock only after confirming no active holder
rm -f /var/run/crawl-weekly.lock

Exit code 124 — crawl timed out

The site returned unusually slow responses, or dynamic content in automated crawls caused the headless browser to wait indefinitely for JavaScript hydration. Check execution.log for the last URL processed before the timeout. Increase TIMEOUT_SECONDS as a short-term fix; adjust the crawler's per-page timeout and reduce concurrency as a durable solution.

# Find last crawled URL before timeout
grep '"url"' /var/log/crawler/execution.log | tail -5

Artifact directory empty despite exit code 0

The crawler exited cleanly but wrote no output — typically a misconfigured --output-dir flag or a permissions mismatch. Confirm the crawler user owns the artifacts directory and that the path passed to --output-dir matches the directory created by mkdir -p.

stat /opt/crawler/artifacts/$(date -u +%Y%m%d)
# Expected: Uid matches the crawler user

FAQ

What if flock is not available on my system?

Use a PID-file guard instead:

PIDFILE="/var/run/crawl-weekly.pid"
if [[ -f "${PIDFILE}" ]] && kill -0 "$(cat "${PIDFILE}")" 2>/dev/null; then
  echo "Crawl already running (PID $(cat "${PIDFILE}")). Exiting." >&2
  exit 0
fi
echo $$ > "${PIDFILE}"
trap 'rm -f "${PIDFILE}"' EXIT

On Alpine Linux, install util-linux to get flock: apk add util-linux.

How do I change the crawl window timezone?

In crontab, set CRON_TZ=America/New_York (or any tz database name) directly above the job line. For systemd timers, change the OnCalendar line: OnCalendar=Sun *-*-* 02:00:00 America/New_York. Prefer UTC for servers shared across regions — it avoids daylight-saving drift entirely and keeps artifact timestamps unambiguous when comparing runs across months.

Why does my cron job succeed manually but fail when scheduled?

Cron provides a minimal environment with no shell rc files loaded. The most common cause is a missing $PATH entry that prevents the virtual environment, Python, or a CLI tool from being found. Reproduce the exact context with:

env -i HOME=/home/crawler PATH=/usr/local/bin:/usr/bin:/bin \
  /bin/bash --noprofile --norc /opt/scripts/weekly_crawl.sh

If it fails here, add explicit export statements for every required variable inside the wrapper script.

How do I alert on a missed crawl window?

Implement a dead-man's switch: push a heartbeat metric to your monitoring stack as the final step of a successful run, then configure an absence alert to fire if no heartbeat arrives within the window plus a 30-minute grace period. In Prometheus Alertmanager this is an absent() rule; Datadog calls it a "no-data" monitor. PagerDuty supports heartbeat monitors natively via its Events API.

Integrating Custom Crawlers with CI/CD Pipelines — parent workflow covering runner images, environment variable injection, and pipeline stage structure
Managing Crawl Budget & Rate Limiting — set rate-limit controls before enabling weekly scheduling to avoid triggering bot-detection rules
Storing & Versioning Crawl Artifacts in Cloud Storage — upload, version, and diff the artifacts produced by this job
Handling Dynamic Content in Automated Crawls — tune headless browser wait conditions to prevent timeout exits on JS-heavy pages

Setting Up a Cron Job for Weekly Site Crawls #

Environment Isolation & Dependency Declaration #

Implementation #

Cron expression with timezone guard and concurrency lock #

Execution wrapper with timeout, structured logs, and artifact pipeline #

Log rotation #

Verification & Smoke Test #

Failure Modes #

FAQ #

Related #