A weekly server-health one-shot: a shell script that summarizes disk / memory / failed-units / pending updates and emails it

You SSH into your server and run the same five commands every Sunday morning: df -h, free -h, systemctl --failed, apt list --upgradable, lastb. Five minutes of looking around to confirm nothing's quietly broken. Multiply by N servers and it stops being a casual habit. The fix is a short shell script that runs weekly, gathers the answers to those same five questions, formats them into a single email, and drops it in your inbox first thing Monday morning.

Below is the actual script I run. It’s deliberately boring — no Prometheus, no Grafana, no Telegraf agent. Just the basic commands you’d run by hand, formatted into one email per server per week.

The script

#!/usr/bin/env bash
# /usr/local/sbin/weekly-health-report
# Cron: 30 7 * * 1   /usr/local/sbin/weekly-health-report
set -uo pipefail

REPORT=$(mktemp)
HOST=$(hostname -f)
DATE=$(date -u +"%Y-%m-%d %H:%M UTC")

{
echo "==== Weekly health: $HOST ===="
echo "$DATE"
echo

echo "--- Disk usage ---"
df -h --output=source,size,used,avail,pcent,target | grep -vE '^(tmpfs|devtmpfs|/run|udev)'
echo

echo "--- Memory ---"
free -h
echo

echo "--- CPU load ---"
uptime
echo

echo "--- Failed systemd units ---"
FAILED=$(systemctl --failed --no-legend --no-pager)
if [ -z "$FAILED" ]; then
  echo "(none)"
else
  echo "$FAILED"
fi
echo

echo "--- Pending package updates ---"
if command -v apt >/dev/null; then
  apt list --upgradable 2>/dev/null | tail -n +2 | head -30
  TOTAL=$(apt list --upgradable 2>/dev/null | tail -n +2 | wc -l)
  echo "($TOTAL total)"
elif command -v dnf >/dev/null; then
  dnf check-update --quiet 2>/dev/null
fi
echo

echo "--- Reboot required ---"
if [ -f /var/run/reboot-required ]; then
  echo "YES — see /var/run/reboot-required.pkgs"
  cat /var/run/reboot-required.pkgs 2>/dev/null | head
else
  echo "no"
fi
echo

echo "--- Last 5 successful logins ---"
last -F | head -5
echo

echo "--- Last 5 failed login attempts ---"
lastb -F 2>/dev/null | head -5 || echo "(/var/log/btmp not readable)"
echo

echo "--- Top 5 by RAM ---"
ps -eo rss,pid,user,comm --sort=-rss | head -6 \
  | awk 'NR==1 {printf "%-12s %-6s %-12s %s\n", $1, $2, $3, $4; next}
         {printf "%-12s %-6s %-12s %s\n", sprintf("%.0f MB", $1/1024), $2, $3, $4}'
echo

echo "--- Top 5 by CPU ---"
ps -eo pcpu,pid,user,comm --sort=-pcpu | head -6
echo

echo "--- Disks SMART status (if smartctl present) ---"
if command -v smartctl >/dev/null; then
  for d in /dev/sd? /dev/nvme?n?; do
    [ -e "$d" ] || continue
    s=$(smartctl -H "$d" 2>/dev/null | awk '/SMART overall-health|SMART Health/ {print $NF; exit}')
    printf "%-15s %s\n" "$d" "${s:-unknown}"
  done
else
  echo "smartctl not installed"
fi
echo

echo "==== end ===="
} > "$REPORT"

# Send via msmtp / mailx / a webhook
if command -v msmtp >/dev/null; then
  {
    echo "From: server@$HOST"
    echo "To: you@example.com"
    echo "Subject: [$HOST] Weekly health $DATE"
    echo "Content-Type: text/plain; charset=utf-8"
    echo
    cat "$REPORT"
  } | msmtp -t
else
  # Fallback: post to a webhook
  curl -sS -X POST 'https://your-webhook' \
       -H 'Content-Type: text/plain' \
       --data-binary @"$REPORT"
fi

rm -f "$REPORT"

Why these specific checks

  • Disk usage. The single most-common cause of “the server stopped working overnight” is a full disk. Better to see “94% used” on Monday morning than to find out at 3 AM Wednesday.
  • Memory. Useful for spotting a leak that’s slowly consuming swap.
  • Failed systemd units. A unit that crash-looped during the night is silent if you don’t ask.
  • Pending updates. Lets you see “37 packages to upgrade, including kernel” and plan a maintenance window.
  • Reboot required. Ubuntu’s /var/run/reboot-required flag gets set after kernel/libc updates. Easy to forget.
  • Logins. Both successful and failed. Failed logins climbing into the thousands is a cue to check your fail2ban/CrowdSec configuration.
  • Top processes. Calibrates your sense of what’s normal — if mysqld suddenly used 2x more RAM than last week, you’d notice.
  • SMART. Drives often signal failure for weeks before they actually die. The weekly check catches it.
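The checks above only surface raw numbers; turning the most dangerous one (disk usage) into an explicit warning is a small extension. A sketch, using a helper of my own naming (`check_usage`, not part of the script above) that reads `df --output=pcent,target` lines and flags anything at or above a threshold, so a caller could prefix the subject line with [WARN]:

```shell
#!/usr/bin/env bash
# check_usage: read "pcent target" lines on stdin (the shape produced
# by `df --output=pcent,target | tail -n +2`) and print a WARN line
# for every filesystem at or above the given threshold (default 90%).
check_usage() {
  local threshold=${1:-90} pcent target
  while read -r pcent target; do
    pcent=${pcent%\%}                   # "94%" -> "94"
    if [ "${pcent:-0}" -ge "$threshold" ]; then
      echo "WARN: $target at ${pcent}%"
    fi
  done
}

# Demo on canned input; on a real host you would pipe df into it:
#   df --output=pcent,target | tail -n +2 | check_usage 90
printf '42%% /\n94%% /var\n' | check_usage 90
```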

The mail-delivery layer

Don’t try to send via system cron MAILTO= — that path is broken on most modern servers (covered in a previous post). Two reliable paths:

  • msmtp + a real SMTP relay. Install msmtp-mta, configure ~/.msmtprc with credentials for SES / Mailgun / Brevo / Fastmail. Outbound on 587 with TLS. Authenticated, deliverable.
  • Webhook to a chat tool. Slack, Telegram (via bot), Discord (via webhook), Pushover, ntfy.sh. Curl POST the report as a code block. The advantage is unified visibility — all your servers’ weekly reports show up in one channel.

# ~/.msmtprc — minimal config for AWS SES
defaults
auth on
tls on
tls_starttls on

account ses
host email-smtp.us-east-1.amazonaws.com
port 587
from server@example.com
user AKIA...               # SES SMTP username
password BXX...            # SES SMTP password (NOT your AWS key)

account default : ses

Cron it for Monday morning

# /etc/crontab
30 7 * * 1 root /usr/local/sbin/weekly-health-report

07:30 UTC on Monday lands at 8:30 AM in London and 9:30 AM in Berlin during summer time (an hour earlier in winter), and 1:00 PM IST year-round. Pick the offset that puts the report on your phone right after breakfast. The report itself takes a few seconds to generate; the email arrives within the minute.
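If you prefer systemd timers to cron, an equivalent pair of units might look like this (unit names are my own; Persistent=true fires a missed run at the next boot if the machine was down at 07:30):

```ini
# /etc/systemd/system/weekly-health.service
[Unit]
Description=Weekly server health report

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/weekly-health-report

# /etc/systemd/system/weekly-health.timer
[Unit]
Description=Run weekly-health-report on Monday mornings

[Timer]
OnCalendar=Mon *-*-* 07:30:00 UTC
Persistent=true

[Install]
WantedBy=timers.target
```

Enable it with systemctl enable --now weekly-health.timer.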

Fleet-scale variant

If you have 5+ servers, getting 5 separate emails is noisy. Modify the script to POST the report to a central endpoint (a webhook, an S3 bucket, a Discord channel, an SQS queue), and read all the reports together on Monday morning. The relevant chunk:

# Replace the email block with:
curl -sS -X POST "https://discord.com/api/webhooks/.../..." \
     -H 'Content-Type: application/json' \
     -d "$(jq -n --arg c "\`\`\`$(cat "$REPORT")\`\`\`" '{content:$c}')"
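One caveat with the Discord route: message content is capped at 2000 characters, and a full report usually exceeds that. A sketch of a small truncation helper (the name `clip` is mine) to run the report through before wrapping it:

```shell
#!/usr/bin/env bash
# clip: pass through at most N bytes of stdin (default 1900, which
# leaves headroom under Discord's 2000-character content limit for
# the code-block backticks).
clip() {
  head -c "${1:-1900}"
}

# Demo on a short string; in the script you'd use: clip 1900 < "$REPORT"
printf 'weekly health report body' | clip 10
echo
```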

End-state: every Monday at 7:30 UTC, every server in your fleet self-reports in one Discord channel. You scroll through them with coffee. Anything alarming jumps out (red disk, failed unit, SMART fail). Five minutes of attention covers the entire week of "is everything OK?" Compare that to the alternative of logging in to each box, running the same commands, and hoping nothing was off: this is one cron entry plus a page of shell.
