2026-05-17 19:13:57 -04:00
8 changed files with 1401 additions and 10 deletions
@@ -0,0 +1,255 @@
+# Runbook — Recovering a wedged unit stuck in a call-home loop
+
+**Original incident:** BE9558H at `166.246.130.1:9034`, recovered 2026-05-17.
+
+A field unit with a stuck-triggered geophone (or any hardware fault causing
+constant event triggering) will record events back-to-back, and if Auto Call
+Home is set to "After Event Recorded" the device will dial the office BW
+ACH server in a tight loop. Combined with a Sierra Wireless modem in
+bidirectional serial-TCP mode, this makes the unit effectively unreachable
+from SFM — every TCP connection we open gets killed when the modem flips
+from server-mode to client-mode to honor the device's next AT dial command.
+
+This runbook describes how to break the loop and recover control.
+
+---
+
+## Symptoms
+
+- Terra-View / SFM `/device/info` either hangs or fails on `count_events()`.
+- `/device/monitor/status` and `/device/rescue` return 502 (protocol timeout
+  waiting for POLL response) or 503 (TCP connect refused).
+- ACEmanager serial log shows repeating
+  `Connect to IP: <BW_IP> Port: <BW_PORT>` → `Shutdown TCP socket` cycles
+  every 30-60 seconds.
+- Spam-mode endpoints (`/device/stop_monitoring_spam`) report many
+  `sent_ok` but the device's monitoring state never changes.
+- `slow_drip` reports `[Errno 32] Broken pipe` after sending the preamble
+  but before completing the drip loop.
+
+If you see *all* of these, the unit is in this exact failure mode.
+
+---
+
+## Quick reference — how to recover
+
+You need **ACEmanager access** to the unit's modem.
+
+### Step 1: stop the modem's mode-flipping
+
+In ACEmanager → **Serial → Port Configuration**:
+
+| Field | Set to |
+|---|---|
+| **Destination Address** | clear (blank) |
+| **Destination Port** | `0` |
+
+Click **Apply**. This removes the modem's auto-dial-out target. The device's
+AT dial commands now error back at the modem instead of triggering a
+mode-flip, so the modem stays in TCP-server mode permanently and our inbound
+TCP sessions stay alive.
+
+*(Optional belt-and-suspenders: also add the BW server's port to
+**Security → Port Filtering - Outbound** as a blocked port, with
+Outbound Port Filtering Mode = Blocked Ports.)*
+
+### Step 2: stop monitoring on the device (slow drip)
+
+From the SFM host:
+
+```bash
+/home/serversdown/seismo-relay/scripts/slow_drip.sh <DEVICE_IP> <PORT>
+```
+
+Defaults are 120s duration with a drip every 3s. Watch the response:
+
+- `duration_s ≈ 120` and `drips_sent ≈ 40` → session held the full duration ✓
+- `bytes_received > 0` → device is responding ✓ (this is the success signal)
+
+If `duration_s` is small or `send_error: "Broken pipe"`, Step 1 didn't take
+hold — re-check ACEmanager, may need to reboot the modem after Apply.
+
+### Step 3: confirm monitoring stopped
+
+```bash
+curl 'http://localhost:8200/device/monitor/status?host=<DEVICE_IP>&tcp_port=<PORT>&force=true'
+# expect: {"is_monitoring": false, ...}
+```
+
+### Step 4: disable ACH at the device level + erase corrupted events
+
+Either fire the rescue endpoint:
+
+```bash
+/home/serversdown/seismo-relay/scripts/rescue_device.sh <DEVICE_IP> <PORT>
+```
+
+Or do the two steps manually:
+
+```bash
+# Disable ACH in the device's compliance config
+curl -X POST 'http://localhost:8200/device/call_home?host=<DEVICE_IP>&tcp_port=<PORT>' \
+  -H 'Content-Type: application/json' \
+  -d '{"auto_call_home_enabled": false}'
+
+# Erase corrupted event chain
+curl -X POST 'http://localhost:8200/device/events/erase?host=<DEVICE_IP>&tcp_port=<PORT>'
+```
+
+You can also do this via the SFM standalone UI → **Call Home** tab → set
+`Enable Auto Call Home` to `Disabled` → **Write to Device**.
+
+### Step 5: restore modem config (housekeeping)
+
+Once the device-side ACH is disabled, restore the modem's Destination
+Address and Port to the original values (e.g. `50.197.32.92` / `12345`) in
+ACEmanager. The modem will resume normal bidirectional behavior, but the
+unit won't issue any dial commands until ACH is explicitly re-enabled on
+the device.
+
+### Step 6: do NOT re-enable ACH on this unit until the underlying hardware
+fault is repaired. If you do, the call-home loop starts again immediately
+and you'll be running this runbook a second time.
+
+---
+
+## Why this works — the failure mode explained
+
+The Sierra Wireless RV50/RV55 serial port operates in one of two TCP modes
+at any moment:
+
+- **Server mode** — listens on `Device Port` (e.g. 9034), bridges inbound
+  TCP to the device's serial port. This is what we need to interact with
+  the device.
+- **Client mode** — when the device sends an AT dial command on its serial
+  TX line, the modem opens an outbound TCP to `Destination Address:Port`
+  and bridges that to serial.
+
+A serial port in this configuration is **bidirectional**: the modem flips
+between server and client modes on demand. When the device's firmware is
+healthy and only dials occasionally, this works fine.
+
+When the unit is constantly triggering events and ACH is set to "After
+Event Recorded", the device sends an AT dial command every few seconds.
+Each one causes the modem to:
+
+1. Drop any active inbound TCP session
+2. Flip to client mode
+3. Attempt outbound TCP to `Destination Address:Port`
+4. Hang for up to a minute waiting for it to succeed/fail
+5. Drop back to server mode
+
+**During the entire hang, no inbound TCP can establish.** Even between
+hangs, the modem closes any existing inbound session before flipping. So
+any tool that needs more than a few seconds of held TCP (e.g. POLL +
+config read + write) gets repeatedly kicked off.
+
+Clearing `Destination Address` removes step 3-4 from the cycle: the modem
+has nowhere to dial, so it doesn't flip modes when it receives an AT dial
+command. The serial port effectively becomes server-only, and inbound TCP
+sessions can stay open as long as needed.
+
+**This is a modem-layer issue, not a device firmware issue.** The device
+is alive and responsive the whole time — confirmed in the BE9558H
+recovery by 990 bytes of S3 responses received over a 120s slow-drip
+session once the modem was no longer mode-flipping.
+
+---
+
+## Why simpler approaches don't work
+
+| Approach | Why it fails |
+|---|---|
+| Standard `/device/info` | Triggers `count_events()` 1E/1F walk, takes 90s+ and hits corrupted event chain in this scenario |
+| `/device/rescue` race loop | Gets 502 (protocol timeout) because the modem closes the TCP before the POLL handshake can complete |
+| `/device/stop_monitoring_blind` (single frame) | Even if the bytes leave the wire, the device's protocol parser ignores write commands without a preceding POLL handshake (early-version bug, now fixed by including POLL preamble in blind sends) |
+| `/device/stop_monitoring_spam` (sub-second cadence) | Each session is killed by the modem's mode-flip before the device can drain its UART RX buffer; high-rate spam also risks UART FIFO overrun on the device side |
+| Outbound port firewall block alone | Stops the outbound TCP from succeeding, but doesn't stop the modem from *trying* and mode-flipping. Reduces but doesn't eliminate the contention. |
+| Modem reboot | Temporary — as soon as the device starts triggering again, the loop resumes within seconds |
+
+The combination of `slow_drip` + cleared `Destination Address` works because:
+
+1. The modem stops mode-flipping → TCP session stays open for the full
+   drip duration
+2. Slow drip rate → device's UART RX FIFO never overflows even if
+   firmware is busy with event recording
+3. The drip is `SESSION_RESET + STOP_MONITORING` every 3s → many
+   independent chances for the parser to land one valid frame
+4. Once one Stop Monitoring is parsed, event recording halts → firmware
+   has CPU to spare → subsequent operations are trivially easy
+
+---
+
+## Tooling reference
+
+All endpoints live in `seismo-relay/sfm/server.py`. All scripts live in
+`seismo-relay/scripts/` and default to SFM direct (`http://localhost:8200`),
+overridable via `SFM_BASE_URL`.
+
+### Endpoints added during BE9558H recovery
+
+| Endpoint | Purpose |
+|---|---|
+| `GET /device/events/storage_range` | SUB 0x06 — first/last event keys, `is_empty` flag. ~2s, no event walk. |
+| `GET /device/events/index` | SUB 0x08 — lifetime event counter (does NOT decrement on erase). ~2s. |
+| `POST /device/events/erase` | Full erase sequence 0xA3 → 0x1C → 0x06 → 0xA2. |
+| `POST /device/rescue` | Disable ACH + erase in one TCP session. Short timeouts for race-loop usage. |
+| `POST /device/stop_monitoring_blind` | Fire-and-forget Stop with full POLL preamble (single attempt). |
+| `POST /device/stop_monitoring_spam` | Server-side tight retry loop, sub-second cadence, duration-bounded. |
+| `POST /device/stop_monitoring_slow_drip` | One held TCP session, slow trickle of stop frames. **The endpoint that saved BE9558H.** |
+
+Also changed: default protocol recv timeout dropped from 30s → 10s in
+`_build_client`. Added `connect_timeout` knob to same. Cleaned up
+unhandled-exception path in `/device/monitor/status` so it returns 502
+instead of 500 on protocol timeouts.
+
+### Scripts
+
+| Script | Purpose |
+|---|---|
+| `scripts/rescue_device.sh` | Race-loop wrapper around `/device/rescue` |
+| `scripts/blind_stop.sh` | Race-loop wrapper around `/device/stop_monitoring_blind` |
+| `scripts/spam_stop.sh` | Single-call burst hammer (`/device/stop_monitoring_spam`) |
+| `scripts/slow_drip.sh` | Single-call held-session drip (`/device/stop_monitoring_slow_drip`) |
+| `scripts/watch_unit.sh` | Passive periodic reachability check, logs to file |
+
+---
+
+## Incident log — BE9558H, 2026-05-16/17
+
+What was wrong: Long-axis geophone developed an offset, constantly above
+trigger threshold → constant event recording → after-event ACH set →
+modem dialing office BW server (`50.197.32.92:12345`) every 30-60s.
+Local event chain corrupted (`next_boundary 0x100EE exceeds uint16`).
+
+Diagnostic path:
+
+1. `/device/info` slow, choked on event walk
+2. Built lightweight probe endpoints (`storage_range`, `index`) — useful
+   but didn't reach the wedged unit
+3. Built `/device/rescue` with short timeouts — got 502 (POLL no response)
+4. Built `/device/stop_monitoring_blind` — first version was a false
+   positive (no POLL preamble); fixed by including
+   `SESSION_RESET+POLL_PROBE+SESSION_RESET+POLL_DATA` in the dump
+5. Verified blind stop works on bench unit
+6. Built `/device/stop_monitoring_spam` — 420 successful sends over
+   5 min, zero behavior change on field unit
+7. Inspected ACEmanager logs → saw outbound dial-out attempts every ~30s,
+   confirmed device was not fully locked up
+8. Added outbound port-12345 firewall block → outbound attempts now fail
+   instantly but contention persisted
+9. Built `/device/stop_monitoring_slow_drip` — session died at 3s with
+   broken pipe (modem closing on us)
+10. Looked at full ACEmanager Port Configuration → **found
+    `Destination Address: 50.197.32.92` configured**, realized every AT
+    dial command was triggering a modem mode-flip that killed our inbound
+11. Cleared Destination Address + Port → slow_drip held 120s, device
+    responded with 990 bytes, 39 stop commands acked
+12. Disabled ACH at device level via `/device/call_home`, erased events
+
+Final state: device IDLE, memory 958.1 / 960 KB free, ACH disabled at
+device level, modem destination cleared (to be restored after physical
+service).
+
+Total time from "i was wondering if its possible to" first attempt to
+recovery: ~7 hours of intermittent debugging across one evening.
@@ -0,0 +1,100 @@
+#!/usr/bin/env bash
+# Fire-and-forget Stop Monitoring loop — for wedged or constantly-triggering units.
+#
+# Hammers POST /device/stop_monitoring_blind in a tight loop.  The endpoint
+# opens TCP, dumps SESSION_RESET + a few copies of the SUB 0x97 frame, and
+# closes — without ever reading an S3 response.  Each TCP-won attempt is
+# ~50ms of wire activity instead of the multi-frame handshake the regular
+# rescue endpoint does, so windows that are too small for the full rescue
+# can still land a stop-monitoring command.
+#
+# Usage:
+#   ./blind_stop.sh <host> [tcp_port]
+#
+# Env:
+#   SFM_BASE_URL    Default: http://localhost:8200 (SFM direct).
+#                   Set to http://localhost:8001/api/sfm to route through
+#                   Terra-View's proxy.
+#   MAX_ATTEMPTS    Default: 600
+#   SLEEP_S         Default: 0  (no backoff — hammer it)
+#   MAX_TIME_S      Default: 15
+#   CONNECT_TIMEOUT Default: 5
+#   REPEAT          Frames per TCP session (default 3 — increases hit rate
+#                   if the device is busy reading its own buffer).
+#   STOP_ON_OK      Default: 1.  Set to 0 to keep hammering indefinitely
+#                   even after successful sends (every 503 means the device
+#                   is in *another* session, every 200 means our bytes got
+#                   through — but the device may not have processed them).
+
+set -u
+
+host="${1:-}"
+tcp_port="${2:-9034}"
+if [[ -z "$host" ]]; then
+  echo "usage: $0 <host> [tcp_port]" >&2
+  exit 2
+fi
+
+base="${SFM_BASE_URL:-http://localhost:8200}"
+max_attempts="${MAX_ATTEMPTS:-600}"
+sleep_s="${SLEEP_S:-0}"
+max_time_s="${MAX_TIME_S:-15}"
+connect_timeout="${CONNECT_TIMEOUT:-5}"
+repeat="${REPEAT:-3}"
+stop_on_ok="${STOP_ON_OK:-1}"
+
+url="${base}/device/stop_monitoring_blind?host=${host}&tcp_port=${tcp_port}&connect_timeout=${connect_timeout}&repeat=${repeat}"
+
+echo "blind_stop: target ${host}:${tcp_port}  connect_timeout=${connect_timeout}s  repeat=${repeat}"
+echo "blind_stop: POST ${url}"
+echo "blind_stop: up to ${max_attempts} attempts, ${sleep_s}s between, ${max_time_s}s per request"
+echo "blind_stop: stop_on_ok=${stop_on_ok}"
+echo
+
+ok_count=0
+busy_count=0
+err_count=0
+started=$(date +%s)
+
+for ((i=1; i<=max_attempts; i++)); do
+  printf "[%4d] %s  " "$i" "$(date +%H:%M:%S)"
+  http_code=$(curl -sS -o /tmp/blind_resp.$$ -w "%{http_code}" \
+    --max-time "$max_time_s" \
+    -X POST "$url" || echo "000")
+  body=$(cat /tmp/blind_resp.$$ 2>/dev/null || true)
+  rm -f /tmp/blind_resp.$$
+
+  case "$http_code" in
+    200|201)
+      ok_count=$((ok_count + 1))
+      echo "SENT  $body"
+      if [[ "$stop_on_ok" == "1" ]]; then
+        elapsed=$(( $(date +%s) - started ))
+        echo
+        echo "blind_stop: success after ${i} attempts (${elapsed}s).  ok=${ok_count} busy=${busy_count} err=${err_count}"
+        echo "blind_stop: NEXT — wait ~10s, then try the full rescue:"
+        echo "  /home/serversdown/seismo-relay/scripts/rescue_device.sh ${host} ${tcp_port}"
+        exit 0
+      fi
+      ;;
+    503)
+      busy_count=$((busy_count + 1))
+      echo "busy (503)"
+      ;;
+    000)
+      err_count=$((err_count + 1))
+      echo "curl error"
+      ;;
+    *)
+      err_count=$((err_count + 1))
+      echo "HTTP $http_code  $body" | head -c 400
+      echo
+      ;;
+  esac
+  [[ "$sleep_s" != "0" ]] && sleep "$sleep_s"
+done
+
+elapsed=$(( $(date +%s) - started ))
+echo
+echo "blind_stop: gave up after ${max_attempts} attempts (${elapsed}s).  ok=${ok_count} busy=${busy_count} err=${err_count}" >&2
+exit 1
@@ -0,0 +1,99 @@
+#!/usr/bin/env bash
+# Rescue an uncooperative MiniMate that's busy with another ACH session.
+#
+# Hammers POST /device/rescue in a tight loop with a short timeout.  When the
+# device is in an ACH session our SYN either gets refused or silently dropped
+# (5s connect timeout inside the endpoint) and we retry immediately.  When the
+# device is between sessions, our TCP wins, the endpoint disables Auto Call
+# Home and erases events inside the same session, then returns success.
+#
+# Usage:
+#   ./rescue_device.sh <host> [tcp_port] [--no-erase] [--no-disable-ach]
+#
+# Examples:
+#   ./rescue_device.sh 166.246.130.1 9034
+#   ./rescue_device.sh 166.246.130.1 9034 --no-erase     # just silence it
+#
+# Environment:
+#   SFM_BASE_URL    Defaults to http://localhost:8200 (SFM direct).
+#                   Set to http://localhost:8001/api/sfm to route through
+#                   Terra-View's proxy.  Direct mode avoids the proxy's
+#                   60s timeout, which matters for long-running endpoints.
+#   MAX_ATTEMPTS    Cap on retries (default 600 ≈ 30+ min).
+#   SLEEP_S         Backoff between attempts (default 1).
+#   MAX_TIME_S      Per-request timeout (default 60).
+#   CONNECT_TIMEOUT TCP connect timeout (default 5).
+#   RECV_TIMEOUT    Per-frame S3 recv timeout (default 5).  If POLL or any
+#                   subsequent frame doesn't respond within this window, the
+#                   rescue endpoint bails and this script retries.
+
+set -u
+
+host="${1:-}"
+tcp_port="${2:-9034}"
+shift 2 2>/dev/null || shift $# 2>/dev/null
+
+if [[ -z "$host" ]]; then
+  echo "usage: $0 <host> [tcp_port] [--no-erase] [--no-disable-ach]" >&2
+  exit 2
+fi
+
+disable_ach="true"
+erase="true"
+for arg in "$@"; do
+  case "$arg" in
+    --no-erase)        erase="false" ;;
+    --no-disable-ach)  disable_ach="false" ;;
+    *) echo "unknown flag: $arg" >&2; exit 2 ;;
+  esac
+done
+
+base="${SFM_BASE_URL:-http://localhost:8200}"
+max_attempts="${MAX_ATTEMPTS:-600}"
+sleep_s="${SLEEP_S:-1}"
+max_time_s="${MAX_TIME_S:-60}"
+connect_timeout="${CONNECT_TIMEOUT:-5}"
+recv_timeout="${RECV_TIMEOUT:-5}"
+
+url="${base}/device/rescue?host=${host}&tcp_port=${tcp_port}&disable_ach=${disable_ach}&erase=${erase}&connect_timeout=${connect_timeout}&recv_timeout=${recv_timeout}"
+
+echo "rescue: target ${host}:${tcp_port}  disable_ach=${disable_ach}  erase=${erase}"
+echo "rescue: connect_timeout=${connect_timeout}s  recv_timeout=${recv_timeout}s"
+echo "rescue: POST ${url}"
+echo "rescue: up to ${max_attempts} attempts, ${sleep_s}s between, ${max_time_s}s per request"
+echo
+
+started=$(date +%s)
+for ((i=1; i<=max_attempts; i++)); do
+  printf "[%3d] %s  " "$i" "$(date +%H:%M:%S)"
+  http_code=$(curl -sS -o /tmp/rescue_resp.$$ -w "%{http_code}" \
+    --max-time "$max_time_s" \
+    -X POST "$url" || echo "000")
+  body=$(cat /tmp/rescue_resp.$$ 2>/dev/null || true)
+  rm -f /tmp/rescue_resp.$$
+
+  case "$http_code" in
+    200|201)
+      elapsed=$(( $(date +%s) - started ))
+      echo "OK  (${elapsed}s total)"
+      echo "$body"
+      exit 0
+      ;;
+    503)
+      # Connection refused / timeout — device busy in another session.  Retry fast.
+      echo "busy (503)"
+      ;;
+    000)
+      echo "curl error (network)"
+      ;;
+    *)
+      echo "HTTP $http_code"
+      echo "  $body" | head -c 400
+      echo
+      ;;
+  esac
+  sleep "$sleep_s"
+done
+
+echo "rescue: gave up after ${max_attempts} attempts" >&2
+exit 1
@@ -0,0 +1,44 @@
+#!/usr/bin/env bash
+# Hold a single TCP session open and drip stop-monitoring frames at a slow
+# rate, so the device's UART RX FIFO has time to drain between sends.
+#
+# Use when high-rate spam isn't landing — typically because the device's
+# firmware is too busy to drain its serial buffer fast enough and bytes
+# are being lost to UART overrun.
+#
+# Usage:
+#   ./slow_drip.sh <host> [tcp_port] [duration_s]
+#
+# Env:
+#   DURATION         Default: 120 (seconds; arg 3 overrides). Clamped 1..600.
+#   INTERVAL         Seconds between drip sends (default 3).  Lower = more
+#                    aggressive, more risk of FIFO overrun.  Higher = safer
+#                    but fewer total drips per duration.
+#   CONNECT_TIMEOUT  Default: 5
+#   SFM_BASE_URL     Default: http://localhost:8200 (SFM direct).
+
+set -u
+
+host="${1:-}"
+tcp_port="${2:-9034}"
+duration="${3:-${DURATION:-120}}"
+if [[ -z "$host" ]]; then
+  echo "usage: $0 <host> [tcp_port] [duration_s]" >&2
+  exit 2
+fi
+
+base="${SFM_BASE_URL:-http://localhost:8200}"
+interval="${INTERVAL:-3}"
+connect_timeout="${CONNECT_TIMEOUT:-5}"
+
+url="${base}/device/stop_monitoring_slow_drip?host=${host}&tcp_port=${tcp_port}&duration_s=${duration}&interval_s=${interval}&connect_timeout=${connect_timeout}"
+
+echo "slow_drip: target ${host}:${tcp_port}  duration=${duration}s  interval=${interval}s  connect_timeout=${connect_timeout}s"
+echo "slow_drip: POST ${url}"
+echo
+
+# Give curl enough slack to wait out the duration plus a buffer
+max_time=$(awk -v d="$duration" 'BEGIN { printf "%d", d + 30 }')
+
+curl -sS --max-time "$max_time" -X POST "$url"
+echo
@@ -0,0 +1,48 @@
+#!/usr/bin/env bash
+# Hammer a device with blind stop-monitoring sessions as fast as possible.
+# Single HTTP call kicks off the burst inside SFM (no per-attempt HTTP
+# overhead).  Default: 10 seconds, ~500 ms per attempt = ~20 attempts/sec.
+#
+# Usage:
+#   ./spam_stop.sh <host> [tcp_port] [duration_s]
+#
+# Examples:
+#   ./spam_stop.sh 166.246.130.1                  # 10s burst
+#   ./spam_stop.sh 166.246.130.1 9034 30          # 30s burst
+#   DURATION=60 CONNECT_TIMEOUT=0.2 ./spam_stop.sh 166.246.130.1
+#
+# Env:
+#   SFM_BASE_URL     Default: http://localhost:8200 (SFM direct).
+#                    Set to http://localhost:8001/api/sfm to route through
+#                    Terra-View's proxy — but note the proxy has a 60s
+#                    timeout, so long bursts need direct mode.
+#   DURATION         Default: 10 (seconds; arg 3 overrides)
+#   CONNECT_TIMEOUT  Default: 0.5 (seconds)
+#   REPEAT           Default: 3   (stop frames per TCP session)
+
+set -u
+
+host="${1:-}"
+tcp_port="${2:-9034}"
+duration="${3:-${DURATION:-10}}"
+
+if [[ -z "$host" ]]; then
+  echo "usage: $0 <host> [tcp_port] [duration_s]" >&2
+  exit 2
+fi
+
+base="${SFM_BASE_URL:-http://localhost:8200}"
+connect_timeout="${CONNECT_TIMEOUT:-0.5}"
+repeat="${REPEAT:-3}"
+
+url="${base}/device/stop_monitoring_spam?host=${host}&tcp_port=${tcp_port}&duration_s=${duration}&connect_timeout=${connect_timeout}&repeat=${repeat}"
+
+echo "spam_stop: target ${host}:${tcp_port}  duration=${duration}s  connect_timeout=${connect_timeout}s  repeat=${repeat}"
+echo "spam_stop: POST ${url}"
+echo
+
+# Give curl enough slack to wait out the duration plus a buffer
+max_time=$(awk -v d="$duration" 'BEGIN { printf "%d", d + 10 }')
+
+curl -sS --max-time "$max_time" -X POST "$url"
+echo
@@ -0,0 +1,58 @@
+#!/usr/bin/env bash
+# Passive monitor for a misbehaving unit.  Every INTERVAL seconds, attempts
+# a single short TCP probe + storage_range read and logs the result.  Designed
+# to run unattended for hours/days and tell you when the unit comes back.
+#
+# Usage:
+#   ./watch_unit.sh <host> [tcp_port]
+#
+# Env:
+#   INTERVAL    Seconds between checks (default 300 = 5 min)
+#   LOG_FILE    Append results here (default /tmp/watch_<host>.log)
+#   SFM_BASE_URL  Default: http://localhost:8200
+
+set -u
+
+host="${1:-}"
+tcp_port="${2:-9034}"
+if [[ -z "$host" ]]; then
+  echo "usage: $0 <host> [tcp_port]" >&2
+  exit 2
+fi
+
+interval="${INTERVAL:-300}"
+log_file="${LOG_FILE:-/tmp/watch_${host}.log}"
+base="${SFM_BASE_URL:-http://localhost:8200}"
+
+url="${base}/device/events/storage_range?host=${host}&tcp_port=${tcp_port}"
+
+echo "watch_unit: target ${host}:${tcp_port}  interval=${interval}s  log=${log_file}"
+echo "watch_unit: Ctrl-C to stop"
+
+while true; do
+  ts=$(date '+%Y-%m-%d %H:%M:%S')
+  http_code=$(curl -sS -o /tmp/watch_resp.$$ -w "%{http_code}" \
+    --max-time 20 "$url" || echo "000")
+  body=$(cat /tmp/watch_resp.$$ 2>/dev/null || true)
+  rm -f /tmp/watch_resp.$$
+
+  case "$http_code" in
+    200|201)
+      # Strip the raw_hex for readability
+      summary=$(echo "$body" | sed 's/"raw_hex":"[^"]*",*//; s/,*$//' | head -c 200)
+      echo "$ts  REACHABLE  $summary" | tee -a "$log_file"
+      ;;
+    502|503)
+      err=$(echo "$body" | head -c 150)
+      echo "$ts  ERROR_$http_code  $err" | tee -a "$log_file"
+      ;;
+    000)
+      echo "$ts  CURL_FAIL  (network/timeout)" | tee -a "$log_file"
+      ;;
+    *)
+      echo "$ts  HTTP_$http_code  $(echo "$body" | head -c 150)" | tee -a "$log_file"
+      ;;
+  esac
+
+  sleep "$interval"
+done
@@ -491,6 +491,75 @@ class SeismoDb:
            )
        return cur.rowcount > 0

+    def delete_event(self, event_id: str) -> Optional[dict]:
+        """
+        Hard-delete one event row by id.  Returns the deleted row (so the
+        caller can clean up any on-disk files referenced by it) or None
+        if no row matched.
+        """
+        with self._connect() as conn:
+            row = conn.execute(
+                "SELECT * FROM events WHERE id = ?", (event_id,),
+            ).fetchone()
+            if row is None:
+                return None
+            conn.execute("DELETE FROM events WHERE id = ?", (event_id,))
+        return dict(row)
+
+    def delete_events_bulk(
+        self,
+        serial: Optional[str] = None,
+        from_dt: Optional[datetime.datetime] = None,
+        to_dt: Optional[datetime.datetime] = None,
+        false_trigger: Optional[bool] = None,
+        ids: Optional[list[str]] = None,
+    ) -> list[dict]:
+        """
+        Hard-delete events matching the given filters.  Returns the list
+        of deleted row dicts.  Refuses to delete with no filters at all
+        (would wipe the whole table) — raises ValueError.
+
+        Filter semantics match query_events: serial / from_dt / to_dt /
+        false_trigger combine with AND.  `ids` is an additional inclusion
+        list (event_id IN (...)); if supplied alongside other filters,
+        only rows matching all conditions are deleted.
+        """
+        clauses: list[str] = []
+        params:  list      = []
+
+        if serial:
+            clauses.append("serial = ?")
+            params.append(serial)
+        if from_dt:
+            clauses.append("timestamp >= ?")
+            params.append(from_dt.isoformat())
+        if to_dt:
+            clauses.append("timestamp <= ?")
+            params.append(to_dt.isoformat())
+        if false_trigger is not None:
+            clauses.append("false_trigger = ?")
+            params.append(1 if false_trigger else 0)
+        if ids:
+            placeholders = ",".join("?" * len(ids))
+            clauses.append(f"id IN ({placeholders})")
+            params.extend(ids)
+
+        if not clauses:
+            raise ValueError(
+                "delete_events_bulk refuses to delete with no filters "
+                "(would wipe the entire events table)"
+            )
+
+        where = "WHERE " + " AND ".join(clauses)
+
+        with self._connect() as conn:
+            rows = conn.execute(
+                f"SELECT * FROM events {where}", params,
+            ).fetchall()
+            if rows:
+                conn.execute(f"DELETE FROM events {where}", params)
+        return [dict(r) for r in rows]
+
    def update_event_review(self, event_id: str, review: dict) -> bool:
        """
        Sync derived index columns from a sidecar's `review` block.
@@ -36,6 +36,7 @@ from __future__ import annotations

 import datetime
 import logging
+import socket
 import sys
 import tempfile
 import threading
@@ -63,7 +64,9 @@ from minimateplus.protocol import ProtocolError
 from minimateplus.models import CallHomeConfig, ComplianceConfig, DeviceInfo, Event, PeakValues, ProjectInfo, Timestamp
 from minimateplus.transport import TcpTransport, DEFAULT_TCP_PORT
 from minimateplus.blastware_file import write_blastware_file, blastware_filename
-from minimateplus.client import _decode_a5_metadata_into, _decode_a5_waveform
+from minimateplus.client import _decode_a5_metadata_into, _decode_a5_waveform, _decode_event_count
+from minimateplus.framing import build_bw_write_frame, SESSION_RESET, POLL_PROBE, POLL_DATA
+from minimateplus.protocol import SUB_STOP_MONITORING
 from sfm import event_hdf5
 from sfm.cache import SFMCache, get_cache
 from sfm.database import SeismoDb
@@ -268,7 +271,8 @@ def _build_client(
    baud: int,
    host: Optional[str],
    tcp_port: int,
-    timeout: float = 30.0,
+    timeout: float = 10.0,
+    connect_timeout: Optional[float] = None,
 ) -> MiniMateClient:
    """
    Return a MiniMateClient configured for either serial or TCP transport.
@@ -276,11 +280,23 @@ def _build_client(
    TCP takes priority if *host* is supplied; otherwise *port* (serial) is used.
    Raises HTTPException(422) if neither is provided.

+    Default *timeout* is 10s — the device usually responds in well under a
+    second over cellular; 10s leaves comfortable headroom for retransmits
+    while still failing reasonably fast when a unit is wedged.
+
    Use timeout=120.0 (or higher) for endpoints that perform a full 5A waveform
    download — a 70-second event at 1024 sps takes 2-3 minutes to transfer over
    cellular and each individual recv must complete within the timeout window.
+
+    *connect_timeout* (TCP only) overrides the TcpTransport default (10s) for
+    the initial TCP SYN/handshake.  Use a small value (e.g. 5s) in rescue/race
+    scenarios where the device is busy in another session and you want to
+    fail fast and retry quickly.
    """
    if host:
+        if connect_timeout is not None:
+            transport = TcpTransport(host, port=tcp_port, connect_timeout=connect_timeout)
+        else:
            transport = TcpTransport(host, port=tcp_port)
        log.debug("TCP transport: %s:%d  timeout=%.0fs", host, tcp_port, timeout)
        return MiniMateClient(transport=transport, timeout=timeout)
@@ -1095,6 +1111,7 @@ def device_monitor_status(
            cached["_cached"] = True
            return cached

+    try:
        with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
            try:
                client.poll()
@@ -1102,6 +1119,15 @@ def device_monitor_status(
                log.warning("monitor status poll retry: %s", exc)
                client.poll()
            status = client.get_monitor_status()
+    except HTTPException:
+        raise
+    except ProtocolError as exc:
+        # Includes minimateplus.protocol.TimeoutError ("device unresponsive").
+        raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
+    except OSError as exc:
+        raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc

    result: dict = {"is_monitoring": status.is_monitoring}
    if status.battery_v is not None:
@@ -1117,6 +1143,529 @@ def device_monitor_status(
    return result


+@app.get("/device/events/storage_range")
+def device_events_storage_range(
+    port:     Optional[str] = Query(None,             description="Serial port (e.g. COM5)"),
+    baud:     int           = Query(38400,             description="Serial baud rate"),
+    host:     Optional[str] = Query(None,             description="TCP host — modem IP or ACH relay"),
+    tcp_port: int           = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
+) -> dict:
+    """
+    Read the device's event storage range (SUB 0x06) — first and last
+    stored event keys.  POLL handshake + one read; no connect(), no
+    config reads, no event walk.  Completes in ~2 seconds.
+
+    Useful for checking whether the device has any stored events
+    without invoking the slow count_events() 1E/1F chain.  Both keys =
+    `01110000` means the device is empty.
+    """
+    log.info("GET /device/events/storage_range  host=%s tcp_port=%s", host, tcp_port)
+    try:
+        def _do():
+            with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
+                try:
+                    client.poll()
+                except Exception as exc:
+                    log.warning("storage_range poll retry: %s", exc)
+                    client.poll()
+                proto = client._require_proto()
+                return proto.read_event_storage_range()
+        rng = _run_with_retry(_do, is_tcp=_is_tcp(host))
+    except HTTPException:
+        raise
+    except ProtocolError as exc:
+        raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
+    except OSError as exc:
+        raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
+
+    data = bytes(rng.data)
+    result: dict = {"raw_len": len(data), "raw_hex": data.hex()}
+    if len(data) >= 8:
+        first_key = data[-8:-4].hex()
+        last_key  = data[-4:].hex()
+        result["first_key"] = first_key
+        result["last_key"]  = last_key
+        result["is_empty"]  = (first_key == "01110000" and last_key == "01110000")
+    return result
+
+
+@app.get("/device/events/index")
+def device_events_index(
+    port:     Optional[str] = Query(None,             description="Serial port (e.g. COM5)"),
+    baud:     int           = Query(38400,             description="Serial baud rate"),
+    host:     Optional[str] = Query(None,             description="TCP host — modem IP or ACH relay"),
+    tcp_port: int           = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
+) -> dict:
+    """
+    Read the device's event index (SUB 0x08) — returns the lifetime
+    event counter at data[10:12] (uint16 BE).  POLL handshake + one
+    read; no connect(), no config reads, no event walk.  ~2 seconds.
+
+    Note: this is a LIFETIME counter (events ever recorded) — it does
+    NOT decrement when events are erased.  After an erase, the device
+    counter resets to 0 only on the next recorded event.  For "are
+    there stored events right now?" use /device/events/storage_range
+    instead.
+    """
+    log.info("GET /device/events/index  host=%s tcp_port=%s", host, tcp_port)
+    try:
+        def _do():
+            with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
+                try:
+                    client.poll()
+                except Exception as exc:
+                    log.warning("event_index poll retry: %s", exc)
+                    client.poll()
+                proto = client._require_proto()
+                return proto.read_event_index()
+        idx_raw = _run_with_retry(_do, is_tcp=_is_tcp(host))
+    except HTTPException:
+        raise
+    except ProtocolError as exc:
+        raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
+    except OSError as exc:
+        raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
+
+    raw = bytes(idx_raw)
+    result: dict = {"raw_len": len(raw), "raw_hex": raw.hex()}
+    try:
+        result["lifetime_count"] = _decode_event_count(raw)
+    except Exception as exc:
+        result["decode_error"] = str(exc)
+    return result
+
+
+@app.post("/device/events/erase")
+def device_events_erase(
+    port:     Optional[str] = Query(None,             description="Serial port (e.g. COM5)"),
+    baud:     int           = Query(38400,             description="Serial baud rate"),
+    host:     Optional[str] = Query(None,             description="TCP host — modem IP or ACH relay"),
+    tcp_port: int           = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
+) -> dict:
+    """
+    Erase ALL stored events from the device memory.
+
+    Sequence: SUB 0xA3 → 0x1C → 0x06 → 0xA2 (confirmed 2026-04-11).
+    After this call the unit's event memory is empty and event keys reset
+    to 0x01110000.  The device returns to its normal operating state
+    automatically — no restart-monitoring call is needed.
+
+    Note: this endpoint does NOT touch the ACH server's `ach_state.json`.
+    If a call-home subsequently lands on the ACH server, its post-erase
+    detection logic (max(device_keys) vs max_downloaded_key) handles the
+    key-counter rollback.
+    """
+    log.info("POST /device/events/erase  port=%s host=%s tcp_port=%s", port, host, tcp_port)
+
+    try:
+        def _do():
+            with _build_client(port, baud, host, tcp_port) as client:
+                client.connect()
+                client.delete_all_events()
+        _run_with_retry(_do, is_tcp=_is_tcp(host))
+    except HTTPException:
+        raise
+    except ProtocolError as exc:
+        raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
+    except OSError as exc:
+        raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
+    except Exception as exc:
+        raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
+
+    conn_key = SFMCache.make_conn_key(host, tcp_port, port, baud)
+    cleared = get_cache().clear_device(conn_key)
+    return {
+        "status": "ok",
+        "message": "Device event memory cleared",
+        "cache_cleared": cleared,
+    }
+
+
+@app.post("/device/stop_monitoring_blind")
+def device_stop_monitoring_blind(
+    host:            str   = Query(...,            description="TCP host — modem IP"),
+    tcp_port:        int   = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
+    connect_timeout: float = Query(5.0,             description="TCP connect timeout in seconds (default 5)"),
+    repeat:          int   = Query(3,               description="How many times to send the frame within one TCP session (default 3)"),
+) -> dict:
+    """
+    Fire-and-forget Stop Monitoring (SUB 0x97).  TCP-only.
+
+    Opens a TCP session, dumps the FULL handshake the device's protocol
+    state machine expects — `SESSION_RESET + POLL_PROBE + SESSION_RESET +
+    POLL_DATA` — and then N back-to-back copies of the stop-monitoring
+    frame.  Does NOT read any S3 response.  Succeeds as long as the bytes
+    left the socket.
+
+    The POLL handshake bytes are required: monitoring units ignore command
+    frames received without a preceding POLL exchange.  Sending the POLL
+    bytes "blind" (without reading the responses) still works because the
+    device processes inbound bytes in order regardless of whether we drain
+    its outbound buffer.
+
+    Idempotent: the device processes extra copies of SUB 0x97 the same as
+    one (already-stopped is a no-op).
+
+    Returns the number of bytes sent.  A 503 means the TCP connect failed
+    (device busy in another session — caller should retry).
+    """
+    log.info(
+        "POST /device/stop_monitoring_blind  host=%s tcp_port=%s connect_timeout=%.1fs repeat=%d",
+        host, tcp_port, connect_timeout, repeat,
+    )
+    if repeat < 1:
+        repeat = 1
+
+    frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
+    payload = (
+        SESSION_RESET + POLL_PROBE
+        + SESSION_RESET + POLL_DATA
+        + (frame * repeat)
+    )
+    t0 = time.monotonic()
+
+    transport = TcpTransport(host, port=tcp_port, connect_timeout=connect_timeout)
+    try:
+        transport.connect()
+    except OSError as exc:
+        raise HTTPException(status_code=503, detail=f"Connection error: {exc}") from exc
+
+    try:
+        transport.write(payload)
+    except OSError as exc:
+        transport.disconnect()
+        raise HTTPException(status_code=502, detail=f"Send error: {exc}") from exc
+    finally:
+        transport.disconnect()
+
+    return {
+        "status": "sent",
+        "bytes_sent": len(payload),
+        "frame_size": len(frame),
+        "repeat": repeat,
+        "elapsed_s": round(time.monotonic() - t0, 3),
+    }
+
+
+@app.post("/device/stop_monitoring_slow_drip")
+def device_stop_monitoring_slow_drip(
+    host:            str   = Query(...,            description="TCP host — modem IP"),
+    tcp_port:        int   = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
+    duration_s:      float = Query(120.0,           description="Total time to hold the session open (seconds)"),
+    interval_s:      float = Query(3.0,             description="Seconds between drip sends"),
+    connect_timeout: float = Query(5.0,             description="TCP connect timeout"),
+) -> dict:
+    """
+    Hold a single TCP session open for *duration_s* seconds and drip
+    stop-monitoring frames into the device at a slow rate so its UART
+    RX FIFO has time to drain between sends.
+
+    Sequence:
+      1. Open TCP session.
+      2. Send the wake preamble: SESSION_RESET + POLL_PROBE +
+         SESSION_RESET + POLL_DATA  (so the device's protocol parser
+         is primed for a write command).
+      3. Wait interval_s for the device to drain.
+      4. Drip-send (SESSION_RESET + stop_monitoring_frame) every
+         interval_s until duration_s elapses.
+      5. Opportunistically drain any bytes the device sends back (so
+         the modem's TX queue doesn't fill up).  Successful drains are
+         counted in `bytes_received` — non-zero strongly suggests the
+         device has started responding to us.
+      6. Close.
+
+    Designed for units whose firmware is too busy with event-recording
+    to keep up with high-rate spam.  Heavy spam overruns the UART FIFO;
+    slow drip stays under it.
+
+    Compared to spam mode: ~40× fewer bytes/sec on the wire, but each
+    byte has a much higher chance of actually being parsed.
+    """
+    log.info(
+        "POST /device/stop_monitoring_slow_drip  host=%s tcp_port=%s duration=%.1fs interval=%.2fs connect_timeout=%.1fs",
+        host, tcp_port, duration_s, interval_s, connect_timeout,
+    )
+    duration_s = max(1.0, min(duration_s, 600.0))    # clamp 1s..10min
+    interval_s = max(0.1, min(interval_s, 30.0))
+    connect_timeout = max(0.1, connect_timeout)
+
+    stop_frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
+    preamble = (
+        SESSION_RESET + POLL_PROBE
+        + SESSION_RESET + POLL_DATA
+    )
+
+    t0 = time.monotonic()
+    drips_sent = 0
+    bytes_sent = 0
+    bytes_received = 0
+
+    try:
+        sock = socket.create_connection((host, tcp_port), timeout=connect_timeout)
+    except OSError as exc:
+        raise HTTPException(status_code=503, detail=f"Connection error: {exc}") from exc
+
+    # Short read timeout so opportunistic drains don't block.
+    sock.settimeout(0.1)
+
+    try:
+        # Initial wake preamble.
+        try:
+            sock.sendall(preamble)
+            bytes_sent += len(preamble)
+        except OSError as exc:
+            raise HTTPException(status_code=502, detail=f"Preamble send failed: {exc}") from exc
+
+        # Initial settle.
+        time.sleep(interval_s)
+
+        # Try a non-blocking drain of any response to the wake.
+        try:
+            data = sock.recv(4096)
+            if data:
+                bytes_received += len(data)
+                log.info("slow_drip: device responded to wake preamble (%d bytes)", len(data))
+        except socket.timeout:
+            pass
+        except OSError:
+            pass
+
+        deadline = t0 + duration_s
+        drip = SESSION_RESET + stop_frame   # 2 + 21 = 23 bytes per drip
+        send_error: Optional[str] = None
+
+        while time.monotonic() < deadline:
+            try:
+                sock.sendall(drip)
+                bytes_sent += len(drip)
+                drips_sent += 1
+            except OSError as exc:
+                send_error = f"{exc}"
+                log.warning("slow_drip: send failed after %d drips: %s", drips_sent, exc)
+                break
+
+            # Drain any inbound bytes; ignore timeouts.
+            try:
+                data = sock.recv(4096)
+                if data:
+                    bytes_received += len(data)
+            except socket.timeout:
+                pass
+            except OSError:
+                pass
+
+            # Sleep the interval, but don't oversleep past the deadline.
+            remaining = deadline - time.monotonic()
+            if remaining <= 0:
+                break
+            time.sleep(min(interval_s, remaining))
+    finally:
+        try:
+            sock.shutdown(socket.SHUT_RDWR)
+        except OSError:
+            pass
+        try:
+            sock.close()
+        except OSError:
+            pass
+
+    elapsed = time.monotonic() - t0
+    log.info(
+        "slow_drip done — drips=%d bytes_sent=%d bytes_received=%d in %.1fs",
+        drips_sent, bytes_sent, bytes_received, elapsed,
+    )
+    return {
+        "status": "done",
+        "duration_s": round(elapsed, 2),
+        "drips_sent": drips_sent,
+        "bytes_sent": bytes_sent,
+        "bytes_received": bytes_received,
+        "preamble_bytes": len(preamble),
+        "drip_bytes": len(drip),
+        "send_error": send_error,
+    }
+
+
+@app.post("/device/stop_monitoring_spam")
+def device_stop_monitoring_spam(
+    host:            str   = Query(...,            description="TCP host — modem IP"),
+    tcp_port:        int   = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
+    duration_s:      float = Query(10.0,             description="How long to hammer the device for (seconds)"),
+    connect_timeout: float = Query(0.5,              description="Per-attempt TCP connect timeout (default 0.5s)"),
+    repeat:          int   = Query(3,                description="Stop frames per TCP session (default 3)"),
+) -> dict:
+    """
+    Hammer the device with blind stop-monitoring sessions as fast as
+    possible for `duration_s` seconds.  Each attempt: open TCP → write
+    SESSION_RESET + POLL handshake + STOP frames × repeat → close.  No
+    response is read.
+
+    Designed for units that are aggressively calling home — short
+    connect_timeout (default 500 ms) means every failed attempt loses
+    only that much time before retrying, so we can fit several attempts
+    per second even when the modem is mostly busy with its own outbound
+    sessions.
+
+    Single HTTP call kicks off the whole burst; counters are returned
+    when it finishes.  No streaming; if you want live progress, watch
+    SFM logs.
+    """
+    log.info(
+        "POST /device/stop_monitoring_spam  host=%s tcp_port=%s duration=%.1fs connect_timeout=%.3fs repeat=%d",
+        host, tcp_port, duration_s, connect_timeout, repeat,
+    )
+    if repeat < 1:
+        repeat = 1
+    duration_s = max(0.1, min(duration_s, 300.0))   # clamp 0.1s..5min
+    connect_timeout = max(0.05, connect_timeout)
+
+    frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
+    payload = (
+        SESSION_RESET + POLL_PROBE
+        + SESSION_RESET + POLL_DATA
+        + (frame * repeat)
+    )
+
+    t0 = time.monotonic()
+    deadline = t0 + duration_s
+    sent_ok = 0
+    connect_failed = 0
+    write_failed = 0
+
+    while time.monotonic() < deadline:
+        try:
+            sock = socket.create_connection((host, tcp_port), timeout=connect_timeout)
+        except OSError:
+            connect_failed += 1
+            continue
+        try:
+            sock.sendall(payload)
+            sent_ok += 1
+        except OSError:
+            write_failed += 1
+        finally:
+            try:
+                sock.shutdown(socket.SHUT_RDWR)
+            except OSError:
+                pass
+            try:
+                sock.close()
+            except OSError:
+                pass
+
+    elapsed = time.monotonic() - t0
+    total = sent_ok + connect_failed + write_failed
+    log.info(
+        "stop_monitoring_spam done — sent=%d connect_failed=%d write_failed=%d in %.2fs",
+        sent_ok, connect_failed, write_failed, elapsed,
+    )
+    return {
+        "status": "done",
+        "duration_s": round(elapsed, 2),
+        "sent_ok": sent_ok,
+        "connect_failed": connect_failed,
+        "write_failed": write_failed,
+        "total_attempts": total,
+        "rate_attempts_per_s": round(total / elapsed, 1) if elapsed > 0 else 0,
+        "payload_bytes": len(payload),
+    }
+
+
+@app.post("/device/rescue")
+def device_rescue(
+    port:            Optional[str] = Query(None,             description="Serial port (e.g. COM5)"),
+    baud:            int           = Query(38400,             description="Serial baud rate"),
+    host:            Optional[str] = Query(None,             description="TCP host — modem IP or ACH relay"),
+    tcp_port:        int           = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
+    connect_timeout: float         = Query(5.0,               description="TCP connect timeout in seconds (default 5)"),
+    recv_timeout:    float         = Query(5.0,               description="Per-frame S3 recv timeout in seconds (default 5)"),
+    disable_ach:     bool          = Query(True,              description="Disable Auto Call Home on the device before erasing"),
+    erase:           bool          = Query(True,              description="Erase all stored events after disabling ACH"),
+) -> dict:
+    """
+    Rescue an uncooperative unit by squeezing all maintenance work into a
+    single TCP session.
+
+    Designed for devices that are actively calling home to a separate ACH
+    server (BW or otherwise).  While we hold this TCP session open the
+    modem cannot accept an inbound ACH call, so the order matters:
+
+      1. Short-timeout TCP connect (fails fast if the device is busy in
+         another session — the caller should retry in a tight loop).
+      2. POLL handshake.
+      3. (optional) Write call_home config with auto_call_home_enabled=false
+         so the device stops calling out even after we drop the session.
+      4. (optional) Erase all stored events (0xA3 → 0x1C → 0x06 → 0xA2).
+      5. Close the TCP session.
+
+    Both `disable_ach` and `erase` default to true.  Pass `?erase=false` if
+    you only want to silence the unit without wiping its events.
+
+    Caller pattern (bash):
+
+        until curl -sS --max-time 30 -X POST \\
+          "http://localhost:8001/api/sfm/device/rescue?host=$IP&tcp_port=$P"; do
+            sleep 1
+        done
+    """
+    log.info(
+        "POST /device/rescue  host=%s tcp_port=%s connect_timeout=%.1fs recv_timeout=%.1fs disable_ach=%s erase=%s",
+        host, tcp_port, connect_timeout, recv_timeout, disable_ach, erase,
+    )
+
+    steps: list[dict] = []
+    t0 = time.monotonic()
+
+    try:
+        with _build_client(
+            port, baud, host, tcp_port,
+            timeout=recv_timeout,
+            connect_timeout=connect_timeout,
+        ) as client:
+            steps.append({"step": "tcp_connect", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
+
+            try:
+                client.poll()
+            except Exception as exc:
+                log.warning("rescue: poll retry: %s", exc)
+                client.poll()
+            steps.append({"step": "poll", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
+
+            if disable_ach:
+                client.set_call_home_config(auto_call_home_enabled=False)
+                steps.append({"step": "disable_ach", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
+
+            if erase:
+                client.delete_all_events()
+                steps.append({"step": "erase", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
+
+    except ProtocolError as exc:
+        steps.append({"step": "error", "ok": False, "detail": f"protocol: {exc}"})
+        raise HTTPException(status_code=502, detail={"message": f"Protocol error: {exc}", "steps": steps}) from exc
+    except OSError as exc:
+        steps.append({"step": "error", "ok": False, "detail": f"socket: {exc}"})
+        # Connection refused / timed out → device busy in another session.  Caller should retry.
+        raise HTTPException(status_code=503, detail={"message": f"Connection error: {exc}", "steps": steps}) from exc
+    except Exception as exc:
+        steps.append({"step": "error", "ok": False, "detail": str(exc)})
+        raise HTTPException(status_code=500, detail={"message": f"Device error: {exc}", "steps": steps}) from exc
+
+    conn_key = SFMCache.make_conn_key(host, tcp_port, port, baud)
+    cleared = get_cache().clear_device(conn_key)
+    return {
+        "status": "ok",
+        "elapsed_s": round(time.monotonic() - t0, 2),
+        "disable_ach": disable_ach,
+        "erase": erase,
+        "steps": steps,
+        "cache_cleared": cleared,
+    }
+
+
@app.post("/device/monitor/start")
 def device_monitor_start(
    port:     Optional[str] = Query(None,             description="Serial port (e.g. COM5)"),
@@ -1403,6 +1952,175 @@ def db_set_false_trigger(
    return {"status": "ok", "event_id": event_id, "false_trigger": value}


+def _cleanup_event_files(row: dict) -> dict:
+    """
+    Best-effort cleanup of on-disk waveform / sidecar / pickle / hdf5 files
+    associated with a deleted event row.  Returns a dict of {kind: bool} for
+    what was actually removed (true) vs not found / failed (false).
+    """
+    serial   = row.get("serial")
+    bw_name  = row.get("blastware_filename")
+    a5_name  = row.get("a5_pickle_filename")
+    sc_name  = row.get("sidecar_filename")
+    removed: dict = {}
+    if not serial:
+        return removed
+    store = _get_store()
+    # blastware_filename is the "base" — other files derive their paths from it
+    # via WaveformStore helpers.  Sidecar and a5 may also be stored under their
+    # own column values if they ever diverged historically.
+    base_name = bw_name or a5_name or sc_name
+    if base_name:
+        bw_path, a5_path = store.paths_for(serial, base_name)
+        sc_path = store.sidecar_path_for(serial, base_name)
+        h5_path = store.hdf5_path_for(serial, base_name)
+        for kind, p in [("blastware", bw_path), ("a5_pickle", a5_path),
+                        ("sidecar", sc_path), ("hdf5", h5_path)]:
+            try:
+                if p.exists():
+                    p.unlink()
+                    removed[kind] = True
+            except OSError as exc:
+                log.warning("file cleanup failed for %s (%s): %s", p, kind, exc)
+                removed[kind] = False
+    return removed
+
+
+@app.delete("/db/events/{event_id}")
+def db_delete_event(event_id: str) -> dict:
+    """
+    Hard-delete a single event from the SFM events table and remove any
+    associated on-disk waveform/sidecar/pickle/hdf5 files.
+
+    Returns 404 if the event_id is not found.
+    """
+    log.info("DELETE /db/events/%s", event_id)
+    deleted = _get_db().delete_event(event_id)
+    if deleted is None:
+        raise HTTPException(status_code=404, detail=f"Event {event_id} not found")
+    files_removed = _cleanup_event_files(deleted)
+    return {
+        "status": "ok",
+        "event_id": event_id,
+        "files_removed": files_removed,
+    }
+
+
+class BulkDeleteBody(BaseModel):
+    """Body for POST /db/events/delete_bulk."""
+    serial:        Optional[str]       = None
+    from_dt:       Optional[str]       = None     # ISO-8601
+    to_dt:         Optional[str]       = None     # ISO-8601
+    false_trigger: Optional[bool]      = None
+    ids:           Optional[list[str]] = None
+    confirm:       bool                = False
+    # Safety: when no `ids` are supplied, require this many max rows to
+    # actually be deleted; if the matched count exceeds it, the endpoint
+    # returns a dry-run-style summary instead.  Pass None to disable.
+    max_rows:      Optional[int]       = 10000
+
+
+@app.post("/db/events/delete_bulk")
+def db_delete_events_bulk(body: BulkDeleteBody) -> dict:
+    """
+    Hard-delete multiple events at once, by filter and/or by id list.
+
+    Filters (`serial`, `from_dt`, `to_dt`, `false_trigger`) combine with AND,
+    matching the same semantics as `GET /db/events`.  `ids` is an additional
+    inclusion list.  At least one filter or non-empty `ids` MUST be supplied
+    — refusing to wipe the whole table.
+
+    Safety knobs:
+      - `confirm` MUST be `true` to actually delete.  When false (default),
+        returns the match count without deleting (dry-run).
+      - `max_rows` (default 10,000) caps how many rows can be deleted in one
+        call by-filter; if the match count exceeds it, the endpoint returns
+        a count summary without deleting.  Ignored when only `ids` is used.
+
+    Returns:
+      {
+        "status":           "ok" | "dry_run" | "too_many",
+        "matched":          <int>,
+        "deleted":          <int>,         # 0 unless status == "ok"
+        "files_removed":    <int>,         # total file unlink successes
+        "sample_serials":   [...],         # up to 5 distinct serials touched
+      }
+    """
+    log.info(
+        "POST /db/events/delete_bulk  serial=%s from=%s to=%s ft=%s ids=%d confirm=%s max=%s",
+        body.serial, body.from_dt, body.to_dt, body.false_trigger,
+        len(body.ids or []), body.confirm, body.max_rows,
+    )
+
+    from_parsed = datetime.datetime.fromisoformat(body.from_dt) if body.from_dt else None
+    to_parsed   = datetime.datetime.fromisoformat(body.to_dt)   if body.to_dt   else None
+
+    db = _get_db()
+
+    # Dry-run path: count matches without deleting.
+    rows = db.query_events(
+        serial=body.serial,
+        from_dt=from_parsed,
+        to_dt=to_parsed,
+        false_trigger=body.false_trigger,
+        limit=1_000_000,    # we want a true count, not a page
+        offset=0,
+    )
+    if body.ids:
+        id_set = set(body.ids)
+        rows = [r for r in rows if r["id"] in id_set]
+    matched = len(rows)
+    sample_serials = sorted({r.get("serial") for r in rows[:50] if r.get("serial")})[:5]
+
+    if not body.confirm:
+        return {
+            "status": "dry_run",
+            "matched": matched,
+            "deleted": 0,
+            "files_removed": 0,
+            "sample_serials": sample_serials,
+            "hint": "Set confirm=true in the request body to actually delete.",
+        }
+
+    if body.max_rows is not None and not body.ids and matched > body.max_rows:
+        return {
+            "status": "too_many",
+            "matched": matched,
+            "deleted": 0,
+            "files_removed": 0,
+            "sample_serials": sample_serials,
+            "hint": (
+                f"Matched {matched} > max_rows={body.max_rows}.  Either raise "
+                f"max_rows in the body, narrow the filter, or supply an "
+                f"explicit `ids` list."
+            ),
+        }
+
+    try:
+        deleted_rows = db.delete_events_bulk(
+            serial=body.serial,
+            from_dt=from_parsed,
+            to_dt=to_parsed,
+            false_trigger=body.false_trigger,
+            ids=body.ids,
+        )
+    except ValueError as exc:
+        raise HTTPException(status_code=422, detail=str(exc)) from exc
+
+    files_removed = 0
+    for row in deleted_rows:
+        result = _cleanup_event_files(row)
+        files_removed += sum(1 for ok in result.values() if ok)
+
+    return {
+        "status": "ok",
+        "matched": matched,
+        "deleted": len(deleted_rows),
+        "files_removed": files_removed,
+        "sample_serials": sample_serials,
+    }
+
+
 # ── /db/events/{id} — waveform file accessors ─────────────────────────────────
 #
 # These endpoints serve files from the persistent WaveformStore, so a Blastware