merge: update to 0.17.0 #21

Merged
serversdown merged 5 commits from ach-report-ingestion into main 2026-05-17 19:13:57 -04:00
8 changed files with 1401 additions and 10 deletions
Showing only changes of commit 1fff8179d6
+255
@@ -0,0 +1,255 @@
# Runbook — Recovering a wedged unit stuck in a call-home loop
**Original incident:** BE9558H at `166.246.130.1:9034`, recovered 2026-05-17.
A field unit with a stuck-triggered geophone (or any hardware fault causing
constant event triggering) will record events back-to-back, and if Auto Call
Home is set to "After Event Recorded" the device will dial the office BW
ACH server in a tight loop. Combined with a Sierra Wireless modem in
bidirectional serial-TCP mode, this makes the unit effectively unreachable
from SFM — every TCP connection we open gets killed when the modem flips
from server-mode to client-mode to honor the device's next AT dial command.
This runbook describes how to break the loop and recover control.
---
## Symptoms
- Terra-View / SFM `/device/info` either hangs or fails on `count_events()`.
- `/device/monitor/status` and `/device/rescue` return 502 (protocol timeout
waiting for POLL response) or 503 (TCP connect refused).
- ACEmanager serial log shows repeating
`Connect to IP: <BW_IP> Port: <BW_PORT>` → `Shutdown TCP socket` cycles
every 30-60 seconds.
- Spam-mode endpoints (`/device/stop_monitoring_spam`) report many
`sent_ok` but the device's monitoring state never changes.
- `slow_drip` reports `[Errno 32] Broken pipe` after sending the preamble
but before completing the drip loop.
If you see *all* of these, the unit is in this exact failure mode.
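For quick triage, the two status codes in the symptom list can be mapped to their causes. This is a minimal sketch: `classify_status` is a hypothetical helper, not part of SFM, and the cause strings are copied from the list above.

```python
def classify_status(http_code: int) -> str:
    """Map an SFM HTTP status to the cause named in the symptom list."""
    causes = {
        502: "protocol timeout waiting for POLL response",
        503: "TCP connect refused",
    }
    # Anything else is not one of the symptoms described in this runbook.
    return causes.get(http_code, "not a symptom of this failure mode")


for code in (502, 503, 200):
    print(code, "->", classify_status(code))
```

A unit that alternates between 502 and 503 on repeated calls is consistent with the modem flipping modes between attempts.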
---
## Quick reference — how to recover
You need **ACEmanager access** to the unit's modem.
### Step 1: stop the modem's mode-flipping
In ACEmanager → **Serial → Port Configuration**:
| Field | Set to |
|---|---|
| **Destination Address** | clear (blank) |
| **Destination Port** | `0` |
Click **Apply**. This removes the modem's auto-dial-out target. The device's
AT dial commands now error back at the modem instead of triggering a
mode-flip, so the modem stays in TCP-server mode permanently and our inbound
TCP sessions stay alive.
*(Optional belt-and-suspenders: also add the BW server's port to
**Security → Port Filtering - Outbound** as a blocked port, with
Outbound Port Filtering Mode = Blocked Ports.)*
### Step 2: stop monitoring on the device (slow drip)
From the SFM host:
```bash
/home/serversdown/seismo-relay/scripts/slow_drip.sh <DEVICE_IP> <PORT>
```
Defaults are 120s duration with a drip every 3s. Watch the response:
- `duration_s ≈ 120` and `drips_sent ≈ 40` → session held the full duration ✓
- `bytes_received > 0` → device is responding ✓ (this is the success signal)
If `duration_s` is small or the response shows `send_error: "Broken pipe"`,
Step 1 didn't take hold. Re-check ACEmanager; the modem may need a reboot
after Apply.
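The success criteria above can be written down as a small check. This is a sketch only: `drip_verdict` is a hypothetical helper, the 90% tolerance is an assumption, and the field names (`duration_s`, `drips_sent`, `bytes_received`, `send_error`) are taken from the response fields mentioned in this step.

```python
def drip_verdict(response: dict, duration_s: float = 120, interval_s: float = 3) -> str:
    """Classify a slow_drip response using the Step 2 criteria."""
    expected_drips = duration_s / interval_s  # 120 / 3 = 40 with the defaults
    if response.get("send_error"):
        return "modem still flipping: re-check Step 1"
    # Assumed tolerance: the session counts as held at 90% of the target.
    held = (
        response.get("duration_s", 0) >= 0.9 * duration_s
        and response.get("drips_sent", 0) >= 0.9 * expected_drips
    )
    if not held:
        return "session ended early: re-check Step 1"
    if response.get("bytes_received", 0) > 0:
        return "success: device is responding"
    return "session held but device silent"


# Illustrative response, not captured output.
print(drip_verdict({"duration_s": 120.1, "drips_sent": 40, "bytes_received": 990}))
```

The order of the checks matters: a `send_error` points at the modem even if some bytes came back before the session died.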
### Step 3: confirm monitoring stopped
```bash
curl 'http://localhost:8200/device/monitor/status?host=<DEVICE_IP>&tcp_port=<PORT>&force=true'
# expect: {"is_monitoring": false, ...}
```
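If the first status call still fails, the check can be repeated a few times. The sketch below builds the same URL as the `curl` call and polls until `is_monitoring` is `false`. `status_url` and `wait_until_stopped` are hypothetical helpers, and the HTTP call is injected as `fetch` so the example runs without a device.

```python
import json
import time
from typing import Callable
from urllib.parse import urlencode


def status_url(base: str, host: str, tcp_port: int) -> str:
    """Build the /device/monitor/status URL with force=true."""
    query = urlencode({"host": host, "tcp_port": tcp_port, "force": "true"})
    return f"{base}/device/monitor/status?{query}"


def wait_until_stopped(fetch: Callable[[str], str], url: str,
                       attempts: int = 5, pause_s: float = 0.0) -> bool:
    """Return True once a response reports is_monitoring == false."""
    for attempt in range(attempts):
        try:
            body = json.loads(fetch(url))
            if isinstance(body, dict) and body.get("is_monitoring") is False:
                return True
        except (OSError, ValueError):
            pass  # connection/HTTP error or non-JSON body: try again
        if attempt < attempts - 1:
            time.sleep(pause_s)
    return False


print(status_url("http://localhost:8200", "166.246.130.1", 9034))
```

With the standard library, `fetch` could be `lambda u: urllib.request.urlopen(u, timeout=20).read().decode()`; a 502 or 503 then surfaces as an `OSError` subclass and is retried.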
### Step 4: disable ACH at the device level + erase corrupted events
Either fire the rescue endpoint:
```bash
/home/serversdown/seismo-relay/scripts/rescue_device.sh <DEVICE_IP> <PORT>
```
Or do the two steps manually:
```bash
# Disable ACH in the device's compliance config
curl -X POST 'http://localhost:8200/device/call_home?host=<DEVICE_IP>&tcp_port=<PORT>' \
-H 'Content-Type: application/json' \
-d '{"auto_call_home_enabled": false}'
# Erase corrupted event chain
curl -X POST 'http://localhost:8200/device/events/erase?host=<DEVICE_IP>&tcp_port=<PORT>'
```
You can also do this via the SFM standalone UI → **Call Home** tab → set
`Enable Auto Call Home` to `Disabled` → **Write to Device**.
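For scripted recoveries, the two manual `curl` calls can be expressed with Python's standard library. This is a sketch, not part of the repo: `build_step4_requests` is a hypothetical name, and the endpoints and JSON body are the ones shown in the `curl` commands above.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_step4_requests(base: str, host: str, tcp_port: int) -> list:
    """Build (but do not send) the disable-ACH and erase requests."""
    query = urlencode({"host": host, "tcp_port": tcp_port})
    disable_ach = urllib.request.Request(
        f"{base}/device/call_home?{query}",
        data=json.dumps({"auto_call_home_enabled": False}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    erase = urllib.request.Request(
        f"{base}/device/events/erase?{query}", method="POST"
    )
    return [disable_ach, erase]  # order matters: silence ACH before erasing


for req in build_step4_requests("http://localhost:8200", "166.246.130.1", 9034):
    print(req.get_method(), req.full_url)
```

To send them, pass each request to `urllib.request.urlopen(req, timeout=60)` in order; the 60s timeout is an assumption mirroring the per-request timeout used by `rescue_device.sh`.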
### Step 5: restore modem config (housekeeping)
Once the device-side ACH is disabled, restore the modem's Destination
Address and Port to the original values (e.g. `50.197.32.92` / `12345`) in
ACEmanager. The modem will resume normal bidirectional behavior, but the
unit won't issue any dial commands until ACH is explicitly re-enabled on
the device.
### Step 6: do NOT re-enable ACH until the hardware fault is repaired
Do not re-enable ACH on this unit until the underlying hardware fault is
repaired. If you do, the call-home loop starts again immediately and you'll
be running this runbook a second time.
---
## Why this works — the failure mode explained
The Sierra Wireless RV50/RV55 serial port operates in one of two TCP modes
at any moment:
- **Server mode** — listens on `Device Port` (e.g. 9034), bridges inbound
TCP to the device's serial port. This is what we need to interact with
the device.
- **Client mode** — when the device sends an AT dial command on its serial
TX line, the modem opens an outbound TCP to `Destination Address:Port`
and bridges that to serial.
A serial port in this configuration is **bidirectional**: the modem flips
between server and client modes on demand. When the device's firmware is
healthy and only dials occasionally, this works fine.
When the unit is constantly triggering events and ACH is set to "After
Event Recorded", the device sends an AT dial command every few seconds.
Each one causes the modem to:
1. Drop any active inbound TCP session
2. Flip to client mode
3. Attempt outbound TCP to `Destination Address:Port`
4. Hang for up to a minute waiting for it to succeed/fail
5. Drop back to server mode
**During the entire hang, no inbound TCP can establish.** Even between
hangs, the modem closes any existing inbound session before flipping. So
any tool that needs more than a few seconds of held TCP (e.g. POLL +
config read + write) gets repeatedly kicked off.
Clearing `Destination Address` removes steps 3-4 from the cycle: the modem
has nowhere to dial, so it doesn't flip modes when it receives an AT dial
command. The serial port effectively becomes server-only, and inbound TCP
sessions can stay open as long as needed.
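The cycle above can be captured in a toy model to see why a held session cannot survive while a destination is configured. This is a deliberately simplified sketch, not modem firmware behavior: `ModemModel`, its fields, and the dial times are all hypothetical, and the model ignores the second and later flips.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class ModemModel:
    destination_set: bool  # True while Destination Address is configured
    hang_s: float = 60.0   # assumed worst case; the text says "up to a minute"

    def inbound_lifetime(self, start_s: float, wanted_s: float,
                         dial_times_s: Iterable[float]) -> float:
        """Seconds an inbound session opened at start_s actually survives."""
        if not self.destination_set:
            return wanted_s  # nowhere to dial, so no mode flip
        for dial in sorted(dial_times_s):
            if dial <= start_s < dial + self.hang_s:
                return 0.0  # modem is in client mode; connect fails
            if start_s <= dial < start_s + wanted_s:
                return dial - start_s  # dial drops the active session
        return wanted_s


dials = [3, 33, 63, 93]  # illustrative dial times, in seconds
print(ModemModel(destination_set=True, hang_s=5).inbound_lifetime(0, 120, dials))
print(ModemModel(destination_set=False).inbound_lifetime(0, 120, dials))
```

With these illustrative numbers the session lasts 3 seconds while the destination is set and the full 120 seconds once it is cleared, which mirrors the before and after behavior recorded in the incident log.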
**This is a modem-layer issue, not a device firmware issue.** The device
is alive and responsive the whole time — confirmed in the BE9558H
recovery by 990 bytes of S3 responses received over a 120s slow-drip
session once the modem was no longer mode-flipping.
---
## Why simpler approaches don't work
| Approach | Why it fails |
|---|---|
| Standard `/device/info` | Triggers `count_events()` 1E/1F walk, takes 90s+ and hits corrupted event chain in this scenario |
| `/device/rescue` race loop | Gets 502 (protocol timeout) because the modem closes the TCP before the POLL handshake can complete |
| `/device/stop_monitoring_blind` (single frame) | Even if the bytes leave the wire, the device's protocol parser ignores write commands without a preceding POLL handshake (early-version bug, now fixed by including POLL preamble in blind sends) |
| `/device/stop_monitoring_spam` (sub-second cadence) | Each session is killed by the modem's mode-flip before the device can drain its UART RX buffer; high-rate spam also risks UART FIFO overrun on the device side |
| Outbound port firewall block alone | Stops the outbound TCP from succeeding, but doesn't stop the modem from *trying* and mode-flipping. Reduces but doesn't eliminate the contention. |
| Modem reboot | Temporary — as soon as the device starts triggering again, the loop resumes within seconds |
The combination of `slow_drip` + cleared `Destination Address` works because:
1. The modem stops mode-flipping → TCP session stays open for the full
drip duration
2. Slow drip rate → device's UART RX FIFO never overflows even if
firmware is busy with event recording
3. The drip is `SESSION_RESET + STOP_MONITORING` every 3s → many
independent chances for the parser to land one valid frame
4. Once one Stop Monitoring is parsed, event recording halts → firmware
has CPU to spare → subsequent operations are trivially easy
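The held-session drip described in points 1-3 can be sketched as a simple loop. This is an illustration under assumptions, not the code of `/device/stop_monitoring_slow_drip`: the function name, the return shape, and the `frame` argument (standing in for the `SESSION_RESET + STOP_MONITORING` bytes) are hypothetical, and reading the device's responses is left out.

```python
import time


def slow_drip(sock, frame: bytes, duration_s: float = 120, interval_s: float = 3,
              clock=time.monotonic, sleep=time.sleep) -> dict:
    """Send `frame` every interval_s seconds over one held connection."""
    started = clock()
    drips_sent = 0
    send_error = None
    while clock() - started < duration_s:
        try:
            sock.sendall(frame)
        except OSError as exc:  # e.g. BrokenPipeError when the modem drops us
            send_error = str(exc)
            break
        drips_sent += 1
        sleep(interval_s)  # give the device's UART RX FIFO time to drain
    return {
        "duration_s": clock() - started,
        "drips_sent": drips_sent,
        "send_error": send_error,
    }
```

With the defaults the loop sends 120 / 3 = 40 frames, which is where the `drips_sent ≈ 40` expectation in Step 2 comes from. `clock` and `sleep` are injectable so the loop can be exercised without waiting two minutes.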
---
## Tooling reference
All endpoints live in `seismo-relay/sfm/server.py`. All scripts live in
`seismo-relay/scripts/` and default to SFM direct (`http://localhost:8200`),
overridable via `SFM_BASE_URL`.
### Endpoints added during BE9558H recovery
| Endpoint | Purpose |
|---|---|
| `GET /device/events/storage_range` | SUB 0x06 — first/last event keys, `is_empty` flag. ~2s, no event walk. |
| `GET /device/events/index` | SUB 0x08 — lifetime event counter (does NOT decrement on erase). ~2s. |
| `POST /device/events/erase` | Full erase sequence 0xA3 → 0x1C → 0x06 → 0xA2. |
| `POST /device/rescue` | Disable ACH + erase in one TCP session. Short timeouts for race-loop usage. |
| `POST /device/stop_monitoring_blind` | Fire-and-forget Stop with full POLL preamble (single attempt). |
| `POST /device/stop_monitoring_spam` | Server-side tight retry loop, sub-second cadence, duration-bounded. |
| `POST /device/stop_monitoring_slow_drip` | One held TCP session, slow trickle of stop frames. **The endpoint that saved BE9558H.** |
Also changed: default protocol recv timeout dropped from 30s → 10s in
`_build_client`. Added `connect_timeout` knob to same. Cleaned up
unhandled-exception path in `/device/monitor/status` so it returns 502
instead of 500 on protocol timeouts.
### Scripts
| Script | Purpose |
|---|---|
| `scripts/rescue_device.sh` | Race-loop wrapper around `/device/rescue` |
| `scripts/blind_stop.sh` | Race-loop wrapper around `/device/stop_monitoring_blind` |
| `scripts/spam_stop.sh` | Single-call burst hammer (`/device/stop_monitoring_spam`) |
| `scripts/slow_drip.sh` | Single-call held-session drip (`/device/stop_monitoring_slow_drip`) |
| `scripts/watch_unit.sh` | Passive periodic reachability check, logs to file |
---
## Incident log — BE9558H, 2026-05-16/17
What was wrong: Long-axis geophone developed an offset, constantly above
trigger threshold → constant event recording → after-event ACH set →
modem dialing office BW server (`50.197.32.92:12345`) every 30-60s.
Local event chain corrupted (`next_boundary 0x100EE exceeds uint16`).
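The `next_boundary` error is plain arithmetic: the value needs 17 bits. The check below shows the overflow. The masked value is included only to show what a 16-bit field could hold, not as a claim about how the firmware or parser handles it.

```python
UINT16_MAX = 0xFFFF  # largest value a 16-bit unsigned field can hold

next_boundary = 0x100EE  # value reported in the corruption message

print(next_boundary)                    # decimal value
print(next_boundary > UINT16_MAX)       # True means it cannot fit in uint16
print(hex(next_boundary & UINT16_MAX))  # low 16 bits only (illustration)
```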
Diagnostic path:
1. `/device/info` slow, choked on event walk
2. Built lightweight probe endpoints (`storage_range`, `index`) — useful
but didn't reach the wedged unit
3. Built `/device/rescue` with short timeouts — got 502 (POLL no response)
4. Built `/device/stop_monitoring_blind` — first version was a false
positive (no POLL preamble); fixed by including
`SESSION_RESET+POLL_PROBE+SESSION_RESET+POLL_DATA` in the dump
5. Verified blind stop works on bench unit
6. Built `/device/stop_monitoring_spam` — 420 successful sends over
5 min, zero behavior change on field unit
7. Inspected ACEmanager logs → saw outbound dial-out attempts every ~30s,
confirmed device was not fully locked up
8. Added outbound port-12345 firewall block → outbound attempts now fail
instantly but contention persisted
9. Built `/device/stop_monitoring_slow_drip` — session died at 3s with
broken pipe (modem closing on us)
10. Looked at full ACEmanager Port Configuration → **found
`Destination Address: 50.197.32.92` configured**, realized every AT
dial command was triggering a modem mode-flip that killed our inbound
11. Cleared Destination Address + Port → slow_drip held 120s, device
responded with 990 bytes, 39 stop commands acked
12. Disabled ACH at device level via `/device/call_home`, erased events
Final state: device IDLE, memory 958.1 / 960 KB free, ACH disabled at
device level, modem destination cleared (to be restored after physical
service).
Total time from the first "i was wondering if its possible to" attempt to
recovery: ~7 hours of intermittent debugging across one evening.
+100
@@ -0,0 +1,100 @@
#!/usr/bin/env bash
# Fire-and-forget Stop Monitoring loop — for wedged or constantly-triggering units.
#
# Hammers POST /device/stop_monitoring_blind in a tight loop. The endpoint
# opens TCP, dumps SESSION_RESET + a few copies of the SUB 0x97 frame, and
# closes — without ever reading an S3 response. Each TCP-won attempt is
# ~50ms of wire activity instead of the multi-frame handshake the regular
# rescue endpoint does, so windows that are too small for the full rescue
# can still land a stop-monitoring command.
#
# Usage:
# ./blind_stop.sh <host> [tcp_port]
#
# Env:
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
# Set to http://localhost:8001/api/sfm to route through
# Terra-View's proxy.
# MAX_ATTEMPTS Default: 600
# SLEEP_S Default: 0 (no backoff — hammer it)
# MAX_TIME_S Default: 15
# CONNECT_TIMEOUT Default: 5
# REPEAT Frames per TCP session (default 3 — increases hit rate
# if the device is busy reading its own buffer).
# STOP_ON_OK Default: 1. Set to 0 to keep hammering indefinitely
# even after successful sends (every 503 means the device
# is in *another* session, every 200 means our bytes got
# through — but the device may not have processed them).
set -u
host="${1:-}"
tcp_port="${2:-9034}"
if [[ -z "$host" ]]; then
echo "usage: $0 <host> [tcp_port]" >&2
exit 2
fi
base="${SFM_BASE_URL:-http://localhost:8200}"
max_attempts="${MAX_ATTEMPTS:-600}"
sleep_s="${SLEEP_S:-0}"
max_time_s="${MAX_TIME_S:-15}"
connect_timeout="${CONNECT_TIMEOUT:-5}"
repeat="${REPEAT:-3}"
stop_on_ok="${STOP_ON_OK:-1}"
url="${base}/device/stop_monitoring_blind?host=${host}&tcp_port=${tcp_port}&connect_timeout=${connect_timeout}&repeat=${repeat}"
echo "blind_stop: target ${host}:${tcp_port} connect_timeout=${connect_timeout}s repeat=${repeat}"
echo "blind_stop: POST ${url}"
echo "blind_stop: up to ${max_attempts} attempts, ${sleep_s}s between, ${max_time_s}s per request"
echo "blind_stop: stop_on_ok=${stop_on_ok}"
echo
ok_count=0
busy_count=0
err_count=0
started=$(date +%s)
for ((i=1; i<=max_attempts; i++)); do
printf "[%4d] %s " "$i" "$(date +%H:%M:%S)"
http_code=$(curl -sS -o /tmp/blind_resp.$$ -w "%{http_code}" \
--max-time "$max_time_s" \
-X POST "$url" || echo "000")
body=$(cat /tmp/blind_resp.$$ 2>/dev/null || true)
rm -f /tmp/blind_resp.$$
case "$http_code" in
200|201)
ok_count=$((ok_count + 1))
echo "SENT $body"
if [[ "$stop_on_ok" == "1" ]]; then
elapsed=$(( $(date +%s) - started ))
echo
echo "blind_stop: success after ${i} attempts (${elapsed}s). ok=${ok_count} busy=${busy_count} err=${err_count}"
echo "blind_stop: NEXT — wait ~10s, then try the full rescue:"
echo " /home/serversdown/seismo-relay/scripts/rescue_device.sh ${host} ${tcp_port}"
exit 0
fi
;;
503)
busy_count=$((busy_count + 1))
echo "busy (503)"
;;
000)
err_count=$((err_count + 1))
echo "curl error"
;;
*)
err_count=$((err_count + 1))
echo "HTTP $http_code $body" | head -c 400
echo
;;
esac
[[ "$sleep_s" != "0" ]] && sleep "$sleep_s"
done
elapsed=$(( $(date +%s) - started ))
echo
echo "blind_stop: gave up after ${max_attempts} attempts (${elapsed}s). ok=${ok_count} busy=${busy_count} err=${err_count}" >&2
exit 1
+99
@@ -0,0 +1,99 @@
#!/usr/bin/env bash
# Rescue an uncooperative MiniMate that's busy with another ACH session.
#
# Hammers POST /device/rescue in a tight loop with a short timeout. When the
# device is in an ACH session our SYN either gets refused or silently dropped
# (5s connect timeout inside the endpoint) and we retry immediately. When the
# device is between sessions, our TCP wins, the endpoint disables Auto Call
# Home and erases events inside the same session, then returns success.
#
# Usage:
# ./rescue_device.sh <host> [tcp_port] [--no-erase] [--no-disable-ach]
#
# Examples:
# ./rescue_device.sh 166.246.130.1 9034
# ./rescue_device.sh 166.246.130.1 9034 --no-erase # just silence it
#
# Environment:
# SFM_BASE_URL Defaults to http://localhost:8200 (SFM direct).
# Set to http://localhost:8001/api/sfm to route through
# Terra-View's proxy. Direct mode avoids the proxy's
# 60s timeout, which matters for long-running endpoints.
# MAX_ATTEMPTS Cap on retries (default 600 ≈ 30+ min).
# SLEEP_S Backoff between attempts (default 1).
# MAX_TIME_S Per-request timeout (default 60).
# CONNECT_TIMEOUT TCP connect timeout (default 5).
# RECV_TIMEOUT Per-frame S3 recv timeout (default 5). If POLL or any
# subsequent frame doesn't respond within this window, the
# rescue endpoint bails and this script retries.
set -u
host="${1:-}"
tcp_port="${2:-9034}"
shift 2 2>/dev/null || shift $# 2>/dev/null
if [[ -z "$host" ]]; then
echo "usage: $0 <host> [tcp_port] [--no-erase] [--no-disable-ach]" >&2
exit 2
fi
disable_ach="true"
erase="true"
for arg in "$@"; do
case "$arg" in
--no-erase) erase="false" ;;
--no-disable-ach) disable_ach="false" ;;
*) echo "unknown flag: $arg" >&2; exit 2 ;;
esac
done
base="${SFM_BASE_URL:-http://localhost:8200}"
max_attempts="${MAX_ATTEMPTS:-600}"
sleep_s="${SLEEP_S:-1}"
max_time_s="${MAX_TIME_S:-60}"
connect_timeout="${CONNECT_TIMEOUT:-5}"
recv_timeout="${RECV_TIMEOUT:-5}"
url="${base}/device/rescue?host=${host}&tcp_port=${tcp_port}&disable_ach=${disable_ach}&erase=${erase}&connect_timeout=${connect_timeout}&recv_timeout=${recv_timeout}"
echo "rescue: target ${host}:${tcp_port} disable_ach=${disable_ach} erase=${erase}"
echo "rescue: connect_timeout=${connect_timeout}s recv_timeout=${recv_timeout}s"
echo "rescue: POST ${url}"
echo "rescue: up to ${max_attempts} attempts, ${sleep_s}s between, ${max_time_s}s per request"
echo
started=$(date +%s)
for ((i=1; i<=max_attempts; i++)); do
printf "[%3d] %s " "$i" "$(date +%H:%M:%S)"
http_code=$(curl -sS -o /tmp/rescue_resp.$$ -w "%{http_code}" \
--max-time "$max_time_s" \
-X POST "$url" || echo "000")
body=$(cat /tmp/rescue_resp.$$ 2>/dev/null || true)
rm -f /tmp/rescue_resp.$$
case "$http_code" in
200|201)
elapsed=$(( $(date +%s) - started ))
echo "OK (${elapsed}s total)"
echo "$body"
exit 0
;;
503)
# Connection refused / timeout — device busy in another session. Retry fast.
echo "busy (503)"
;;
000)
echo "curl error (network)"
;;
*)
echo "HTTP $http_code"
echo " $body" | head -c 400
echo
;;
esac
sleep "$sleep_s"
done
echo "rescue: gave up after ${max_attempts} attempts" >&2
exit 1
+44
@@ -0,0 +1,44 @@
#!/usr/bin/env bash
# Hold a single TCP session open and drip stop-monitoring frames at a slow
# rate, so the device's UART RX FIFO has time to drain between sends.
#
# Use when high-rate spam isn't landing — typically because the device's
# firmware is too busy to drain its serial buffer fast enough and bytes
# are being lost to UART overrun.
#
# Usage:
# ./slow_drip.sh <host> [tcp_port] [duration_s]
#
# Env:
# DURATION Default: 120 (seconds; arg 3 overrides). Clamped 1..600.
# INTERVAL Seconds between drip sends (default 3). Lower = more
# aggressive, more risk of FIFO overrun. Higher = safer
# but fewer total drips per duration.
# CONNECT_TIMEOUT Default: 5
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
set -u
host="${1:-}"
tcp_port="${2:-9034}"
duration="${3:-${DURATION:-120}}"
if [[ -z "$host" ]]; then
echo "usage: $0 <host> [tcp_port] [duration_s]" >&2
exit 2
fi
base="${SFM_BASE_URL:-http://localhost:8200}"
interval="${INTERVAL:-3}"
connect_timeout="${CONNECT_TIMEOUT:-5}"
url="${base}/device/stop_monitoring_slow_drip?host=${host}&tcp_port=${tcp_port}&duration_s=${duration}&interval_s=${interval}&connect_timeout=${connect_timeout}"
echo "slow_drip: target ${host}:${tcp_port} duration=${duration}s interval=${interval}s connect_timeout=${connect_timeout}s"
echo "slow_drip: POST ${url}"
echo
# Give curl enough slack to wait out the duration plus a buffer
max_time=$(awk -v d="$duration" 'BEGIN { printf "%d", d + 30 }')
curl -sS --max-time "$max_time" -X POST "$url"
echo
+48
@@ -0,0 +1,48 @@
#!/usr/bin/env bash
# Hammer a device with blind stop-monitoring sessions as fast as possible.
# Single HTTP call kicks off the burst inside SFM (no per-attempt HTTP
# overhead). Default: 10 seconds, ~500 ms per attempt = ~20 attempts per burst.
#
# Usage:
# ./spam_stop.sh <host> [tcp_port] [duration_s]
#
# Examples:
# ./spam_stop.sh 166.246.130.1 # 10s burst
# ./spam_stop.sh 166.246.130.1 9034 30 # 30s burst
# DURATION=60 CONNECT_TIMEOUT=0.2 ./spam_stop.sh 166.246.130.1
#
# Env:
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
# Set to http://localhost:8001/api/sfm to route through
# Terra-View's proxy — but note the proxy has a 60s
# timeout, so long bursts need direct mode.
# DURATION Default: 10 (seconds; arg 3 overrides)
# CONNECT_TIMEOUT Default: 0.5 (seconds)
# REPEAT Default: 3 (stop frames per TCP session)
set -u
host="${1:-}"
tcp_port="${2:-9034}"
duration="${3:-${DURATION:-10}}"
if [[ -z "$host" ]]; then
echo "usage: $0 <host> [tcp_port] [duration_s]" >&2
exit 2
fi
base="${SFM_BASE_URL:-http://localhost:8200}"
connect_timeout="${CONNECT_TIMEOUT:-0.5}"
repeat="${REPEAT:-3}"
url="${base}/device/stop_monitoring_spam?host=${host}&tcp_port=${tcp_port}&duration_s=${duration}&connect_timeout=${connect_timeout}&repeat=${repeat}"
echo "spam_stop: target ${host}:${tcp_port} duration=${duration}s connect_timeout=${connect_timeout}s repeat=${repeat}"
echo "spam_stop: POST ${url}"
echo
# Give curl enough slack to wait out the duration plus a buffer
max_time=$(awk -v d="$duration" 'BEGIN { printf "%d", d + 10 }')
curl -sS --max-time "$max_time" -X POST "$url"
echo
+58
@@ -0,0 +1,58 @@
#!/usr/bin/env bash
# Passive monitor for a misbehaving unit. Every INTERVAL seconds, attempts
# a single short TCP probe + storage_range read and logs the result. Designed
# to run unattended for hours/days and tell you when the unit comes back.
#
# Usage:
# ./watch_unit.sh <host> [tcp_port]
#
# Env:
# INTERVAL Seconds between checks (default 300 = 5 min)
# LOG_FILE Append results here (default /tmp/watch_<host>.log)
# SFM_BASE_URL Default: http://localhost:8200
set -u
host="${1:-}"
tcp_port="${2:-9034}"
if [[ -z "$host" ]]; then
echo "usage: $0 <host> [tcp_port]" >&2
exit 2
fi
interval="${INTERVAL:-300}"
log_file="${LOG_FILE:-/tmp/watch_${host}.log}"
base="${SFM_BASE_URL:-http://localhost:8200}"
url="${base}/device/events/storage_range?host=${host}&tcp_port=${tcp_port}"
echo "watch_unit: target ${host}:${tcp_port} interval=${interval}s log=${log_file}"
echo "watch_unit: Ctrl-C to stop"
while true; do
ts=$(date '+%Y-%m-%d %H:%M:%S')
http_code=$(curl -sS -o /tmp/watch_resp.$$ -w "%{http_code}" \
--max-time 20 "$url" || echo "000")
body=$(cat /tmp/watch_resp.$$ 2>/dev/null || true)
rm -f /tmp/watch_resp.$$
case "$http_code" in
200|201)
# Strip the raw_hex for readability
summary=$(echo "$body" | sed 's/"raw_hex":"[^"]*",*//; s/,*$//' | head -c 200)
echo "$ts REACHABLE $summary" | tee -a "$log_file"
;;
502|503)
err=$(echo "$body" | head -c 150)
echo "$ts ERROR_$http_code $err" | tee -a "$log_file"
;;
000)
echo "$ts CURL_FAIL (network/timeout)" | tee -a "$log_file"
;;
*)
echo "$ts HTTP_$http_code $(echo "$body" | head -c 150)" | tee -a "$log_file"
;;
esac
sleep "$interval"
done
+69
@@ -491,6 +491,75 @@ class SeismoDb:
            )
            return cur.rowcount > 0

    def delete_event(self, event_id: str) -> Optional[dict]:
        """
        Hard-delete one event row by id. Returns the deleted row (so the
        caller can clean up any on-disk files referenced by it) or None
        if no row matched.
        """
        with self._connect() as conn:
            row = conn.execute(
                "SELECT * FROM events WHERE id = ?", (event_id,),
            ).fetchone()
            if row is None:
                return None
            conn.execute("DELETE FROM events WHERE id = ?", (event_id,))
        return dict(row)

    def delete_events_bulk(
        self,
        serial: Optional[str] = None,
        from_dt: Optional[datetime.datetime] = None,
        to_dt: Optional[datetime.datetime] = None,
        false_trigger: Optional[bool] = None,
        ids: Optional[list[str]] = None,
    ) -> list[dict]:
        """
        Hard-delete events matching the given filters. Returns the list
        of deleted row dicts. Refuses to delete with no filters at all
        (would wipe the whole table) and raises ValueError instead.

        Filter semantics match query_events: serial / from_dt / to_dt /
        false_trigger combine with AND. `ids` is an additional inclusion
        list (id IN (...)); if supplied alongside other filters,
        only rows matching all conditions are deleted.
        """
        clauses: list[str] = []
        params: list = []
        if serial:
            clauses.append("serial = ?")
            params.append(serial)
        if from_dt:
            clauses.append("timestamp >= ?")
            params.append(from_dt.isoformat())
        if to_dt:
            clauses.append("timestamp <= ?")
            params.append(to_dt.isoformat())
        if false_trigger is not None:
            clauses.append("false_trigger = ?")
            params.append(1 if false_trigger else 0)
        if ids:
            placeholders = ",".join("?" * len(ids))
            clauses.append(f"id IN ({placeholders})")
            params.extend(ids)
        if not clauses:
            raise ValueError(
                "delete_events_bulk refuses to delete with no filters "
                "(would wipe the entire events table)"
            )
        where = "WHERE " + " AND ".join(clauses)
        with self._connect() as conn:
            rows = conn.execute(
                f"SELECT * FROM events {where}", params,
            ).fetchall()
            if rows:
                conn.execute(f"DELETE FROM events {where}", params)
        return [dict(r) for r in rows]

    def update_event_review(self, event_id: str, review: dict) -> bool:
        """
        Sync derived index columns from a sidecar's `review` block.
+720 -2
@@ -36,6 +36,7 @@ from __future__ import annotations
import datetime
import logging
import socket
import sys
import tempfile
import threading
@@ -63,7 +64,9 @@ from minimateplus.protocol import ProtocolError
from minimateplus.models import CallHomeConfig, ComplianceConfig, DeviceInfo, Event, PeakValues, ProjectInfo, Timestamp
from minimateplus.transport import TcpTransport, DEFAULT_TCP_PORT
from minimateplus.blastware_file import write_blastware_file, blastware_filename
-from minimateplus.client import _decode_a5_metadata_into, _decode_a5_waveform
+from minimateplus.client import _decode_a5_metadata_into, _decode_a5_waveform, _decode_event_count
from minimateplus.framing import build_bw_write_frame, SESSION_RESET, POLL_PROBE, POLL_DATA
from minimateplus.protocol import SUB_STOP_MONITORING
from sfm import event_hdf5
from sfm.cache import SFMCache, get_cache
from sfm.database import SeismoDb
@@ -268,7 +271,8 @@ def _build_client(
    baud: int,
    host: Optional[str],
    tcp_port: int,
-    timeout: float = 30.0,
+    timeout: float = 10.0,
    connect_timeout: Optional[float] = None,
) -> MiniMateClient:
    """
    Return a MiniMateClient configured for either serial or TCP transport.
@@ -276,11 +280,23 @@ def _build_client(
    TCP takes priority if *host* is supplied; otherwise *port* (serial) is used.
    Raises HTTPException(422) if neither is provided.

    Default *timeout* is 10s — the device usually responds in well under a
    second over cellular; 10s leaves comfortable headroom for retransmits
    while still failing reasonably fast when a unit is wedged.

    Use timeout=120.0 (or higher) for endpoints that perform a full 5A waveform
    download — a 70-second event at 1024 sps takes 2-3 minutes to transfer over
    cellular and each individual recv must complete within the timeout window.

    *connect_timeout* (TCP only) overrides the TcpTransport default (10s) for
    the initial TCP SYN/handshake. Use a small value (e.g. 5s) in rescue/race
    scenarios where the device is busy in another session and you want to
    fail fast and retry quickly.
    """
    if host:
        if connect_timeout is not None:
            transport = TcpTransport(host, port=tcp_port, connect_timeout=connect_timeout)
        else:
            transport = TcpTransport(host, port=tcp_port)
        log.debug("TCP transport: %s:%d timeout=%.0fs", host, tcp_port, timeout)
        return MiniMateClient(transport=transport, timeout=timeout)
@@ -1095,6 +1111,7 @@ def device_monitor_status(
        cached["_cached"] = True
        return cached

    try:
        with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
            try:
                client.poll()
@@ -1102,6 +1119,15 @@ def device_monitor_status(
                log.warning("monitor status poll retry: %s", exc)
                client.poll()
            status = client.get_monitor_status()
    except HTTPException:
        raise
    except ProtocolError as exc:
        # Includes minimateplus.protocol.TimeoutError ("device unresponsive").
        raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
    except OSError as exc:
        raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc

    result: dict = {"is_monitoring": status.is_monitoring}
    if status.battery_v is not None:
@@ -1117,6 +1143,529 @@ def device_monitor_status(
    return result
@app.get("/device/events/storage_range")
def device_events_storage_range(
    port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
    baud: int = Query(38400, description="Serial baud rate"),
    host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
    tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
) -> dict:
    """
    Read the device's event storage range (SUB 0x06) — first and last
    stored event keys. POLL handshake + one read; no connect(), no
    config reads, no event walk. Completes in ~2 seconds.

    Useful for checking whether the device has any stored events
    without invoking the slow count_events() 1E/1F chain. Both keys =
    `01110000` means the device is empty.
    """
    log.info("GET /device/events/storage_range host=%s tcp_port=%s", host, tcp_port)
    try:
        def _do():
            with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
                try:
                    client.poll()
                except Exception as exc:
                    log.warning("storage_range poll retry: %s", exc)
                    client.poll()
                proto = client._require_proto()
                return proto.read_event_storage_range()
        rng = _run_with_retry(_do, is_tcp=_is_tcp(host))
    except HTTPException:
        raise
    except ProtocolError as exc:
        raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
    except OSError as exc:
        raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
    except Exception as exc:
        raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc

    data = bytes(rng.data)
    result: dict = {"raw_len": len(data), "raw_hex": data.hex()}
    if len(data) >= 8:
        first_key = data[-8:-4].hex()
        last_key = data[-4:].hex()
        result["first_key"] = first_key
        result["last_key"] = last_key
        result["is_empty"] = (first_key == "01110000" and last_key == "01110000")
    return result
@app.get("/device/events/index")
def device_events_index(
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
baud: int = Query(38400, description="Serial baud rate"),
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
) -> dict:
"""
Read the device's event index (SUB 0x08) — returns the lifetime
event counter at data[10:12] (uint16 BE). POLL handshake + one
read; no connect(), no config reads, no event walk. ~2 seconds.
Note: this is a LIFETIME counter (events ever recorded) — it does
NOT decrement when events are erased. After an erase, the device
counter resets to 0 only on the next recorded event. For "are
there stored events right now?" use /device/events/storage_range
instead.
"""
log.info("GET /device/events/index host=%s tcp_port=%s", host, tcp_port)
try:
def _do():
with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
try:
client.poll()
except Exception as exc:
log.warning("event_index poll retry: %s", exc)
client.poll()
proto = client._require_proto()
return proto.read_event_index()
idx_raw = _run_with_retry(_do, is_tcp=_is_tcp(host))
except HTTPException:
raise
except ProtocolError as exc:
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
except OSError as exc:
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
except Exception as exc:
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
raw = bytes(idx_raw)
result: dict = {"raw_len": len(raw), "raw_hex": raw.hex()}
try:
result["lifetime_count"] = _decode_event_count(raw)
except Exception as exc:
result["decode_error"] = str(exc)
return result
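# Illustrative sketch (not part of this commit): reading a big-endian uint16 at
# data[10:12], as the docstring above describes. The real `_decode_event_count`
# helper is not shown in this diff, so `decode_lifetime_count` is a hypothetical
# stand-in and may differ from it (for example in how it treats short payloads).

```python
def decode_lifetime_count(raw: bytes) -> int:
    """Return the uint16 big-endian counter stored at offsets 10..11."""
    if len(raw) < 12:
        raise ValueError(f"payload too short for event counter: {len(raw)} bytes")
    return int.from_bytes(raw[10:12], "big")


# Ten filler bytes, then 0x012C, then four trailing filler bytes.
sample = bytes(10) + bytes([0x01, 0x2C]) + bytes(4)
lifetime = decode_lifetime_count(sample)

try:
    decode_lifetime_count(bytes(5))
    short_payload_rejected = False
except ValueError:
    short_payload_rejected = True
```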
@app.post("/device/events/erase")
def device_events_erase(
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
baud: int = Query(38400, description="Serial baud rate"),
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
) -> dict:
"""
Erase ALL stored events from the device memory.
Sequence: SUB 0xA3 → 0x1C → 0x06 → 0xA2 (confirmed 2026-04-11).
After this call the unit's event memory is empty and event keys reset
to 0x01110000. The device returns to its normal operating state
automatically — no restart-monitoring call is needed.
Note: this endpoint does NOT touch the ACH server's `ach_state.json`.
If a call-home subsequently lands on the ACH server, its post-erase
detection logic (max(device_keys) vs max_downloaded_key) handles the
key-counter rollback.
"""
log.info("POST /device/events/erase port=%s host=%s tcp_port=%s", port, host, tcp_port)
try:
def _do():
with _build_client(port, baud, host, tcp_port) as client:
client.connect()
client.delete_all_events()
_run_with_retry(_do, is_tcp=_is_tcp(host))
except HTTPException:
raise
except ProtocolError as exc:
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
except OSError as exc:
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
except Exception as exc:
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
conn_key = SFMCache.make_conn_key(host, tcp_port, port, baud)
cleared = get_cache().clear_device(conn_key)
return {
"status": "ok",
"message": "Device event memory cleared",
"cache_cleared": cleared,
}
@app.post("/device/stop_monitoring_blind")
def device_stop_monitoring_blind(
host: str = Query(..., description="TCP host — modem IP"),
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
connect_timeout: float = Query(5.0, description="TCP connect timeout in seconds (default 5)"),
repeat: int = Query(3, description="How many times to send the frame within one TCP session (default 3)"),
) -> dict:
"""
Fire-and-forget Stop Monitoring (SUB 0x97). TCP-only.
Opens a TCP session, dumps the FULL handshake the device's protocol
state machine expects — `SESSION_RESET + POLL_PROBE + SESSION_RESET +
POLL_DATA` — and then N back-to-back copies of the stop-monitoring
frame. Does NOT read any S3 response. Succeeds as long as the bytes
left the socket.
The POLL handshake bytes are required: monitoring units ignore command
frames received without a preceding POLL exchange. Sending the POLL
bytes "blind" (without reading the responses) still works because the
device processes inbound bytes in order regardless of whether we drain
its outbound buffer.
Idempotent: the device processes extra copies of SUB 0x97 the same as
one (already-stopped is a no-op).
Returns the number of bytes sent. A 503 means the TCP connect failed
(device busy in another session; caller should retry). A 502 means the
connect succeeded but the write failed.
"""
log.info(
"POST /device/stop_monitoring_blind host=%s tcp_port=%s connect_timeout=%.1fs repeat=%d",
host, tcp_port, connect_timeout, repeat,
)
if repeat < 1:
repeat = 1
frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
payload = (
SESSION_RESET + POLL_PROBE
+ SESSION_RESET + POLL_DATA
+ (frame * repeat)
)
t0 = time.monotonic()
transport = TcpTransport(host, port=tcp_port, connect_timeout=connect_timeout)
try:
transport.connect()
except OSError as exc:
raise HTTPException(status_code=503, detail=f"Connection error: {exc}") from exc
    try:
        transport.write(payload)
    except OSError as exc:
        # No explicit disconnect here: the finally clause below always runs,
        # so the transport is closed exactly once.
        raise HTTPException(status_code=502, detail=f"Send error: {exc}") from exc
    finally:
        transport.disconnect()
return {
"status": "sent",
"bytes_sent": len(payload),
"frame_size": len(frame),
"repeat": repeat,
"elapsed_s": round(time.monotonic() - t0, 3),
}
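# Illustrative sketch (not part of this commit): how the blind payload above is
# assembled. The real SESSION_RESET / POLL_PROBE / POLL_DATA constants and
# `build_bw_write_frame` live elsewhere in the codebase; the byte strings below
# are readable placeholders, NOT protocol values, and `build_blind_payload` is a
# hypothetical name.

```python
def build_blind_payload(
    session_reset: bytes,
    poll_probe: bytes,
    poll_data: bytes,
    frame: bytes,
    repeat: int,
) -> bytes:
    """Concatenate the two-phase POLL handshake and `repeat` copies of a frame."""
    repeat = max(1, repeat)  # same floor the endpoint applies
    return (
        session_reset + poll_probe
        + session_reset + poll_data
        + frame * repeat
    )


# Placeholder bytes chosen so the ordering is visible in the result.
payload = build_blind_payload(b"RS", b"PROBE", b"DATA", b"STOP", repeat=3)
floored = build_blind_payload(b"RS", b"PROBE", b"DATA", b"STOP", repeat=0)
```

# Because nothing is read back, the whole payload can be written in one call; the
# device is expected to consume the handshake bytes before the command frames.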
@app.post("/device/stop_monitoring_slow_drip")
def device_stop_monitoring_slow_drip(
host: str = Query(..., description="TCP host — modem IP"),
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
duration_s: float = Query(120.0, description="Total time to hold the session open (seconds)"),
interval_s: float = Query(3.0, description="Seconds between drip sends"),
connect_timeout: float = Query(5.0, description="TCP connect timeout"),
) -> dict:
"""
Hold a single TCP session open for *duration_s* seconds and drip
stop-monitoring frames into the device at a slow rate so its UART
RX FIFO has time to drain between sends.
Sequence:
1. Open TCP session.
2. Send the wake preamble: SESSION_RESET + POLL_PROBE +
SESSION_RESET + POLL_DATA (so the device's protocol parser
is primed for a write command).
3. Wait interval_s for the device to drain.
4. Drip-send (SESSION_RESET + stop_monitoring_frame) every
interval_s until duration_s elapses.
5. Opportunistically drain any bytes the device sends back (so
the modem's TX queue doesn't fill up). Successful drains are
counted in `bytes_received` — non-zero strongly suggests the
device has started responding to us.
6. Close.
Designed for units whose firmware is too busy with event-recording
to keep up with high-rate spam. Heavy spam overruns the UART FIFO;
slow drip stays under it.
Compared to spam mode: ~40× fewer bytes/sec on the wire, but each
byte has a much higher chance of actually being parsed.
"""
log.info(
"POST /device/stop_monitoring_slow_drip host=%s tcp_port=%s duration=%.1fs interval=%.2fs connect_timeout=%.1fs",
host, tcp_port, duration_s, interval_s, connect_timeout,
)
duration_s = max(1.0, min(duration_s, 600.0)) # clamp 1s..10min
interval_s = max(0.1, min(interval_s, 30.0))
connect_timeout = max(0.1, connect_timeout)
stop_frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
preamble = (
SESSION_RESET + POLL_PROBE
+ SESSION_RESET + POLL_DATA
)
t0 = time.monotonic()
drips_sent = 0
bytes_sent = 0
bytes_received = 0
try:
sock = socket.create_connection((host, tcp_port), timeout=connect_timeout)
except OSError as exc:
raise HTTPException(status_code=503, detail=f"Connection error: {exc}") from exc
# Short read timeout so opportunistic drains don't block.
sock.settimeout(0.1)
try:
# Initial wake preamble.
try:
sock.sendall(preamble)
bytes_sent += len(preamble)
except OSError as exc:
raise HTTPException(status_code=502, detail=f"Preamble send failed: {exc}") from exc
# Initial settle.
time.sleep(interval_s)
# Short-timeout (0.1 s) drain of any response to the wake.
try:
data = sock.recv(4096)
if data:
bytes_received += len(data)
log.info("slow_drip: device responded to wake preamble (%d bytes)", len(data))
except socket.timeout:
pass
except OSError:
pass
deadline = t0 + duration_s
drip = SESSION_RESET + stop_frame # 2 + 21 = 23 bytes per drip
send_error: Optional[str] = None
while time.monotonic() < deadline:
try:
sock.sendall(drip)
bytes_sent += len(drip)
drips_sent += 1
except OSError as exc:
send_error = f"{exc}"
log.warning("slow_drip: send failed after %d drips: %s", drips_sent, exc)
break
# Drain any inbound bytes; ignore timeouts.
try:
data = sock.recv(4096)
if data:
bytes_received += len(data)
except socket.timeout:
pass
except OSError:
pass
# Sleep the interval, but don't oversleep past the deadline.
remaining = deadline - time.monotonic()
if remaining <= 0:
break
time.sleep(min(interval_s, remaining))
finally:
try:
sock.shutdown(socket.SHUT_RDWR)
except OSError:
pass
try:
sock.close()
except OSError:
pass
elapsed = time.monotonic() - t0
log.info(
"slow_drip done — drips=%d bytes_sent=%d bytes_received=%d in %.1fs",
drips_sent, bytes_sent, bytes_received, elapsed,
)
return {
"status": "done",
"duration_s": round(elapsed, 2),
"drips_sent": drips_sent,
"bytes_sent": bytes_sent,
"bytes_received": bytes_received,
"preamble_bytes": len(preamble),
"drip_bytes": len(drip),
"send_error": send_error,
}
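# Illustrative sketch (not part of this commit): the deadline-bounded drip loop
# above, with the clock and sleep injected so the schedule can be checked without
# waiting. `drip_until_deadline` and `FakeClock` are hypothetical names; the loop
# shape (send, then sleep min(interval, remaining)) follows the endpoint.

```python
import time
from typing import Callable


def drip_until_deadline(
    send: Callable[[], None],
    duration_s: float,
    interval_s: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call `send` every `interval_s` seconds until `duration_s` has elapsed."""
    deadline = clock() + duration_s
    sent = 0
    while clock() < deadline:
        send()
        sent += 1
        remaining = deadline - clock()
        if remaining <= 0:
            break
        sleep(min(interval_s, remaining))  # never oversleep past the deadline
    return sent


class FakeClock:
    """Deterministic clock: sleeping simply advances the current time."""

    def __init__(self) -> None:
        self.now = 0.0

    def __call__(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
send_times: list = []
drips = drip_until_deadline(
    lambda: send_times.append(fake.now),
    duration_s=10.0,
    interval_s=3.0,
    clock=fake,
    sleep=fake.sleep,
)
```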
@app.post("/device/stop_monitoring_spam")
def device_stop_monitoring_spam(
host: str = Query(..., description="TCP host — modem IP"),
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
duration_s: float = Query(10.0, description="How long to hammer the device for (seconds)"),
connect_timeout: float = Query(0.5, description="Per-attempt TCP connect timeout (default 0.5s)"),
repeat: int = Query(3, description="Stop frames per TCP session (default 3)"),
) -> dict:
"""
Hammer the device with blind stop-monitoring sessions as fast as
possible for `duration_s` seconds. Each attempt: open TCP → write
SESSION_RESET + POLL handshake + STOP frames × repeat → close. No
response is read.
Designed for units that are aggressively calling home — short
connect_timeout (default 500 ms) means every failed attempt loses
only that much time before retrying, so we can fit several attempts
per second even when the modem is mostly busy with its own outbound
sessions.
Single HTTP call kicks off the whole burst; counters are returned
when it finishes. No streaming; if you want live progress, watch
SFM logs.
"""
log.info(
"POST /device/stop_monitoring_spam host=%s tcp_port=%s duration=%.1fs connect_timeout=%.3fs repeat=%d",
host, tcp_port, duration_s, connect_timeout, repeat,
)
if repeat < 1:
repeat = 1
duration_s = max(0.1, min(duration_s, 300.0)) # clamp 0.1s..5min
connect_timeout = max(0.05, connect_timeout)
frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
payload = (
SESSION_RESET + POLL_PROBE
+ SESSION_RESET + POLL_DATA
+ (frame * repeat)
)
t0 = time.monotonic()
deadline = t0 + duration_s
sent_ok = 0
connect_failed = 0
write_failed = 0
while time.monotonic() < deadline:
try:
sock = socket.create_connection((host, tcp_port), timeout=connect_timeout)
except OSError:
connect_failed += 1
continue
try:
sock.sendall(payload)
sent_ok += 1
except OSError:
write_failed += 1
finally:
try:
sock.shutdown(socket.SHUT_RDWR)
except OSError:
pass
try:
sock.close()
except OSError:
pass
elapsed = time.monotonic() - t0
total = sent_ok + connect_failed + write_failed
log.info(
"stop_monitoring_spam done — sent=%d connect_failed=%d write_failed=%d in %.2fs",
sent_ok, connect_failed, write_failed, elapsed,
)
return {
"status": "done",
"duration_s": round(elapsed, 2),
"sent_ok": sent_ok,
"connect_failed": connect_failed,
"write_failed": write_failed,
"total_attempts": total,
"rate_attempts_per_s": round(total / elapsed, 1) if elapsed > 0 else 0,
"payload_bytes": len(payload),
}
@app.post("/device/rescue")
def device_rescue(
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
baud: int = Query(38400, description="Serial baud rate"),
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
connect_timeout: float = Query(5.0, description="TCP connect timeout in seconds (default 5)"),
recv_timeout: float = Query(5.0, description="Per-frame S3 recv timeout in seconds (default 5)"),
disable_ach: bool = Query(True, description="Disable Auto Call Home on the device before erasing"),
erase: bool = Query(True, description="Erase all stored events after disabling ACH"),
) -> dict:
"""
Rescue an uncooperative unit by squeezing all maintenance work into a
single TCP session.
Designed for devices that are actively calling home to a separate ACH
server (BW or otherwise). While we hold this TCP session open the
modem cannot accept an inbound ACH call, so the order matters:
1. Short-timeout TCP connect (fails fast if the device is busy in
another session — the caller should retry in a tight loop).
2. POLL handshake.
3. (optional) Write call_home config with auto_call_home_enabled=false
so the device stops calling out even after we drop the session.
4. (optional) Erase all stored events (0xA3 → 0x1C → 0x06 → 0xA2).
5. Close the TCP session.
Both `disable_ach` and `erase` default to true. Pass `?erase=false` if
you only want to silence the unit without wiping its events.
Caller pattern (bash):
until curl -sS --fail --max-time 30 -X POST \\
"http://localhost:8001/api/sfm/device/rescue?host=$IP&tcp_port=$P"; do
sleep 1
done
"""
log.info(
"POST /device/rescue host=%s tcp_port=%s connect_timeout=%.1fs recv_timeout=%.1fs disable_ach=%s erase=%s",
host, tcp_port, connect_timeout, recv_timeout, disable_ach, erase,
)
steps: list[dict] = []
t0 = time.monotonic()
try:
with _build_client(
port, baud, host, tcp_port,
timeout=recv_timeout,
connect_timeout=connect_timeout,
) as client:
steps.append({"step": "tcp_connect", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
try:
client.poll()
except Exception as exc:
log.warning("rescue: poll retry: %s", exc)
client.poll()
steps.append({"step": "poll", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
if disable_ach:
client.set_call_home_config(auto_call_home_enabled=False)
steps.append({"step": "disable_ach", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
if erase:
client.delete_all_events()
steps.append({"step": "erase", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
except ProtocolError as exc:
steps.append({"step": "error", "ok": False, "detail": f"protocol: {exc}"})
raise HTTPException(status_code=502, detail={"message": f"Protocol error: {exc}", "steps": steps}) from exc
except OSError as exc:
steps.append({"step": "error", "ok": False, "detail": f"socket: {exc}"})
# Connection refused / timed out → device busy in another session. Caller should retry.
raise HTTPException(status_code=503, detail={"message": f"Connection error: {exc}", "steps": steps}) from exc
except Exception as exc:
steps.append({"step": "error", "ok": False, "detail": str(exc)})
raise HTTPException(status_code=500, detail={"message": f"Device error: {exc}", "steps": steps}) from exc
conn_key = SFMCache.make_conn_key(host, tcp_port, port, baud)
cleared = get_cache().clear_device(conn_key)
return {
"status": "ok",
"elapsed_s": round(time.monotonic() - t0, 2),
"disable_ach": disable_ach,
"erase": erase,
"steps": steps,
"cache_cleared": cleared,
}
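# Illustrative sketch (not part of this commit): a Python counterpart to the bash
# retry loop in the docstring above. `retry_rescue` is a hypothetical helper; the
# `attempt` callable stands in for one POST to /device/rescue and returns the HTTP
# status code. Retrying only on 503 is a narrower choice than the bash loop and is
# an assumption based on the "caller should retry" note for connect failures.

```python
import time
from typing import Callable, Optional


def retry_rescue(
    attempt: Callable[[], int],
    max_attempts: int,
    delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Return the attempt number that succeeded, or None if all attempts got 503."""
    for n in range(1, max_attempts + 1):
        status = attempt()
        if status == 200:
            return n
        if status != 503:
            # 502 / 500 mean the session opened but something else went wrong.
            raise RuntimeError(f"rescue failed with HTTP {status}; not retrying")
        sleep(delay_s)
    return None


# Simulated device: busy twice, then reachable.
statuses = iter([503, 503, 200])
attempts_used = retry_rescue(lambda: next(statuses), max_attempts=5, sleep=lambda _s: None)
```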
@app.post("/device/monitor/start")
def device_monitor_start(
    port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
@@ -1403,6 +1952,175 @@ def db_set_false_trigger(
    return {"status": "ok", "event_id": event_id, "false_trigger": value}
def _cleanup_event_files(row: dict) -> dict:
"""
Best-effort cleanup of on-disk waveform / sidecar / pickle / hdf5 files
associated with a deleted event row. Returns a dict of {kind: bool} for
what was actually removed (true) vs not found / failed (false).
"""
serial = row.get("serial")
bw_name = row.get("blastware_filename")
a5_name = row.get("a5_pickle_filename")
sc_name = row.get("sidecar_filename")
removed: dict = {}
if not serial:
return removed
store = _get_store()
# blastware_filename is the "base" — other files derive their paths from it
# via WaveformStore helpers. Sidecar and a5 may also be stored under their
# own column values if they ever diverged historically.
base_name = bw_name or a5_name or sc_name
if base_name:
bw_path, a5_path = store.paths_for(serial, base_name)
sc_path = store.sidecar_path_for(serial, base_name)
h5_path = store.hdf5_path_for(serial, base_name)
for kind, p in [("blastware", bw_path), ("a5_pickle", a5_path),
("sidecar", sc_path), ("hdf5", h5_path)]:
try:
if p.exists():
p.unlink()
removed[kind] = True
except OSError as exc:
log.warning("file cleanup failed for %s (%s): %s", p, kind, exc)
removed[kind] = False
return removed
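# Illustrative sketch (not part of this commit): the best-effort unlink pattern
# used above, run against a temporary directory. `remove_files` is a hypothetical
# name. Here a missing file is recorded explicitly as False, which is one reading
# of the docstring's "not found / failed (false)"; the function above may instead
# leave such kinds out of the dict.

```python
import tempfile
from pathlib import Path


def remove_files(paths: dict) -> dict:
    """Unlink each path if present; report True only for files actually removed."""
    removed: dict = {}
    for kind, p in paths.items():
        try:
            if p.exists():
                p.unlink()
                removed[kind] = True
            else:
                removed[kind] = False
        except OSError:
            # Permission errors and similar: report failure, keep going.
            removed[kind] = False
    return removed


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "event.bin"
    present.write_bytes(b"x")
    outcome = remove_files({"blastware": present, "hdf5": Path(tmp) / "missing.h5"})
    still_there = present.exists()
```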
@app.delete("/db/events/{event_id}")
def db_delete_event(event_id: str) -> dict:
"""
Hard-delete a single event from the SFM events table and remove any
associated on-disk waveform/sidecar/pickle/hdf5 files.
Returns 404 if the event_id is not found.
"""
log.info("DELETE /db/events/%s", event_id)
deleted = _get_db().delete_event(event_id)
if deleted is None:
raise HTTPException(status_code=404, detail=f"Event {event_id} not found")
files_removed = _cleanup_event_files(deleted)
return {
"status": "ok",
"event_id": event_id,
"files_removed": files_removed,
}
class BulkDeleteBody(BaseModel):
"""Body for POST /db/events/delete_bulk."""
serial: Optional[str] = None
from_dt: Optional[str] = None # ISO-8601
to_dt: Optional[str] = None # ISO-8601
false_trigger: Optional[bool] = None
ids: Optional[list[str]] = None
confirm: bool = False
# Safety: when no `ids` are supplied, this caps how many rows a
# filter-based delete may remove; if the matched count exceeds it, the
# endpoint returns a dry-run-style summary instead. Pass None to disable.
max_rows: Optional[int] = 10000
@app.post("/db/events/delete_bulk")
def db_delete_events_bulk(body: BulkDeleteBody) -> dict:
"""
Hard-delete multiple events at once, by filter and/or by id list.
Filters (`serial`, `from_dt`, `to_dt`, `false_trigger`) combine with AND,
matching the same semantics as `GET /db/events`. `ids` is an additional
inclusion list. At least one filter or a non-empty `ids` list MUST be
supplied; the endpoint refuses to wipe the whole table.
Safety knobs:
- `confirm` MUST be `true` to actually delete. When false (default),
returns the match count without deleting (dry-run).
- `max_rows` (default 10,000) caps how many rows can be deleted in one
call by-filter; if the match count exceeds it, the endpoint returns
a count summary without deleting. Ignored whenever `ids` is supplied.
Returns:
{
"status": "ok" | "dry_run" | "too_many",
"matched": <int>,
"deleted": <int>, # 0 unless status == "ok"
"files_removed": <int>, # total file unlink successes
"sample_serials": [...], # up to 5 distinct serials touched
}
"""
log.info(
"POST /db/events/delete_bulk serial=%s from=%s to=%s ft=%s ids=%d confirm=%s max=%s",
body.serial, body.from_dt, body.to_dt, body.false_trigger,
len(body.ids or []), body.confirm, body.max_rows,
)
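    try:
        from_parsed = datetime.datetime.fromisoformat(body.from_dt) if body.from_dt else None
        to_parsed = datetime.datetime.fromisoformat(body.to_dt) if body.to_dt else None
    except ValueError as exc:
        # A malformed ISO-8601 string is a client error, not a server fault.
        raise HTTPException(status_code=422, detail=f"Invalid from_dt/to_dt: {exc}") from exc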
db = _get_db()
# Count matches first; this query feeds both the dry-run response and the
# max_rows check below.
rows = db.query_events(
serial=body.serial,
from_dt=from_parsed,
to_dt=to_parsed,
false_trigger=body.false_trigger,
limit=1_000_000, # we want a true count, not a page
offset=0,
)
if body.ids:
id_set = set(body.ids)
rows = [r for r in rows if r["id"] in id_set]
matched = len(rows)
sample_serials = sorted({r.get("serial") for r in rows[:50] if r.get("serial")})[:5]
if not body.confirm:
return {
"status": "dry_run",
"matched": matched,
"deleted": 0,
"files_removed": 0,
"sample_serials": sample_serials,
"hint": "Set confirm=true in the request body to actually delete.",
}
if body.max_rows is not None and not body.ids and matched > body.max_rows:
return {
"status": "too_many",
"matched": matched,
"deleted": 0,
"files_removed": 0,
"sample_serials": sample_serials,
"hint": (
f"Matched {matched} > max_rows={body.max_rows}. Either raise "
f"max_rows in the body, narrow the filter, or supply an "
f"explicit `ids` list."
),
}
try:
deleted_rows = db.delete_events_bulk(
serial=body.serial,
from_dt=from_parsed,
to_dt=to_parsed,
false_trigger=body.false_trigger,
ids=body.ids,
)
except ValueError as exc:
raise HTTPException(status_code=422, detail=str(exc)) from exc
files_removed = 0
for row in deleted_rows:
result = _cleanup_event_files(row)
files_removed += sum(1 for ok in result.values() if ok)
return {
"status": "ok",
"matched": matched,
"deleted": len(deleted_rows),
"files_removed": files_removed,
"sample_serials": sample_serials,
}
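# Illustrative sketch (not part of this commit): the order in which the safety
# knobs above are evaluated, reduced to a pure function. `bulk_delete_status` is a
# hypothetical name; the three status strings and the conditions are taken from
# the endpoint body.

```python
from typing import Optional


def bulk_delete_status(
    matched: int,
    confirm: bool,
    has_ids: bool,
    max_rows: Optional[int] = 10000,
) -> str:
    """Return "dry_run", "too_many", or "ok" for a bulk-delete request."""
    if not confirm:
        return "dry_run"  # confirm is checked first, so a dry run never hits the cap
    if max_rows is not None and not has_ids and matched > max_rows:
        return "too_many"
    return "ok"


preview = bulk_delete_status(matched=25000, confirm=False, has_ids=False)
capped = bulk_delete_status(matched=25000, confirm=True, has_ids=False)
by_ids = bulk_delete_status(matched=25000, confirm=True, has_ids=True)
uncapped = bulk_delete_status(matched=25000, confirm=True, has_ids=False, max_rows=None)
```

# A typical caller would send the request once with confirm=false, inspect
# `matched` and `sample_serials`, and only then resend with confirm=true.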
# ── /db/events/{id} — waveform file accessors ─────────────────────────────────
#
# These endpoints serve files from the persistent WaveformStore, so a Blastware