merge: update to 0.17.0 #21
@@ -0,0 +1,255 @@
|
||||
# Runbook — Recovering a wedged unit stuck in a call-home loop
|
||||
|
||||
**Original incident:** BE9558H at `166.246.130.1:9034`, recovered 2026-05-17.
|
||||
|
||||
A field unit with a stuck-triggered geophone (or any hardware fault causing
|
||||
constant event triggering) will record events back-to-back, and if Auto Call
|
||||
Home is set to "After Event Recorded" the device will dial the office BW
|
||||
ACH server in a tight loop. Combined with a Sierra Wireless modem in
|
||||
bidirectional serial-TCP mode, this makes the unit effectively unreachable
|
||||
from SFM — every TCP connection we open gets killed when the modem flips
|
||||
from server-mode to client-mode to honor the device's next AT dial command.
|
||||
|
||||
This runbook describes how to break the loop and recover control.
|
||||
|
||||
---
|
||||
|
||||
## Symptoms
|
||||
|
||||
- Terra-View / SFM `/device/info` either hangs or fails on `count_events()`.
|
||||
- `/device/monitor/status` and `/device/rescue` return 502 (protocol timeout
|
||||
waiting for POLL response) or 503 (TCP connect refused).
|
||||
- ACEmanager serial log shows repeating
|
||||
`Connect to IP: <BW_IP> Port: <BW_PORT>` → `Shutdown TCP socket` cycles
|
||||
every 30-60 seconds.
|
||||
- Spam-mode endpoints (`/device/stop_monitoring_spam`) report many
|
||||
`sent_ok` but the device's monitoring state never changes.
|
||||
- `slow_drip` reports `[Errno 32] Broken pipe` after sending the preamble
|
||||
but before completing the drip loop.
|
||||
|
||||
If you see *all* of these, the unit is in this exact failure mode.
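
A minimal sketch for checking a saved ACEmanager serial log for the dial cycle described in the symptoms above. It assumes the log was exported as plain text with one entry per line and that the two quoted phrases appear verbatim; the function names and the three-cycle threshold are illustrative and not part of SFM.

```python
def count_dial_cycles(lines):
    """Count 'Connect to IP' entries that are later closed by 'Shutdown TCP socket'."""
    cycles = 0
    dialing = False
    for line in lines:
        if "Connect to IP:" in line:
            dialing = True
        elif "Shutdown TCP socket" in line and dialing:
            cycles += 1
            dialing = False
    return cycles


def looks_like_call_home_loop(lines, min_cycles=3):
    """Hypothetical threshold: three or more complete cycles in the excerpt."""
    return count_dial_cycles(lines) >= min_cycles


# Sample excerpt built from the phrases and address quoted in this runbook.
sample = [
    "Connect to IP: 50.197.32.92 Port: 12345",
    "Shutdown TCP socket",
] * 3
print(count_dial_cycles(sample), looks_like_call_home_loop(sample))
```

A matching log only covers one symptom; the other checks in the list still apply.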
|
||||
|
||||
---
|
||||
|
||||
## Quick reference — how to recover
|
||||
|
||||
You need **ACEmanager access** to the unit's modem.
|
||||
|
||||
### Step 1: stop the modem's mode-flipping
|
||||
|
||||
In ACEmanager → **Serial → Port Configuration**:
|
||||
|
||||
| Field | Set to |
|---|---|
| **Destination Address** | clear (blank) |
| **Destination Port** | `0` |
|
||||
|
||||
Click **Apply**. This removes the modem's auto-dial-out target. The device's
|
||||
AT dial commands now error back at the modem instead of triggering a
|
||||
mode-flip, so the modem stays in TCP-server mode permanently and our inbound
|
||||
TCP sessions stay alive.
|
||||
|
||||
*(Optional belt-and-suspenders: also add the BW server's port to
|
||||
**Security → Port Filtering - Outbound** as a blocked port, with
|
||||
Outbound Port Filtering Mode = Blocked Ports.)*
|
||||
|
||||
### Step 2: stop monitoring on the device (slow drip)
|
||||
|
||||
From the SFM host:
|
||||
|
||||
```bash
|
||||
/home/serversdown/seismo-relay/scripts/slow_drip.sh <DEVICE_IP> <PORT>
|
||||
```
|
||||
|
||||
Defaults are 120s duration with a drip every 3s. Watch the response:
|
||||
|
||||
- `duration_s ≈ 120` and `drips_sent ≈ 40` → session held the full duration ✓
|
||||
- `bytes_received > 0` → device is responding ✓ (this is the success signal)
|
||||
|
||||
If `duration_s` is small or the response contains `send_error: "Broken pipe"`,
Step 1 didn't take hold. Re-check ACEmanager; the modem may need a reboot after Apply.
|
||||
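
A minimal sketch of how the Step 2 response could be checked programmatically. It uses the field names quoted above (`duration_s`, `drips_sent`, `bytes_received`, `send_error`); the 90% tolerance, the verdict strings, and the assumption that `send_error` is absent or null on success are illustrative guesses, not the endpoint's documented contract.

```python
import json


def assess_slow_drip(response, duration_s=120, interval_s=3):
    """Classify a slow-drip response dict as ok / held_but_silent / session_dropped."""
    expected_drips = duration_s // interval_s  # 120 // 3 == 40 with the defaults
    held = (
        response.get("send_error") is None
        and response.get("duration_s", 0) >= 0.9 * duration_s
        and response.get("drips_sent", 0) >= 0.9 * expected_drips
    )
    responding = response.get("bytes_received", 0) > 0
    if held and responding:
        return "ok"
    if held:
        return "held_but_silent"
    return "session_dropped"


# Numbers modelled on the values mentioned in this runbook.
good = json.loads('{"duration_s": 120.1, "drips_sent": 40, "bytes_received": 990}')
bad = {"duration_s": 3.0, "drips_sent": 1, "bytes_received": 0,
       "send_error": "[Errno 32] Broken pipe"}
print(assess_slow_drip(good))
print(assess_slow_drip(bad))
```

A `session_dropped` verdict points back at Step 1; `held_but_silent` would mean the modem is holding the session but the device has not answered yet.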
|
||||
### Step 3: confirm monitoring stopped
|
||||
|
||||
```bash
|
||||
curl 'http://localhost:8200/device/monitor/status?host=<DEVICE_IP>&tcp_port=<PORT>&force=true'
|
||||
# expect: {"is_monitoring": false, ...}
|
||||
```
|
||||
|
||||
### Step 4: disable ACH at the device level + erase corrupted events
|
||||
|
||||
Either fire the rescue endpoint:
|
||||
|
||||
```bash
|
||||
/home/serversdown/seismo-relay/scripts/rescue_device.sh <DEVICE_IP> <PORT>
|
||||
```
|
||||
|
||||
Or do the two steps manually:
|
||||
|
||||
```bash
|
||||
# Disable ACH in the device's compliance config
|
||||
curl -X POST 'http://localhost:8200/device/call_home?host=<DEVICE_IP>&tcp_port=<PORT>' \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{"auto_call_home_enabled": false}'
|
||||
|
||||
# Erase corrupted event chain
|
||||
curl -X POST 'http://localhost:8200/device/events/erase?host=<DEVICE_IP>&tcp_port=<PORT>'
|
||||
```
|
||||
|
||||
You can also do this via the SFM standalone UI → **Call Home** tab → set
|
||||
`Enable Auto Call Home` to `Disabled` → **Write to Device**.
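
The two manual requests can also be assembled from Python. This is a sketch only: it builds the requests with the standard library and does not send them, and the helper names `sfm_request` and `manual_rescue_requests` are made up for illustration. The paths, query parameters, and JSON body are the ones shown in the curl commands above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

SFM_BASE_URL = "http://localhost:8200"  # same default as the scripts


def sfm_request(path, host, tcp_port, body=None, base=SFM_BASE_URL):
    """Build (but do not send) a POST request against an SFM endpoint."""
    query = urlencode({"host": host, "tcp_port": tcp_port})
    data = None
    headers = {}
    if body is not None:
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(f"{base}{path}?{query}", data=data, headers=headers, method="POST")


def manual_rescue_requests(host, tcp_port=9034):
    """Disable ACH first, then erase, in the same order as the manual steps."""
    return [
        sfm_request("/device/call_home", host, tcp_port,
                    {"auto_call_home_enabled": False}),
        sfm_request("/device/events/erase", host, tcp_port),
    ]


requests_to_send = manual_rescue_requests("166.246.130.1")
for req in requests_to_send:
    print(req.get_method(), req.full_url)
```

Each request could then be passed to `urllib.request.urlopen(req, timeout=...)`; the timeout to use is left open here because the erase duration is not stated in this runbook.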
|
||||
|
||||
### Step 5: restore modem config (housekeeping)
|
||||
|
||||
Once the device-side ACH is disabled, restore the modem's Destination
|
||||
Address and Port to the original values (e.g. `50.197.32.92` / `12345`) in
|
||||
ACEmanager. The modem will resume normal bidirectional behavior, but the
|
||||
unit won't issue any dial commands until ACH is explicitly re-enabled on
|
||||
the device.
|
||||
|
||||
### Step 6: do NOT re-enable ACH until the hardware fault is repaired

If ACH is re-enabled on this unit before the underlying hardware fault is
repaired, the call-home loop starts again immediately and you'll be running
this runbook a second time.
|
||||
|
||||
---
|
||||
|
||||
## Why this works — the failure mode explained
|
||||
|
||||
The Sierra Wireless RV50/RV55 serial port operates in one of two TCP modes
|
||||
at any moment:
|
||||
|
||||
- **Server mode** — listens on `Device Port` (e.g. 9034), bridges inbound
|
||||
TCP to the device's serial port. This is what we need to interact with
|
||||
the device.
|
||||
- **Client mode** — when the device sends an AT dial command on its serial
|
||||
TX line, the modem opens an outbound TCP to `Destination Address:Port`
|
||||
and bridges that to serial.
|
||||
|
||||
A serial port in this configuration is **bidirectional**: the modem flips
|
||||
between server and client modes on demand. When the device's firmware is
|
||||
healthy and only dials occasionally, this works fine.
|
||||
|
||||
When the unit is constantly triggering events and ACH is set to "After
|
||||
Event Recorded", the device sends an AT dial command every few seconds.
|
||||
Each one causes the modem to:
|
||||
|
||||
1. Drop any active inbound TCP session
|
||||
2. Flip to client mode
|
||||
3. Attempt outbound TCP to `Destination Address:Port`
|
||||
4. Hang for up to a minute waiting for it to succeed/fail
|
||||
5. Drop back to server mode
|
||||
|
||||
**During the entire hang, no inbound TCP can establish.** Even between
|
||||
hangs, the modem closes any existing inbound session before flipping. So
|
||||
any tool that needs more than a few seconds of held TCP (e.g. POLL +
|
||||
config read + write) gets repeatedly kicked off.
|
||||
|
||||
Clearing `Destination Address` removes steps 3-4 from the cycle: the modem
|
||||
has nowhere to dial, so it doesn't flip modes when it receives an AT dial
|
||||
command. The serial port effectively becomes server-only, and inbound TCP
|
||||
sessions can stay open as long as needed.
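
The timing argument above can be sketched as a toy model. It assumes one dial attempt every `dial_interval_s` seconds, each blocking inbound TCP for `hang_s` seconds; both names and the sample numbers are illustrative, not measured values.

```python
def inbound_window_s(dial_interval_s, hang_s):
    """Longest stretch in which an inbound session can stay up between dials."""
    return max(0.0, float(dial_interval_s) - float(hang_s))


def can_complete(needed_s, dial_interval_s, hang_s):
    """True if an operation needing `needed_s` of held TCP fits in the window."""
    return inbound_window_s(dial_interval_s, hang_s) >= needed_s


# Loop active: a dial every 30s with a 25s hang leaves only a short window.
print(inbound_window_s(30, 25))
# A 120s slow-drip session cannot fit in that window.
print(can_complete(120, 30, 25))
# Destination cleared: modelled as "never dials", so the window is unbounded.
print(can_complete(120, float("inf"), 0))
```

The model ignores jitter and partial overlaps; it only illustrates why a held session is impossible while the modem keeps flipping and trivial once it stops.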
|
||||
|
||||
**This is a modem-layer issue, not a device firmware issue.** The device
|
||||
is alive and responsive the whole time — confirmed in the BE9558H
|
||||
recovery by 990 bytes of S3 responses received over a 120s slow-drip
|
||||
session once the modem was no longer mode-flipping.
|
||||
|
||||
---
|
||||
|
||||
## Why simpler approaches don't work
|
||||
|
||||
| Approach | Why it fails |
|---|---|
| Standard `/device/info` | Triggers `count_events()` 1E/1F walk, takes 90s+ and hits corrupted event chain in this scenario |
| `/device/rescue` race loop | Gets 502 (protocol timeout) because the modem closes the TCP before the POLL handshake can complete |
| `/device/stop_monitoring_blind` (single frame) | Even if the bytes leave the wire, the device's protocol parser ignores write commands without a preceding POLL handshake (early-version bug, now fixed by including POLL preamble in blind sends) |
| `/device/stop_monitoring_spam` (sub-second cadence) | Each session is killed by the modem's mode-flip before the device can drain its UART RX buffer; high-rate spam also risks UART FIFO overrun on the device side |
| Outbound port firewall block alone | Stops the outbound TCP from succeeding, but doesn't stop the modem from *trying* and mode-flipping. Reduces but doesn't eliminate the contention. |
| Modem reboot | Temporary: as soon as the device starts triggering again, the loop resumes within seconds |
|
||||
|
||||
The combination of `slow_drip` + cleared `Destination Address` works because:
|
||||
|
||||
1. The modem stops mode-flipping → TCP session stays open for the full
|
||||
drip duration
|
||||
2. Slow drip rate → device's UART RX FIFO never overflows even if
|
||||
firmware is busy with event recording
|
||||
3. The drip is `SESSION_RESET + STOP_MONITORING` every 3s → many
|
||||
independent chances for the parser to land one valid frame
|
||||
4. Once one Stop Monitoring is parsed, event recording halts → firmware
|
||||
has CPU to spare → subsequent operations are trivially easy
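
The drip loop described in the list above can be sketched as follows. This is not the code behind `/device/stop_monitoring_slow_drip`: the payload bytes are placeholders (the real frames live in `minimateplus.framing`), and the function name, the injected `sleep`/`clock` hooks, and the result keys other than those quoted in this runbook are assumptions.

```python
import socket
import time

# Placeholder payloads. The real byte sequences are NOT reproduced here;
# these values are hypothetical stand-ins.
SESSION_RESET = b"<session-reset>"
STOP_MONITORING = b"<stop-monitoring>"


def slow_drip(sock, duration_s=120.0, interval_s=3.0,
              sleep=time.sleep, clock=time.monotonic):
    """Send one reset+stop pair per interval on an already-connected socket."""
    result = {"drips_sent": 0, "bytes_received": 0, "send_error": None}
    sock.settimeout(0.05)  # short timeout so recv never stalls the loop
    started = clock()
    while clock() - started < duration_s:
        try:
            sock.sendall(SESSION_RESET + STOP_MONITORING)
            result["drips_sent"] += 1
            chunk = sock.recv(4096)
            if chunk == b"":
                result["send_error"] = "peer closed the session"
                break
            result["bytes_received"] += len(chunk)
        except socket.timeout:
            pass  # nothing to read yet; keep the session open
        except OSError as exc:  # e.g. [Errno 32] Broken pipe
            result["send_error"] = str(exc)
            break
        sleep(interval_s)
    result["duration_s"] = clock() - started
    return result


# Demo against a local socket pair with a fake clock, so it runs instantly.
ours, device = socket.socketpair()
device.sendall(b"S3-reply")  # pretend the device answered once
now = [0.0]
demo = slow_drip(ours, duration_s=9, interval_s=3,
                 sleep=lambda s: now.__setitem__(0, now[0] + s),
                 clock=lambda: now[0])
print(demo)
ours.close()
device.close()
```

The key design point is that the socket is opened once and reused: every drip is an independent chance for the parser, but none of them costs a new TCP handshake.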
|
||||
|
||||
---
|
||||
|
||||
## Tooling reference
|
||||
|
||||
All endpoints live in `seismo-relay/sfm/server.py`. All scripts live in
|
||||
`seismo-relay/scripts/` and default to SFM direct (`http://localhost:8200`),
|
||||
overridable via `SFM_BASE_URL`.
|
||||
|
||||
### Endpoints added during BE9558H recovery
|
||||
|
||||
| Endpoint | Purpose |
|---|---|
| `GET /device/events/storage_range` | SUB 0x06: first/last event keys, `is_empty` flag. ~2s, no event walk. |
| `GET /device/events/index` | SUB 0x08: lifetime event counter (does NOT decrement on erase). ~2s. |
| `POST /device/events/erase` | Full erase sequence 0xA3 → 0x1C → 0x06 → 0xA2. |
| `POST /device/rescue` | Disable ACH + erase in one TCP session. Short timeouts for race-loop usage. |
| `POST /device/stop_monitoring_blind` | Fire-and-forget Stop with full POLL preamble (single attempt). |
| `POST /device/stop_monitoring_spam` | Server-side tight retry loop, sub-second cadence, duration-bounded. |
| `POST /device/stop_monitoring_slow_drip` | One held TCP session, slow trickle of stop frames. **The endpoint that saved BE9558H.** |
|
||||
|
||||
Also changed: default protocol recv timeout dropped from 30s → 10s in
|
||||
`_build_client`. Added `connect_timeout` knob to same. Cleaned up
|
||||
unhandled-exception path in `/device/monitor/status` so it returns 502
|
||||
instead of 500 on protocol timeouts.
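
A minimal sketch of the status mapping that the `/device/monitor/status` cleanup describes: protocol and connection failures become 502, anything unexpected stays 500. The exception classes here are stand-ins defined locally (the real `ProtocolError` comes from `minimateplus.protocol`), and `status_for` is a hypothetical helper name. How 503 responses are produced is not covered by this paragraph, so they are left out.

```python
class ProtocolError(Exception):
    """Stand-in for minimateplus.protocol.ProtocolError."""


class DeviceTimeout(ProtocolError):
    """Stand-in for the protocol-level timeout ("device unresponsive")."""


def status_for(exc):
    """Map an exception raised while talking to the device to an HTTP status."""
    if isinstance(exc, (ProtocolError, OSError)):
        return 502  # upstream device or link failed, not an SFM bug
    return 500      # genuinely unexpected error


for exc in (DeviceTimeout("no POLL response"), BrokenPipeError(32, "Broken pipe"),
            ValueError("unexpected")):
    print(type(exc).__name__, status_for(exc))
```

Returning 502 for timeouts lets retry wrappers tell a wedged unit apart from a server-side fault.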
|
||||
|
||||
### Scripts
|
||||
|
||||
| Script | Purpose |
|---|---|
| `scripts/rescue_device.sh` | Race-loop wrapper around `/device/rescue` |
| `scripts/blind_stop.sh` | Race-loop wrapper around `/device/stop_monitoring_blind` |
| `scripts/spam_stop.sh` | Single-call burst hammer (`/device/stop_monitoring_spam`) |
| `scripts/slow_drip.sh` | Single-call held-session drip (`/device/stop_monitoring_slow_drip`) |
| `scripts/watch_unit.sh` | Passive periodic reachability check, logs to file |
|
||||
|
||||
---
|
||||
|
||||
## Incident log — BE9558H, 2026-05-16/17
|
||||
|
||||
What was wrong: Long-axis geophone developed an offset, constantly above
|
||||
trigger threshold → constant event recording → after-event ACH set →
|
||||
modem dialing office BW server (`50.197.32.92:12345`) every 30-60s.
|
||||
Local event chain corrupted (`next_boundary 0x100EE exceeds uint16`).
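
For readers unfamiliar with the corruption message: `0x100EE` is 65774, which does not fit in an unsigned 16-bit field (maximum `0xFFFF`, 65535). A hedged sketch of that kind of bounds check follows; the function name is hypothetical and this is not the parser's actual code.

```python
UINT16_MAX = 0xFFFF


def check_next_boundary(next_boundary):
    """Reject boundary offsets that cannot be stored in a uint16."""
    if not 0 <= next_boundary <= UINT16_MAX:
        raise ValueError(f"next_boundary 0x{next_boundary:X} exceeds uint16")
    return next_boundary


try:
    check_next_boundary(0x100EE)
except ValueError as exc:
    message = str(exc)
    print(message)
```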
|
||||
|
||||
Diagnostic path:
|
||||
|
||||
1. `/device/info` slow, choked on event walk
|
||||
2. Built lightweight probe endpoints (`storage_range`, `index`) — useful
|
||||
but didn't reach the wedged unit
|
||||
3. Built `/device/rescue` with short timeouts — got 502 (POLL no response)
|
||||
4. Built `/device/stop_monitoring_blind` — first version was a false
|
||||
positive (no POLL preamble); fixed by including
|
||||
`SESSION_RESET+POLL_PROBE+SESSION_RESET+POLL_DATA` in the dump
|
||||
5. Verified blind stop works on bench unit
|
||||
6. Built `/device/stop_monitoring_spam` — 420 successful sends over
|
||||
5 min, zero behavior change on field unit
|
||||
7. Inspected ACEmanager logs → saw outbound dial-out attempts every ~30s,
|
||||
confirmed device was not fully locked up
|
||||
8. Added outbound port-12345 firewall block → outbound attempts now fail
|
||||
instantly but contention persisted
|
||||
9. Built `/device/stop_monitoring_slow_drip` — session died at 3s with
|
||||
broken pipe (modem closing on us)
|
||||
10. Looked at full ACEmanager Port Configuration → **found
|
||||
`Destination Address: 50.197.32.92` configured**, realized every AT
|
||||
dial command was triggering a modem mode-flip that killed our inbound TCP session
|
||||
11. Cleared Destination Address + Port → slow_drip held 120s, device
|
||||
responded with 990 bytes, 39 stop commands acked
|
||||
12. Disabled ACH at device level via `/device/call_home`, erased events
|
||||
|
||||
Final state: device IDLE, memory 958.1 / 960 KB free, ACH disabled at
|
||||
device level, modem destination cleared (to be restored after physical
|
||||
service).
|
||||
|
||||
Total time from the first "I was wondering if it's possible to" attempt to
recovery: ~7 hours of intermittent debugging across one evening.
|
||||
Executable
+100
@@ -0,0 +1,100 @@
|
||||
#!/usr/bin/env bash
|
||||
# Fire-and-forget Stop Monitoring loop — for wedged or constantly-triggering units.
|
||||
#
|
||||
# Hammers POST /device/stop_monitoring_blind in a tight loop. The endpoint
|
||||
# opens TCP, dumps SESSION_RESET + a few copies of the SUB 0x97 frame, and
|
||||
# closes — without ever reading an S3 response. Each TCP-won attempt is
|
||||
# ~50ms of wire activity instead of the multi-frame handshake the regular
|
||||
# rescue endpoint does, so windows that are too small for the full rescue
|
||||
# can still land a stop-monitoring command.
|
||||
#
|
||||
# Usage:
|
||||
# ./blind_stop.sh <host> [tcp_port]
|
||||
#
|
||||
# Env:
|
||||
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
|
||||
# Set to http://localhost:8001/api/sfm to route through
|
||||
# Terra-View's proxy.
|
||||
# MAX_ATTEMPTS Default: 600
|
||||
# SLEEP_S Default: 0 (no backoff — hammer it)
|
||||
# MAX_TIME_S Default: 15
|
||||
# CONNECT_TIMEOUT Default: 5
|
||||
# REPEAT Frames per TCP session (default 3 — increases hit rate
|
||||
# if the device is busy reading its own buffer).
|
||||
# STOP_ON_OK Default: 1. Set to 0 to keep hammering indefinitely
|
||||
# even after successful sends (every 503 means the device
|
||||
# is in *another* session, every 200 means our bytes got
|
||||
# through — but the device may not have processed them).
|
||||
|
||||
set -u
|
||||
|
||||
host="${1:-}"
|
||||
tcp_port="${2:-9034}"
|
||||
if [[ -z "$host" ]]; then
|
||||
echo "usage: $0 <host> [tcp_port]" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||
max_attempts="${MAX_ATTEMPTS:-600}"
|
||||
sleep_s="${SLEEP_S:-0}"
|
||||
max_time_s="${MAX_TIME_S:-15}"
|
||||
connect_timeout="${CONNECT_TIMEOUT:-5}"
|
||||
repeat="${REPEAT:-3}"
|
||||
stop_on_ok="${STOP_ON_OK:-1}"
|
||||
|
||||
url="${base}/device/stop_monitoring_blind?host=${host}&tcp_port=${tcp_port}&connect_timeout=${connect_timeout}&repeat=${repeat}"
|
||||
|
||||
echo "blind_stop: target ${host}:${tcp_port} connect_timeout=${connect_timeout}s repeat=${repeat}"
|
||||
echo "blind_stop: POST ${url}"
|
||||
echo "blind_stop: up to ${max_attempts} attempts, ${sleep_s}s between, ${max_time_s}s per request"
|
||||
echo "blind_stop: stop_on_ok=${stop_on_ok}"
|
||||
echo
|
||||
|
||||
ok_count=0
|
||||
busy_count=0
|
||||
err_count=0
|
||||
started=$(date +%s)
|
||||
|
||||
for ((i=1; i<=max_attempts; i++)); do
|
||||
printf "[%4d] %s " "$i" "$(date +%H:%M:%S)"
|
||||
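# curl -w already prints "000" when the transfer fails, so assign the
# fallback instead of appending a second "000" to the captured output.
http_code=$(curl -sS -o /tmp/blind_resp.$$ -w "%{http_code}" \
    --max-time "$max_time_s" \
    -X POST "$url") || http_code="000"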
|
||||
body=$(cat /tmp/blind_resp.$$ 2>/dev/null || true)
|
||||
rm -f /tmp/blind_resp.$$
|
||||
|
||||
case "$http_code" in
|
||||
200|201)
|
||||
ok_count=$((ok_count + 1))
|
||||
echo "SENT $body"
|
||||
if [[ "$stop_on_ok" == "1" ]]; then
|
||||
elapsed=$(( $(date +%s) - started ))
|
||||
echo
|
||||
echo "blind_stop: success after ${i} attempts (${elapsed}s). ok=${ok_count} busy=${busy_count} err=${err_count}"
|
||||
echo "blind_stop: NEXT — wait ~10s, then try the full rescue:"
|
||||
echo " /home/serversdown/seismo-relay/scripts/rescue_device.sh ${host} ${tcp_port}"
|
||||
exit 0
|
||||
fi
|
||||
;;
|
||||
503)
|
||||
busy_count=$((busy_count + 1))
|
||||
echo "busy (503)"
|
||||
;;
|
||||
000)
|
||||
err_count=$((err_count + 1))
|
||||
echo "curl error"
|
||||
;;
|
||||
*)
|
||||
err_count=$((err_count + 1))
|
||||
echo "HTTP $http_code $body" | head -c 400
|
||||
echo
|
||||
;;
|
||||
esac
|
||||
[[ "$sleep_s" != "0" ]] && sleep "$sleep_s"
|
||||
done
|
||||
|
||||
elapsed=$(( $(date +%s) - started ))
|
||||
echo
|
||||
echo "blind_stop: gave up after ${max_attempts} attempts (${elapsed}s). ok=${ok_count} busy=${busy_count} err=${err_count}" >&2
|
||||
exit 1
|
||||
Executable
+99
@@ -0,0 +1,99 @@
|
||||
#!/usr/bin/env bash
|
||||
# Rescue an uncooperative MiniMate that's busy with another ACH session.
|
||||
#
|
||||
# Hammers POST /device/rescue in a tight loop with a short timeout. When the
|
||||
# device is in an ACH session our SYN either gets refused or silently dropped
|
||||
# (5s connect timeout inside the endpoint) and we retry immediately. When the
|
||||
# device is between sessions, our TCP wins, the endpoint disables Auto Call
|
||||
# Home and erases events inside the same session, then returns success.
|
||||
#
|
||||
# Usage:
|
||||
# ./rescue_device.sh <host> [tcp_port] [--no-erase] [--no-disable-ach]
|
||||
#
|
||||
# Examples:
|
||||
# ./rescue_device.sh 166.246.130.1 9034
|
||||
# ./rescue_device.sh 166.246.130.1 9034 --no-erase # just silence it
|
||||
#
|
||||
# Environment:
|
||||
# SFM_BASE_URL Defaults to http://localhost:8200 (SFM direct).
|
||||
# Set to http://localhost:8001/api/sfm to route through
|
||||
# Terra-View's proxy. Direct mode avoids the proxy's
|
||||
# 60s timeout, which matters for long-running endpoints.
|
||||
# MAX_ATTEMPTS Cap on retries (default 600 ≈ 30+ min).
|
||||
# SLEEP_S Backoff between attempts (default 1).
|
||||
# MAX_TIME_S Per-request timeout (default 60).
|
||||
# CONNECT_TIMEOUT TCP connect timeout (default 5).
|
||||
# RECV_TIMEOUT Per-frame S3 recv timeout (default 5). If POLL or any
|
||||
# subsequent frame doesn't respond within this window, the
|
||||
# rescue endpoint bails and this script retries.
|
||||
|
||||
set -u
|
||||
|
||||
host="${1:-}"
|
||||
tcp_port="${2:-9034}"
|
||||
shift 2 2>/dev/null || shift $# 2>/dev/null
|
||||
|
||||
if [[ -z "$host" ]]; then
|
||||
echo "usage: $0 <host> [tcp_port] [--no-erase] [--no-disable-ach]" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
disable_ach="true"
|
||||
erase="true"
|
||||
for arg in "$@"; do
|
||||
case "$arg" in
|
||||
--no-erase) erase="false" ;;
|
||||
--no-disable-ach) disable_ach="false" ;;
|
||||
*) echo "unknown flag: $arg" >&2; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||
max_attempts="${MAX_ATTEMPTS:-600}"
|
||||
sleep_s="${SLEEP_S:-1}"
|
||||
max_time_s="${MAX_TIME_S:-60}"
|
||||
connect_timeout="${CONNECT_TIMEOUT:-5}"
|
||||
recv_timeout="${RECV_TIMEOUT:-5}"
|
||||
|
||||
url="${base}/device/rescue?host=${host}&tcp_port=${tcp_port}&disable_ach=${disable_ach}&erase=${erase}&connect_timeout=${connect_timeout}&recv_timeout=${recv_timeout}"
|
||||
|
||||
echo "rescue: target ${host}:${tcp_port} disable_ach=${disable_ach} erase=${erase}"
|
||||
echo "rescue: connect_timeout=${connect_timeout}s recv_timeout=${recv_timeout}s"
|
||||
echo "rescue: POST ${url}"
|
||||
echo "rescue: up to ${max_attempts} attempts, ${sleep_s}s between, ${max_time_s}s per request"
|
||||
echo
|
||||
|
||||
started=$(date +%s)
|
||||
for ((i=1; i<=max_attempts; i++)); do
|
||||
printf "[%3d] %s " "$i" "$(date +%H:%M:%S)"
|
||||
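# curl -w already prints "000" when the transfer fails, so assign the
# fallback instead of appending a second "000" to the captured output.
http_code=$(curl -sS -o /tmp/rescue_resp.$$ -w "%{http_code}" \
    --max-time "$max_time_s" \
    -X POST "$url") || http_code="000"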
|
||||
body=$(cat /tmp/rescue_resp.$$ 2>/dev/null || true)
|
||||
rm -f /tmp/rescue_resp.$$
|
||||
|
||||
case "$http_code" in
|
||||
200|201)
|
||||
elapsed=$(( $(date +%s) - started ))
|
||||
echo "OK (${elapsed}s total)"
|
||||
echo "$body"
|
||||
exit 0
|
||||
;;
|
||||
503)
|
||||
# Connection refused / timeout — device busy in another session. Retry fast.
|
||||
echo "busy (503)"
|
||||
;;
|
||||
000)
|
||||
echo "curl error (network)"
|
||||
;;
|
||||
*)
|
||||
echo "HTTP $http_code"
|
||||
echo " $body" | head -c 400
|
||||
echo
|
||||
;;
|
||||
esac
|
||||
sleep "$sleep_s"
|
||||
done
|
||||
|
||||
echo "rescue: gave up after ${max_attempts} attempts" >&2
|
||||
exit 1
|
||||
Executable
+44
@@ -0,0 +1,44 @@
|
||||
#!/usr/bin/env bash
|
||||
# Hold a single TCP session open and drip stop-monitoring frames at a slow
|
||||
# rate, so the device's UART RX FIFO has time to drain between sends.
|
||||
#
|
||||
# Use when high-rate spam isn't landing — typically because the device's
|
||||
# firmware is too busy to drain its serial buffer fast enough and bytes
|
||||
# are being lost to UART overrun.
|
||||
#
|
||||
# Usage:
|
||||
# ./slow_drip.sh <host> [tcp_port] [duration_s]
|
||||
#
|
||||
# Env:
|
||||
# DURATION Default: 120 (seconds; arg 3 overrides). Clamped 1..600.
|
||||
# INTERVAL Seconds between drip sends (default 3). Lower = more
|
||||
# aggressive, more risk of FIFO overrun. Higher = safer
|
||||
# but fewer total drips per duration.
|
||||
# CONNECT_TIMEOUT Default: 5
|
||||
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
|
||||
|
||||
set -u
|
||||
|
||||
host="${1:-}"
|
||||
tcp_port="${2:-9034}"
|
||||
duration="${3:-${DURATION:-120}}"
|
||||
if [[ -z "$host" ]]; then
|
||||
echo "usage: $0 <host> [tcp_port] [duration_s]" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||
interval="${INTERVAL:-3}"
|
||||
connect_timeout="${CONNECT_TIMEOUT:-5}"
|
||||
|
||||
url="${base}/device/stop_monitoring_slow_drip?host=${host}&tcp_port=${tcp_port}&duration_s=${duration}&interval_s=${interval}&connect_timeout=${connect_timeout}"
|
||||
|
||||
echo "slow_drip: target ${host}:${tcp_port} duration=${duration}s interval=${interval}s connect_timeout=${connect_timeout}s"
|
||||
echo "slow_drip: POST ${url}"
|
||||
echo
|
||||
|
||||
# Give curl enough slack to wait out the duration plus a buffer
|
||||
max_time=$(awk -v d="$duration" 'BEGIN { printf "%d", d + 30 }')
|
||||
|
||||
curl -sS --max-time "$max_time" -X POST "$url"
|
||||
echo
|
||||
Executable
+48
@@ -0,0 +1,48 @@
|
||||
#!/usr/bin/env bash
|
||||
# Hammer a device with blind stop-monitoring sessions as fast as possible.
|
||||
# Single HTTP call kicks off the burst inside SFM (no per-attempt HTTP
|
||||
# overhead). Default: 10 seconds, ~500 ms per attempt = ~2 attempts/sec (~20 total).
|
||||
#
|
||||
# Usage:
|
||||
# ./spam_stop.sh <host> [tcp_port] [duration_s]
|
||||
#
|
||||
# Examples:
|
||||
# ./spam_stop.sh 166.246.130.1 # 10s burst
|
||||
# ./spam_stop.sh 166.246.130.1 9034 30 # 30s burst
|
||||
# DURATION=60 CONNECT_TIMEOUT=0.2 ./spam_stop.sh 166.246.130.1
|
||||
#
|
||||
# Env:
|
||||
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
|
||||
# Set to http://localhost:8001/api/sfm to route through
|
||||
# Terra-View's proxy — but note the proxy has a 60s
|
||||
# timeout, so long bursts need direct mode.
|
||||
# DURATION Default: 10 (seconds; arg 3 overrides)
|
||||
# CONNECT_TIMEOUT Default: 0.5 (seconds)
|
||||
# REPEAT Default: 3 (stop frames per TCP session)
|
||||
|
||||
set -u
|
||||
|
||||
host="${1:-}"
|
||||
tcp_port="${2:-9034}"
|
||||
duration="${3:-${DURATION:-10}}"
|
||||
|
||||
if [[ -z "$host" ]]; then
|
||||
echo "usage: $0 <host> [tcp_port] [duration_s]" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||
connect_timeout="${CONNECT_TIMEOUT:-0.5}"
|
||||
repeat="${REPEAT:-3}"
|
||||
|
||||
url="${base}/device/stop_monitoring_spam?host=${host}&tcp_port=${tcp_port}&duration_s=${duration}&connect_timeout=${connect_timeout}&repeat=${repeat}"
|
||||
|
||||
echo "spam_stop: target ${host}:${tcp_port} duration=${duration}s connect_timeout=${connect_timeout}s repeat=${repeat}"
|
||||
echo "spam_stop: POST ${url}"
|
||||
echo
|
||||
|
||||
# Give curl enough slack to wait out the duration plus a buffer
|
||||
max_time=$(awk -v d="$duration" 'BEGIN { printf "%d", d + 10 }')
|
||||
|
||||
curl -sS --max-time "$max_time" -X POST "$url"
|
||||
echo
|
||||
Executable
+58
@@ -0,0 +1,58 @@
|
||||
#!/usr/bin/env bash
|
||||
# Passive monitor for a misbehaving unit. Every INTERVAL seconds, attempts
|
||||
# a single short TCP probe + storage_range read and logs the result. Designed
|
||||
# to run unattended for hours/days and tell you when the unit comes back.
|
||||
#
|
||||
# Usage:
|
||||
# ./watch_unit.sh <host> [tcp_port]
|
||||
#
|
||||
# Env:
|
||||
# INTERVAL Seconds between checks (default 300 = 5 min)
|
||||
# LOG_FILE Append results here (default /tmp/watch_<host>.log)
|
||||
# SFM_BASE_URL Default: http://localhost:8200
|
||||
|
||||
set -u
|
||||
|
||||
host="${1:-}"
|
||||
tcp_port="${2:-9034}"
|
||||
if [[ -z "$host" ]]; then
|
||||
echo "usage: $0 <host> [tcp_port]" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
interval="${INTERVAL:-300}"
|
||||
log_file="${LOG_FILE:-/tmp/watch_${host}.log}"
|
||||
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||
|
||||
url="${base}/device/events/storage_range?host=${host}&tcp_port=${tcp_port}"
|
||||
|
||||
echo "watch_unit: target ${host}:${tcp_port} interval=${interval}s log=${log_file}"
|
||||
echo "watch_unit: Ctrl-C to stop"
|
||||
|
||||
while true; do
|
||||
ts=$(date '+%Y-%m-%d %H:%M:%S')
|
||||
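# curl -w already prints "000" when the transfer fails, so assign the
# fallback instead of appending a second "000" to the captured output.
http_code=$(curl -sS -o /tmp/watch_resp.$$ -w "%{http_code}" \
    --max-time 20 "$url") || http_code="000"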
|
||||
body=$(cat /tmp/watch_resp.$$ 2>/dev/null || true)
|
||||
rm -f /tmp/watch_resp.$$
|
||||
|
||||
case "$http_code" in
|
||||
200|201)
|
||||
# Strip the raw_hex for readability
|
||||
summary=$(echo "$body" | sed 's/"raw_hex":"[^"]*",*//; s/,*$//' | head -c 200)
|
||||
echo "$ts REACHABLE $summary" | tee -a "$log_file"
|
||||
;;
|
||||
502|503)
|
||||
err=$(echo "$body" | head -c 150)
|
||||
echo "$ts ERROR_$http_code $err" | tee -a "$log_file"
|
||||
;;
|
||||
000)
|
||||
echo "$ts CURL_FAIL (network/timeout)" | tee -a "$log_file"
|
||||
;;
|
||||
*)
|
||||
echo "$ts HTTP_$http_code $(echo "$body" | head -c 150)" | tee -a "$log_file"
|
||||
;;
|
||||
esac
|
||||
|
||||
sleep "$interval"
|
||||
done
|
||||
@@ -491,6 +491,75 @@ class SeismoDb:
|
||||
)
|
||||
return cur.rowcount > 0
|
||||
|
||||
def delete_event(self, event_id: str) -> Optional[dict]:
|
||||
"""
|
||||
Hard-delete one event row by id. Returns the deleted row (so the
|
||||
caller can clean up any on-disk files referenced by it) or None
|
||||
if no row matched.
|
||||
"""
|
||||
with self._connect() as conn:
|
||||
row = conn.execute(
|
||||
"SELECT * FROM events WHERE id = ?", (event_id,),
|
||||
).fetchone()
|
||||
if row is None:
|
||||
return None
|
||||
conn.execute("DELETE FROM events WHERE id = ?", (event_id,))
|
||||
return dict(row)
|
||||
|
||||
def delete_events_bulk(
|
||||
self,
|
||||
serial: Optional[str] = None,
|
||||
from_dt: Optional[datetime.datetime] = None,
|
||||
to_dt: Optional[datetime.datetime] = None,
|
||||
false_trigger: Optional[bool] = None,
|
||||
ids: Optional[list[str]] = None,
|
||||
) -> list[dict]:
|
||||
"""
|
||||
Hard-delete events matching the given filters. Returns the list
|
||||
of deleted row dicts. Refuses to delete with no filters at all
|
||||
(would wipe the whole table) — raises ValueError.
|
||||
|
||||
Filter semantics match query_events: serial / from_dt / to_dt /
|
||||
false_trigger combine with AND. `ids` is an additional inclusion
|
||||
list (event_id IN (...)); if supplied alongside other filters,
|
||||
only rows matching all conditions are deleted.
|
||||
"""
|
||||
clauses: list[str] = []
|
||||
params: list = []
|
||||
|
||||
if serial:
|
||||
clauses.append("serial = ?")
|
||||
params.append(serial)
|
||||
if from_dt:
|
||||
clauses.append("timestamp >= ?")
|
||||
params.append(from_dt.isoformat())
|
||||
if to_dt:
|
||||
clauses.append("timestamp <= ?")
|
||||
params.append(to_dt.isoformat())
|
||||
if false_trigger is not None:
|
||||
clauses.append("false_trigger = ?")
|
||||
params.append(1 if false_trigger else 0)
|
||||
if ids:
|
||||
placeholders = ",".join("?" * len(ids))
|
||||
clauses.append(f"id IN ({placeholders})")
|
||||
params.extend(ids)
|
||||
|
||||
if not clauses:
|
||||
raise ValueError(
|
||||
"delete_events_bulk refuses to delete with no filters "
|
||||
"(would wipe the entire events table)"
|
||||
)
|
||||
|
||||
where = "WHERE " + " AND ".join(clauses)
|
||||
|
||||
with self._connect() as conn:
|
||||
rows = conn.execute(
|
||||
f"SELECT * FROM events {where}", params,
|
||||
).fetchall()
|
||||
if rows:
|
||||
conn.execute(f"DELETE FROM events {where}", params)
|
||||
return [dict(r) for r in rows]
|
||||
|
||||
def update_event_review(self, event_id: str, review: dict) -> bool:
|
||||
"""
|
||||
Sync derived index columns from a sidecar's `review` block.
|
||||
|
||||
+720
-2
@@ -36,6 +36,7 @@ from __future__ import annotations
|
||||
|
||||
import datetime
|
||||
import logging
|
||||
import socket
|
||||
import sys
|
||||
import tempfile
|
||||
import threading
|
||||
@@ -63,7 +64,9 @@ from minimateplus.protocol import ProtocolError
|
||||
from minimateplus.models import CallHomeConfig, ComplianceConfig, DeviceInfo, Event, PeakValues, ProjectInfo, Timestamp
|
||||
from minimateplus.transport import TcpTransport, DEFAULT_TCP_PORT
|
||||
from minimateplus.blastware_file import write_blastware_file, blastware_filename
|
||||
from minimateplus.client import _decode_a5_metadata_into, _decode_a5_waveform
|
||||
from minimateplus.client import _decode_a5_metadata_into, _decode_a5_waveform, _decode_event_count
|
||||
from minimateplus.framing import build_bw_write_frame, SESSION_RESET, POLL_PROBE, POLL_DATA
|
||||
from minimateplus.protocol import SUB_STOP_MONITORING
|
||||
from sfm import event_hdf5
|
||||
from sfm.cache import SFMCache, get_cache
|
||||
from sfm.database import SeismoDb
|
||||
@@ -268,7 +271,8 @@ def _build_client(
|
||||
baud: int,
|
||||
host: Optional[str],
|
||||
tcp_port: int,
|
||||
timeout: float = 30.0,
|
||||
timeout: float = 10.0,
|
||||
connect_timeout: Optional[float] = None,
|
||||
) -> MiniMateClient:
|
||||
"""
|
||||
Return a MiniMateClient configured for either serial or TCP transport.
|
||||
@@ -276,11 +280,23 @@ def _build_client(
|
||||
TCP takes priority if *host* is supplied; otherwise *port* (serial) is used.
|
||||
Raises HTTPException(422) if neither is provided.
|
||||
|
||||
Default *timeout* is 10s — the device usually responds in well under a
|
||||
second over cellular; 10s leaves comfortable headroom for retransmits
|
||||
while still failing reasonably fast when a unit is wedged.
|
||||
|
||||
Use timeout=120.0 (or higher) for endpoints that perform a full 5A waveform
|
||||
download — a 70-second event at 1024 sps takes 2-3 minutes to transfer over
|
||||
cellular and each individual recv must complete within the timeout window.
|
||||
|
||||
*connect_timeout* (TCP only) overrides the TcpTransport default (10s) for
|
||||
the initial TCP SYN/handshake. Use a small value (e.g. 5s) in rescue/race
|
||||
scenarios where the device is busy in another session and you want to
|
||||
fail fast and retry quickly.
|
||||
"""
|
||||
if host:
|
||||
if connect_timeout is not None:
|
||||
transport = TcpTransport(host, port=tcp_port, connect_timeout=connect_timeout)
|
||||
else:
|
||||
transport = TcpTransport(host, port=tcp_port)
|
||||
log.debug("TCP transport: %s:%d timeout=%.0fs", host, tcp_port, timeout)
|
||||
return MiniMateClient(transport=transport, timeout=timeout)
|
||||
@@ -1095,6 +1111,7 @@ def device_monitor_status(
|
||||
cached["_cached"] = True
|
||||
return cached
|
||||
|
||||
try:
|
||||
with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
|
||||
try:
|
||||
client.poll()
|
||||
@@ -1102,6 +1119,15 @@ def device_monitor_status(
|
||||
log.warning("monitor status poll retry: %s", exc)
|
||||
client.poll()
|
||||
status = client.get_monitor_status()
|
||||
except HTTPException:
|
||||
raise
|
||||
except ProtocolError as exc:
|
||||
# Includes minimateplus.protocol.TimeoutError ("device unresponsive").
|
||||
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
|
||||
except OSError as exc:
|
||||
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
|
||||
except Exception as exc:
|
||||
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
|
||||
|
||||
result: dict = {"is_monitoring": status.is_monitoring}
|
||||
if status.battery_v is not None:
|
||||
@@ -1117,6 +1143,529 @@ def device_monitor_status(
|
||||
return result
|
||||
|
||||
|
||||
@app.get("/device/events/storage_range")
|
||||
def device_events_storage_range(
|
||||
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||
baud: int = Query(38400, description="Serial baud rate"),
|
||||
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
|
||||
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||
) -> dict:
|
||||
"""
|
||||
Read the device's event storage range (SUB 0x06) — first and last
|
||||
stored event keys. POLL handshake + one read; no connect(), no
|
||||
config reads, no event walk. Completes in ~2 seconds.
|
||||
|
||||
Useful for checking whether the device has any stored events
|
||||
without invoking the slow count_events() 1E/1F chain. Both keys =
|
||||
`01110000` means the device is empty.
|
||||
"""
|
||||
log.info("GET /device/events/storage_range host=%s tcp_port=%s", host, tcp_port)
|
||||
try:
|
||||
def _do():
|
||||
with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
|
||||
try:
|
||||
client.poll()
|
||||
except Exception as exc:
|
||||
log.warning("storage_range poll retry: %s", exc)
|
||||
client.poll()
|
||||
proto = client._require_proto()
|
||||
return proto.read_event_storage_range()
|
||||
rng = _run_with_retry(_do, is_tcp=_is_tcp(host))
|
||||
except HTTPException:
|
||||
raise
|
||||
except ProtocolError as exc:
|
||||
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
|
||||
except OSError as exc:
|
||||
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
|
||||
except Exception as exc:
|
||||
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
|
||||
|
||||
data = bytes(rng.data)
|
||||
result: dict = {"raw_len": len(data), "raw_hex": data.hex()}
|
||||
if len(data) >= 8:
|
||||
first_key = data[-8:-4].hex()
|
||||
last_key = data[-4:].hex()
|
||||
result["first_key"] = first_key
|
||||
result["last_key"] = last_key
|
||||
result["is_empty"] = (first_key == "01110000" and last_key == "01110000")
|
||||
return result
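The key extraction at the end of this handler can be exercised offline. The following is a minimal sketch, assuming the layout the handler relies on (first and last event keys in the trailing 8 bytes of the payload). `decode_storage_range` and `EMPTY_KEY` are hypothetical names introduced for illustration, and the sample payloads are made up rather than captured from a device.

```python
EMPTY_KEY = "01110000"  # value the handler above treats as "no stored events"


def decode_storage_range(data: bytes) -> dict:
    """Mirror the handler's slicing: the last 8 bytes hold first_key then last_key."""
    result: dict = {"raw_len": len(data), "raw_hex": data.hex()}
    if len(data) >= 8:
        first_key = data[-8:-4].hex()
        last_key = data[-4:].hex()
        result["first_key"] = first_key
        result["last_key"] = last_key
        result["is_empty"] = first_key == EMPTY_KEY and last_key == EMPTY_KEY
    return result


# Illustrative payloads (invented for this sketch, not real device output).
empty = decode_storage_range(bytes.fromhex("0000" + "01110000" + "01110000"))
populated = decode_storage_range(bytes.fromhex("01110000" + "01110005"))
short = decode_storage_range(b"\x01\x11")
```

Payloads shorter than 8 bytes only get `raw_len` and `raw_hex`, which matches the handler's behaviour of omitting the key fields rather than raising.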
|
||||
|
||||
|
||||
@app.get("/device/events/index")
|
||||
def device_events_index(
|
||||
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||
baud: int = Query(38400, description="Serial baud rate"),
|
||||
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
|
||||
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||
) -> dict:
|
||||
"""
|
||||
Read the device's event index (SUB 0x08) — returns the lifetime
|
||||
event counter at data[10:12] (uint16 BE). POLL handshake + one
|
||||
read; no connect(), no config reads, no event walk. ~2 seconds.
|
||||
|
||||
Note: this is a LIFETIME counter (events ever recorded) — it does
|
||||
NOT decrement when events are erased. After an erase, the device
|
||||
counter resets to 0 only on the next recorded event. For "are
|
||||
there stored events right now?" use /device/events/storage_range
|
||||
instead.
|
||||
"""
|
||||
log.info("GET /device/events/index host=%s tcp_port=%s", host, tcp_port)
|
||||
try:
|
||||
def _do():
|
||||
with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
|
||||
try:
|
||||
client.poll()
|
||||
except Exception as exc:
|
||||
log.warning("event_index poll retry: %s", exc)
|
||||
client.poll()
|
||||
proto = client._require_proto()
|
||||
return proto.read_event_index()
|
||||
idx_raw = _run_with_retry(_do, is_tcp=_is_tcp(host))
|
||||
except HTTPException:
|
||||
raise
|
||||
except ProtocolError as exc:
|
||||
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
|
||||
except OSError as exc:
|
||||
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
|
||||
except Exception as exc:
|
||||
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
|
||||
|
||||
raw = bytes(idx_raw)
|
||||
result: dict = {"raw_len": len(raw), "raw_hex": raw.hex()}
|
||||
try:
|
||||
result["lifetime_count"] = _decode_event_count(raw)
|
||||
except Exception as exc:
|
||||
result["decode_error"] = str(exc)
|
||||
return result
|
||||
|
||||
|
||||
@app.post("/device/events/erase")
|
||||
def device_events_erase(
|
||||
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||
baud: int = Query(38400, description="Serial baud rate"),
|
||||
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
|
||||
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||
) -> dict:
|
||||
"""
|
||||
Erase ALL stored events from the device memory.
|
||||
|
||||
Sequence: SUB 0xA3 → 0x1C → 0x06 → 0xA2 (confirmed 2026-04-11).
|
||||
After this call the unit's event memory is empty and event keys reset
|
||||
to 0x01110000. The device returns to its normal operating state
|
||||
automatically — no restart-monitoring call is needed.
|
||||
|
||||
Note: this endpoint does NOT touch the ACH server's `ach_state.json`.
|
||||
If a call-home subsequently lands on the ACH server, its post-erase
|
||||
detection logic (max(device_keys) vs max_downloaded_key) handles the
|
||||
key-counter rollback.
|
||||
"""
|
||||
log.info("POST /device/events/erase port=%s host=%s tcp_port=%s", port, host, tcp_port)
|
||||
|
||||
try:
|
||||
def _do():
|
||||
with _build_client(port, baud, host, tcp_port) as client:
|
||||
client.connect()
|
||||
client.delete_all_events()
|
||||
_run_with_retry(_do, is_tcp=_is_tcp(host))
|
||||
except HTTPException:
|
||||
raise
|
||||
except ProtocolError as exc:
|
||||
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
|
||||
except OSError as exc:
|
||||
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
|
||||
except Exception as exc:
|
||||
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
|
||||
|
||||
conn_key = SFMCache.make_conn_key(host, tcp_port, port, baud)
|
||||
cleared = get_cache().clear_device(conn_key)
|
||||
return {
|
||||
"status": "ok",
|
||||
"message": "Device event memory cleared",
|
||||
"cache_cleared": cleared,
|
||||
}
|
||||
|
||||
|
||||
@app.post("/device/stop_monitoring_blind")
|
||||
def device_stop_monitoring_blind(
|
||||
host: str = Query(..., description="TCP host — modem IP"),
|
||||
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||
connect_timeout: float = Query(5.0, description="TCP connect timeout in seconds (default 5)"),
|
||||
repeat: int = Query(3, description="How many times to send the frame within one TCP session (default 3)"),
|
||||
) -> dict:
|
||||
"""
|
||||
Fire-and-forget Stop Monitoring (SUB 0x97). TCP-only.
|
||||
|
||||
Opens a TCP session, dumps the FULL handshake the device's protocol
|
||||
state machine expects — `SESSION_RESET + POLL_PROBE + SESSION_RESET +
|
||||
POLL_DATA` — and then N back-to-back copies of the stop-monitoring
|
||||
frame. Does NOT read any S3 response. Succeeds as long as the bytes
|
||||
left the socket.
|
||||
|
||||
The POLL handshake bytes are required: monitoring units ignore command
|
||||
frames received without a preceding POLL exchange. Sending the POLL
|
||||
bytes "blind" (without reading the responses) still works because the
|
||||
device processes inbound bytes in order regardless of whether we drain
|
||||
its outbound buffer.
|
||||
|
||||
Idempotent: the device processes extra copies of SUB 0x97 the same as
|
||||
one (already-stopped is a no-op).
|
||||
|
||||
Returns the number of bytes sent. A 503 means the TCP connect failed
|
||||
(device busy in another session — caller should retry).
|
||||
"""
|
||||
log.info(
|
||||
"POST /device/stop_monitoring_blind host=%s tcp_port=%s connect_timeout=%.1fs repeat=%d",
|
||||
host, tcp_port, connect_timeout, repeat,
|
||||
)
|
||||
if repeat < 1:
|
||||
repeat = 1
|
||||
|
||||
frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
|
||||
payload = (
|
||||
SESSION_RESET + POLL_PROBE
|
||||
+ SESSION_RESET + POLL_DATA
|
||||
+ (frame * repeat)
|
||||
)
|
||||
t0 = time.monotonic()
|
||||
|
||||
transport = TcpTransport(host, port=tcp_port, connect_timeout=connect_timeout)
|
||||
try:
|
||||
transport.connect()
|
||||
except OSError as exc:
|
||||
raise HTTPException(status_code=503, detail=f"Connection error: {exc}") from exc
|
||||
|
||||
try:
|
||||
transport.write(payload)
|
||||
except OSError as exc:
|
||||
# The finally block below closes the transport; no extra disconnect needed here.
|
||||
raise HTTPException(status_code=502, detail=f"Send error: {exc}") from exc
|
||||
finally:
|
||||
transport.disconnect()
|
||||
|
||||
return {
|
||||
"status": "sent",
|
||||
"bytes_sent": len(payload),
|
||||
"frame_size": len(frame),
|
||||
"repeat": repeat,
|
||||
"elapsed_s": round(time.monotonic() - t0, 3),
|
||||
}
|
||||
|
||||
|
||||
@app.post("/device/stop_monitoring_slow_drip")
|
||||
def device_stop_monitoring_slow_drip(
|
||||
host: str = Query(..., description="TCP host — modem IP"),
|
||||
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||
duration_s: float = Query(120.0, description="Total time to hold the session open (seconds)"),
|
||||
interval_s: float = Query(3.0, description="Seconds between drip sends"),
|
||||
connect_timeout: float = Query(5.0, description="TCP connect timeout"),
|
||||
) -> dict:
|
||||
"""
|
||||
Hold a single TCP session open for *duration_s* seconds and drip
|
||||
stop-monitoring frames into the device at a slow rate so its UART
|
||||
RX FIFO has time to drain between sends.
|
||||
|
||||
Sequence:
|
||||
1. Open TCP session.
|
||||
2. Send the wake preamble: SESSION_RESET + POLL_PROBE +
|
||||
SESSION_RESET + POLL_DATA (so the device's protocol parser
|
||||
is primed for a write command).
|
||||
3. Wait interval_s for the device to drain.
|
||||
4. Drip-send (SESSION_RESET + stop_monitoring_frame) every
|
||||
interval_s until duration_s elapses.
|
||||
5. Opportunistically drain any bytes the device sends back (so
|
||||
the modem's TX queue doesn't fill up). Successful drains are
|
||||
counted in `bytes_received` — non-zero strongly suggests the
|
||||
device has started responding to us.
|
||||
6. Close.
|
||||
|
||||
Designed for units whose firmware is too busy with event-recording
|
||||
to keep up with high-rate spam. Heavy spam overruns the UART FIFO;
|
||||
slow drip stays under it.
|
||||
|
||||
Compared to spam mode: ~40× fewer bytes/sec on the wire, but each
|
||||
byte has a much higher chance of actually being parsed.
|
||||
"""
|
||||
log.info(
|
||||
"POST /device/stop_monitoring_slow_drip host=%s tcp_port=%s duration=%.1fs interval=%.2fs connect_timeout=%.1fs",
|
||||
host, tcp_port, duration_s, interval_s, connect_timeout,
|
||||
)
|
||||
duration_s = max(1.0, min(duration_s, 600.0)) # clamp 1s..10min
|
||||
interval_s = max(0.1, min(interval_s, 30.0))
|
||||
connect_timeout = max(0.1, connect_timeout)
|
||||
|
||||
stop_frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
|
||||
preamble = (
|
||||
SESSION_RESET + POLL_PROBE
|
||||
+ SESSION_RESET + POLL_DATA
|
||||
)
|
||||
|
||||
t0 = time.monotonic()
|
||||
drips_sent = 0
|
||||
bytes_sent = 0
|
||||
bytes_received = 0
|
||||
|
||||
try:
|
||||
sock = socket.create_connection((host, tcp_port), timeout=connect_timeout)
|
||||
except OSError as exc:
|
||||
raise HTTPException(status_code=503, detail=f"Connection error: {exc}") from exc
|
||||
|
||||
# Short read timeout so opportunistic drains don't block.
|
||||
sock.settimeout(0.1)
|
||||
|
||||
try:
|
||||
# Initial wake preamble.
|
||||
try:
|
||||
sock.sendall(preamble)
|
||||
bytes_sent += len(preamble)
|
||||
except OSError as exc:
|
||||
raise HTTPException(status_code=502, detail=f"Preamble send failed: {exc}") from exc
|
||||
|
||||
# Initial settle.
|
||||
time.sleep(interval_s)
|
||||
|
||||
# Try a short (0.1 s timeout) drain of any response to the wake.
|
||||
try:
|
||||
data = sock.recv(4096)
|
||||
if data:
|
||||
bytes_received += len(data)
|
||||
log.info("slow_drip: device responded to wake preamble (%d bytes)", len(data))
|
||||
except socket.timeout:
|
||||
pass
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
deadline = t0 + duration_s
|
||||
drip = SESSION_RESET + stop_frame # 2 + 21 = 23 bytes per drip
|
||||
send_error: Optional[str] = None
|
||||
|
||||
while time.monotonic() < deadline:
|
||||
try:
|
||||
sock.sendall(drip)
|
||||
bytes_sent += len(drip)
|
||||
drips_sent += 1
|
||||
except OSError as exc:
|
||||
send_error = f"{exc}"
|
||||
log.warning("slow_drip: send failed after %d drips: %s", drips_sent, exc)
|
||||
break
|
||||
|
||||
# Drain any inbound bytes; ignore timeouts.
|
||||
try:
|
||||
data = sock.recv(4096)
|
||||
if data:
|
||||
bytes_received += len(data)
|
||||
except socket.timeout:
|
||||
pass
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
# Sleep the interval, but don't oversleep past the deadline.
|
||||
remaining = deadline - time.monotonic()
|
||||
if remaining <= 0:
|
||||
break
|
||||
time.sleep(min(interval_s, remaining))
|
||||
finally:
|
||||
try:
|
||||
sock.shutdown(socket.SHUT_RDWR)
|
||||
except OSError:
|
||||
pass
|
||||
try:
|
||||
sock.close()
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
elapsed = time.monotonic() - t0
|
||||
log.info(
|
||||
"slow_drip done — drips=%d bytes_sent=%d bytes_received=%d in %.1fs",
|
||||
drips_sent, bytes_sent, bytes_received, elapsed,
|
||||
)
|
||||
return {
|
||||
"status": "done",
|
||||
"duration_s": round(elapsed, 2),
|
||||
"drips_sent": drips_sent,
|
||||
"bytes_sent": bytes_sent,
|
||||
"bytes_received": bytes_received,
|
||||
"preamble_bytes": len(preamble),
|
||||
"drip_bytes": len(drip),
|
||||
"send_error": send_error,
|
||||
}
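The deadline-bounded pacing used by the drip loop above can be checked without a socket. This is a simplified sketch, not the endpoint's code: it starts the deadline when the loop starts (the handler starts it before the TCP connect), and `run_drip_loop` and `FakeClock` are hypothetical names. The clock and sleep functions are injected so the timing is deterministic.

```python
from typing import Callable


def run_drip_loop(
    send: Callable[[], None],
    duration_s: float,
    interval_s: float,
    clock: Callable[[], float],
    sleep: Callable[[float], None],
) -> int:
    """Call send() every interval_s until duration_s has elapsed; return the count."""
    deadline = clock() + duration_s
    drips = 0
    while clock() < deadline:
        send()
        drips += 1
        # Sleep the interval, but never past the deadline.
        remaining = deadline - clock()
        if remaining <= 0:
            break
        sleep(min(interval_s, remaining))
    return drips


class FakeClock:
    """Deterministic clock: time only advances when sleep() is called."""

    def __init__(self) -> None:
        self.now = 0.0

    def time(self) -> float:
        return self.now

    def sleep(self, seconds: float) -> None:
        self.now += seconds


fake = FakeClock()
sent: list = []
count = run_drip_loop(lambda: sent.append(fake.time()), 10.0, 3.0, fake.time, fake.sleep)
```

With a 10 second window and a 3 second interval the final sleep is shortened to 1 second, so the loop ends exactly at the deadline instead of overshooting it.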
|
||||
|
||||
|
||||
@app.post("/device/stop_monitoring_spam")
|
||||
def device_stop_monitoring_spam(
|
||||
host: str = Query(..., description="TCP host — modem IP"),
|
||||
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||
duration_s: float = Query(10.0, description="How long to hammer the device for (seconds)"),
|
||||
connect_timeout: float = Query(0.5, description="Per-attempt TCP connect timeout (default 0.5s)"),
|
||||
repeat: int = Query(3, description="Stop frames per TCP session (default 3)"),
|
||||
) -> dict:
|
||||
"""
|
||||
Hammer the device with blind stop-monitoring sessions as fast as
|
||||
possible for `duration_s` seconds. Each attempt: open TCP → write
|
||||
SESSION_RESET + POLL handshake + STOP frames × repeat → close. No
|
||||
response is read.
|
||||
|
||||
Designed for units that are aggressively calling home — short
|
||||
connect_timeout (default 500 ms) means every failed attempt loses
|
||||
only that much time before retrying, so we can fit several attempts
|
||||
per second even when the modem is mostly busy with its own outbound
|
||||
sessions.
|
||||
|
||||
Single HTTP call kicks off the whole burst; counters are returned
|
||||
when it finishes. No streaming; if you want live progress, watch
|
||||
SFM logs.
|
||||
"""
|
||||
log.info(
|
||||
"POST /device/stop_monitoring_spam host=%s tcp_port=%s duration=%.1fs connect_timeout=%.3fs repeat=%d",
|
||||
host, tcp_port, duration_s, connect_timeout, repeat,
|
||||
)
|
||||
if repeat < 1:
|
||||
repeat = 1
|
||||
duration_s = max(0.1, min(duration_s, 300.0)) # clamp 0.1s..5min
|
||||
connect_timeout = max(0.05, connect_timeout)
|
||||
|
||||
frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
|
||||
payload = (
|
||||
SESSION_RESET + POLL_PROBE
|
||||
+ SESSION_RESET + POLL_DATA
|
||||
+ (frame * repeat)
|
||||
)
|
||||
|
||||
t0 = time.monotonic()
|
||||
deadline = t0 + duration_s
|
||||
sent_ok = 0
|
||||
connect_failed = 0
|
||||
write_failed = 0
|
||||
|
||||
while time.monotonic() < deadline:
|
||||
try:
|
||||
sock = socket.create_connection((host, tcp_port), timeout=connect_timeout)
|
||||
except OSError:
|
||||
connect_failed += 1
|
||||
continue
|
||||
try:
|
||||
sock.sendall(payload)
|
||||
sent_ok += 1
|
||||
except OSError:
|
||||
write_failed += 1
|
||||
finally:
|
||||
try:
|
||||
sock.shutdown(socket.SHUT_RDWR)
|
||||
except OSError:
|
||||
pass
|
||||
try:
|
||||
sock.close()
|
||||
except OSError:
|
||||
pass
|
||||
|
||||
elapsed = time.monotonic() - t0
|
||||
total = sent_ok + connect_failed + write_failed
|
||||
log.info(
|
||||
"stop_monitoring_spam done — sent=%d connect_failed=%d write_failed=%d in %.2fs",
|
||||
sent_ok, connect_failed, write_failed, elapsed,
|
||||
)
|
||||
return {
|
||||
"status": "done",
|
||||
"duration_s": round(elapsed, 2),
|
||||
"sent_ok": sent_ok,
|
||||
"connect_failed": connect_failed,
|
||||
"write_failed": write_failed,
|
||||
"total_attempts": total,
|
||||
"rate_attempts_per_s": round(total / elapsed, 1) if elapsed > 0 else 0,
|
||||
"payload_bytes": len(payload),
|
||||
}
|
||||
|
||||
|
||||
@app.post("/device/rescue")
|
||||
def device_rescue(
|
||||
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||
baud: int = Query(38400, description="Serial baud rate"),
|
||||
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
|
||||
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||
connect_timeout: float = Query(5.0, description="TCP connect timeout in seconds (default 5)"),
|
||||
recv_timeout: float = Query(5.0, description="Per-frame S3 recv timeout in seconds (default 5)"),
|
||||
disable_ach: bool = Query(True, description="Disable Auto Call Home on the device before erasing"),
|
||||
erase: bool = Query(True, description="Erase all stored events after disabling ACH"),
|
||||
) -> dict:
|
||||
"""
|
||||
Rescue an uncooperative unit by squeezing all maintenance work into a
|
||||
single TCP session.
|
||||
|
||||
Designed for devices that are actively calling home to a separate ACH
|
||||
server (BW or otherwise). While we hold this TCP session open the
|
||||
modem cannot accept an inbound ACH call, so the order matters:
|
||||
|
||||
1. Short-timeout TCP connect (fails fast if the device is busy in
|
||||
another session — the caller should retry in a tight loop).
|
||||
2. POLL handshake.
|
||||
3. (optional) Write call_home config with auto_call_home_enabled=false
|
||||
so the device stops calling out even after we drop the session.
|
||||
4. (optional) Erase all stored events (0xA3 → 0x1C → 0x06 → 0xA2).
|
||||
5. Close the TCP session.
|
||||
|
||||
Both `disable_ach` and `erase` default to true. Pass `?erase=false` if
|
||||
you only want to silence the unit without wiping its events.
|
||||
|
||||
Caller pattern (bash):
|
||||
|
||||
until curl -fsS --max-time 30 -X POST \\
|
||||
"http://localhost:8001/api/sfm/device/rescue?host=$IP&tcp_port=$P"; do
|
||||
sleep 1
|
||||
done
|
||||
"""
|
||||
log.info(
|
||||
"POST /device/rescue host=%s tcp_port=%s connect_timeout=%.1fs recv_timeout=%.1fs disable_ach=%s erase=%s",
|
||||
host, tcp_port, connect_timeout, recv_timeout, disable_ach, erase,
|
||||
)
|
||||
|
||||
steps: list[dict] = []
|
||||
t0 = time.monotonic()
|
||||
|
||||
try:
|
||||
with _build_client(
|
||||
port, baud, host, tcp_port,
|
||||
timeout=recv_timeout,
|
||||
connect_timeout=connect_timeout,
|
||||
) as client:
|
||||
steps.append({"step": "tcp_connect", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
|
||||
|
||||
try:
|
||||
client.poll()
|
||||
except Exception as exc:
|
||||
log.warning("rescue: poll retry: %s", exc)
|
||||
client.poll()
|
||||
steps.append({"step": "poll", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
|
||||
|
||||
if disable_ach:
|
||||
client.set_call_home_config(auto_call_home_enabled=False)
|
||||
steps.append({"step": "disable_ach", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
|
||||
|
||||
if erase:
|
||||
client.delete_all_events()
|
||||
steps.append({"step": "erase", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
|
||||
|
||||
except ProtocolError as exc:
|
||||
steps.append({"step": "error", "ok": False, "detail": f"protocol: {exc}"})
|
||||
raise HTTPException(status_code=502, detail={"message": f"Protocol error: {exc}", "steps": steps}) from exc
|
||||
except OSError as exc:
|
||||
steps.append({"step": "error", "ok": False, "detail": f"socket: {exc}"})
|
||||
# Connection refused / timed out → device busy in another session. Caller should retry.
|
||||
raise HTTPException(status_code=503, detail={"message": f"Connection error: {exc}", "steps": steps}) from exc
|
||||
except Exception as exc:
|
||||
steps.append({"step": "error", "ok": False, "detail": str(exc)})
|
||||
raise HTTPException(status_code=500, detail={"message": f"Device error: {exc}", "steps": steps}) from exc
|
||||
|
||||
conn_key = SFMCache.make_conn_key(host, tcp_port, port, baud)
|
||||
cleared = get_cache().clear_device(conn_key)
|
||||
return {
|
||||
"status": "ok",
|
||||
"elapsed_s": round(time.monotonic() - t0, 2),
|
||||
"disable_ach": disable_ach,
|
||||
"erase": erase,
|
||||
"steps": steps,
|
||||
"cache_cleared": cleared,
|
||||
}
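The caller pattern from the docstring (retry until the rescue call succeeds) can also be written in Python. A minimal sketch under stated assumptions: `retry_until_ok` is a hypothetical helper, and `attempt` stands in for whatever performs the HTTP POST and returns its status code, so no real HTTP client is involved here.

```python
from typing import Callable, Optional


def retry_until_ok(
    attempt: Callable[[], int],
    max_attempts: int,
    delay_s: float,
    sleep: Callable[[float], None],
) -> Optional[int]:
    """Call attempt() until it returns 200; return the 1-based attempt number, or None."""
    for n in range(1, max_attempts + 1):
        if attempt() == 200:
            return n
        if n < max_attempts:
            sleep(delay_s)  # back off briefly before the next try
    return None


# Simulated status sequence: busy, busy, protocol error, then success.
statuses = iter([503, 503, 502, 200])
delays: list = []
attempts_needed = retry_until_ok(lambda: next(statuses), 10, 1.0, delays.append)
```

Bounding the loop with `max_attempts` is a design choice of this sketch; the bash loop in the docstring retries indefinitely.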
|
||||
|
||||
|
||||
@app.post("/device/monitor/start")
|
||||
def device_monitor_start(
|
||||
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||
@@ -1403,6 +1952,175 @@ def db_set_false_trigger(
|
||||
return {"status": "ok", "event_id": event_id, "false_trigger": value}
|
||||
|
||||
|
||||
def _cleanup_event_files(row: dict) -> dict:
|
||||
"""
|
||||
Best-effort cleanup of on-disk waveform / sidecar / pickle / hdf5 files
|
||||
associated with a deleted event row. Returns a dict of {kind: bool} for
|
||||
what was actually removed (True) vs failed (False); files not found are omitted.
|
||||
"""
|
||||
serial = row.get("serial")
|
||||
bw_name = row.get("blastware_filename")
|
||||
a5_name = row.get("a5_pickle_filename")
|
||||
sc_name = row.get("sidecar_filename")
|
||||
removed: dict = {}
|
||||
if not serial:
|
||||
return removed
|
||||
store = _get_store()
|
||||
# blastware_filename is the "base" — other files derive their paths from it
|
||||
# via WaveformStore helpers. Sidecar and a5 may also be stored under their
|
||||
# own column values if they ever diverged historically.
|
||||
base_name = bw_name or a5_name or sc_name
|
||||
if base_name:
|
||||
bw_path, a5_path = store.paths_for(serial, base_name)
|
||||
sc_path = store.sidecar_path_for(serial, base_name)
|
||||
h5_path = store.hdf5_path_for(serial, base_name)
|
||||
for kind, p in [("blastware", bw_path), ("a5_pickle", a5_path),
|
||||
("sidecar", sc_path), ("hdf5", h5_path)]:
|
||||
try:
|
||||
if p.exists():
|
||||
p.unlink()
|
||||
removed[kind] = True
|
||||
except OSError as exc:
|
||||
log.warning("file cleanup failed for %s (%s): %s", p, kind, exc)
|
||||
removed[kind] = False
|
||||
return removed
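The best-effort unlink loop above reports only the files it found. A small standalone sketch of the same pattern, using a temporary directory instead of the WaveformStore (which is not reproduced here); `remove_files` and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def remove_files(paths: dict) -> dict:
    """Unlink each existing path; record True on success, False on OSError."""
    removed: dict = {}
    for kind, p in paths.items():
        try:
            if p.exists():
                p.unlink()
                removed[kind] = True
        except OSError:
            removed[kind] = False
    return removed


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "event.bin"
    present.write_bytes(b"\x00")
    missing = Path(tmp) / "event.h5"
    outcome = remove_files({"blastware": present, "hdf5": missing})
    still_there = present.exists()
```

The missing file produces no entry at all, so a caller that sums the True values counts only real deletions.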
|
||||
|
||||
|
||||
@app.delete("/db/events/{event_id}")
|
||||
def db_delete_event(event_id: str) -> dict:
|
||||
"""
|
||||
Hard-delete a single event from the SFM events table and remove any
|
||||
associated on-disk waveform/sidecar/pickle/hdf5 files.
|
||||
|
||||
Returns 404 if the event_id is not found.
|
||||
"""
|
||||
log.info("DELETE /db/events/%s", event_id)
|
||||
deleted = _get_db().delete_event(event_id)
|
||||
if deleted is None:
|
||||
raise HTTPException(status_code=404, detail=f"Event {event_id} not found")
|
||||
files_removed = _cleanup_event_files(deleted)
|
||||
return {
|
||||
"status": "ok",
|
||||
"event_id": event_id,
|
||||
"files_removed": files_removed,
|
||||
}
|
||||
|
||||
|
||||
class BulkDeleteBody(BaseModel):
|
||||
"""Body for POST /db/events/delete_bulk."""
|
||||
serial: Optional[str] = None
|
||||
from_dt: Optional[str] = None # ISO-8601
|
||||
to_dt: Optional[str] = None # ISO-8601
|
||||
false_trigger: Optional[bool] = None
|
||||
ids: Optional[list[str]] = None
|
||||
confirm: bool = False
|
||||
# Safety: when no `ids` are supplied, this caps how many rows may
|
||||
# actually be deleted; if the matched count exceeds it, the endpoint
|
||||
# returns a dry-run-style summary instead. Pass None to disable.
|
||||
max_rows: Optional[int] = 10000
|
||||
|
||||
|
||||
@app.post("/db/events/delete_bulk")
|
||||
def db_delete_events_bulk(body: BulkDeleteBody) -> dict:
|
||||
"""
|
||||
Hard-delete multiple events at once, by filter and/or by id list.
|
||||
|
||||
Filters (`serial`, `from_dt`, `to_dt`, `false_trigger`) combine with AND,
|
||||
matching the same semantics as `GET /db/events`. `ids` is an additional
|
||||
inclusion list. At least one filter or non-empty `ids` MUST be supplied
|
||||
— refusing to wipe the whole table.
|
||||
|
||||
Safety knobs:
|
||||
- `confirm` MUST be `true` to actually delete. When false (default),
|
||||
returns the match count without deleting (dry-run).
|
||||
- `max_rows` (default 10,000) caps how many rows can be deleted in one
|
||||
call by-filter; if the match count exceeds it, the endpoint returns
|
||||
a count summary without deleting. Ignored when only `ids` is used.
|
||||
|
||||
Returns:
|
||||
{
|
||||
"status": "ok" | "dry_run" | "too_many",
|
||||
"matched": <int>,
|
||||
"deleted": <int>, # 0 unless status == "ok"
|
||||
"files_removed": <int>, # total file unlink successes
|
||||
"sample_serials": [...], # up to 5 distinct serials touched
|
||||
}
|
||||
"""
|
||||
log.info(
|
||||
"POST /db/events/delete_bulk serial=%s from=%s to=%s ft=%s ids=%d confirm=%s max=%s",
|
||||
body.serial, body.from_dt, body.to_dt, body.false_trigger,
|
||||
len(body.ids or []), body.confirm, body.max_rows,
|
||||
)
|
||||
|
||||
from_parsed = datetime.datetime.fromisoformat(body.from_dt) if body.from_dt else None
|
||||
to_parsed = datetime.datetime.fromisoformat(body.to_dt) if body.to_dt else None
|
||||
|
||||
db = _get_db()
|
||||
|
||||
# Count matches first; this query runs for both dry-run and confirmed deletes.
|
||||
rows = db.query_events(
|
||||
serial=body.serial,
|
||||
from_dt=from_parsed,
|
||||
to_dt=to_parsed,
|
||||
false_trigger=body.false_trigger,
|
||||
limit=1_000_000, # we want a true count, not a page
|
||||
offset=0,
|
||||
)
|
||||
if body.ids:
|
||||
id_set = set(body.ids)
|
||||
rows = [r for r in rows if r["id"] in id_set]
|
||||
matched = len(rows)
|
||||
sample_serials = sorted({r.get("serial") for r in rows[:50] if r.get("serial")})[:5]
|
||||
|
||||
if not body.confirm:
|
||||
return {
|
||||
"status": "dry_run",
|
||||
"matched": matched,
|
||||
"deleted": 0,
|
||||
"files_removed": 0,
|
||||
"sample_serials": sample_serials,
|
||||
"hint": "Set confirm=true in the request body to actually delete.",
|
||||
}
|
||||
|
||||
if body.max_rows is not None and not body.ids and matched > body.max_rows:
|
||||
return {
|
||||
"status": "too_many",
|
||||
"matched": matched,
|
||||
"deleted": 0,
|
||||
"files_removed": 0,
|
||||
"sample_serials": sample_serials,
|
||||
"hint": (
|
||||
f"Matched {matched} > max_rows={body.max_rows}. Either raise "
|
||||
f"max_rows in the body, narrow the filter, or supply an "
|
||||
f"explicit `ids` list."
|
||||
),
|
||||
}
|
||||
|
||||
try:
|
||||
deleted_rows = db.delete_events_bulk(
|
||||
serial=body.serial,
|
||||
from_dt=from_parsed,
|
||||
to_dt=to_parsed,
|
||||
false_trigger=body.false_trigger,
|
||||
ids=body.ids,
|
||||
)
|
||||
except ValueError as exc:
|
||||
raise HTTPException(status_code=422, detail=str(exc)) from exc
|
||||
|
||||
files_removed = 0
|
||||
for row in deleted_rows:
|
||||
result = _cleanup_event_files(row)
|
||||
files_removed += sum(1 for ok in result.values() if ok)
|
||||
|
||||
return {
|
||||
"status": "ok",
|
||||
"matched": matched,
|
||||
"deleted": len(deleted_rows),
|
||||
"files_removed": files_removed,
|
||||
"sample_serials": sample_serials,
|
||||
}
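The two safety checks in this endpoint reduce to a small decision function. The sketch below restates them in isolation, assuming the same order of evaluation as the handler (the `confirm` check first, then the `max_rows` cap, which is skipped when `ids` are supplied). `bulk_delete_gate` is a hypothetical name, not something the PR defines.

```python
from typing import Optional


def bulk_delete_gate(
    matched: int,
    confirm: bool,
    max_rows: Optional[int],
    has_ids: bool,
) -> str:
    """Return the status the endpoint would report before touching the table."""
    if not confirm:
        return "dry_run"
    if max_rows is not None and not has_ids and matched > max_rows:
        return "too_many"
    return "ok"


# A few representative cases.
preview = bulk_delete_gate(matched=25000, confirm=False, max_rows=10000, has_ids=False)
blocked = bulk_delete_gate(matched=25000, confirm=True, max_rows=10000, has_ids=False)
by_ids = bulk_delete_gate(matched=25000, confirm=True, max_rows=10000, has_ids=True)
uncapped = bulk_delete_gate(matched=25000, confirm=True, max_rows=None, has_ids=False)
```

Because the `confirm` check comes first, an unconfirmed request always reports `dry_run`, even when the match count would also exceed `max_rows`.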
|
||||
|
||||
|
||||
# ── /db/events/{id} — waveform file accessors ─────────────────────────────────
|
||||
#
|
||||
# These endpoints serve files from the persistent WaveformStore, so a Blastware
|
||||
|
||||