2026-05-17 19:13:57 -04:00
5 changed files with 60 additions and 4 deletions
@@ -4,6 +4,62 @@ All notable changes to seismo-relay are documented here.

 ---

+## v0.17.0 — 2026-05-17
+
+The "field rescue + DB management" release.  Hardened against units that are stuck in a runaway call-home loop, and added an operator-facing path for purging bogus events that those same units dump into the DB before recovery.  All work in this release was driven by the BE9558H incident (full incident log + recovery procedure at `docs/runbooks/wedged_unit_recovery.md`).
+
+### Added — wedged-unit recovery toolkit
+
+A toolkit for breaking the call-home loop on a misbehaving unit whose firmware is too busy to keep up with normal request/response handshakes.  Tested in production against BE9558H (16 May 2026) — a unit with a stuck-triggered Long-axis geophone that had been call-homing the office BW ACH server every 30 seconds for hours.  Endpoints layered from "single attempt" to "siege mode" to suit different contention levels:
+
+- **`GET /device/events/storage_range`** — SUB 0x06 probe.  POLL + one read; ~2s.  Returns first/last event keys and an `is_empty` flag.  Use to triage whether a unit has stored events without invoking the slow `count_events()` 1E/1F chain (which choked on BE9558H's corrupted event chain).
+- **`GET /device/events/index`** — SUB 0x08 probe.  POLL + one read; ~2s.  Returns the lifetime event counter (does NOT decrement on erase — use `storage_range` for "right now" state).
+- **`POST /device/events/erase`** — full erase sequence `0xA3 → 0x1C → 0x06 → 0xA2` (confirmed 2026-04-11, see the protocol reference).  Resets event keys to `0x01110000`.  Caller's responsibility to disable ACH first if the underlying trigger condition will re-fill the buffer.
+- **`POST /device/rescue`** — one TCP session, short connect+recv timeouts: POLL → disable ACH (compliance config write) → erase events → close.  Designed for race-loop usage when the device is busy in another session.  503 on connect-refused, 502 on protocol failure, 200 on full sequence success.
+- **`POST /device/stop_monitoring_blind`** — fire-and-forget Stop Monitoring (SUB 0x97), TCP-only.  Dumps `SESSION_RESET + POLL_PROBE + SESSION_RESET + POLL_DATA + 0x97 × repeat` and closes without reading any S3 response.  The full POLL preamble is required — write commands without it are silently ignored by the device's protocol parser (false-positive surface area that bit the first version of this endpoint).  Use when the device's firmware can't keep up with full request/response but might process inbound bytes at its own pace.
+- **`POST /device/stop_monitoring_spam`** — server-side hammer loop, duration-bounded.  Open TCP → write the same blind payload → close → repeat as fast as possible until `duration_s` elapses.  Configurable `connect_timeout` (default 500ms) and `repeat` (frames per session).  Reports `sent_ok`, `connect_failed`, `write_failed`, `rate_attempts_per_s`.  Clamped to 5min duration.
+- **`POST /device/stop_monitoring_slow_drip`** — opposite of spam.  Open ONE TCP session, drip the wake handshake + stop frames at `interval_s` (default 3s) for `duration_s` (default 120s, max 10min).  Each drip is ~23 bytes — well under any UART FIFO size.  Opportunistically drains any inbound bytes the device sends back; `bytes_received > 0` in the response strongly suggests the device has started talking and the session is healthy.  **This is the endpoint that saved BE9558H.** Spam mode had been overrunning the device's UART FIFO; slow drip stayed under it.
+- **Six rescue scripts** under `scripts/` — thin bash wrappers around the endpoints, default `SFM_BASE_URL=http://localhost:8200` (direct, not via Terra-View proxy whose 60s timeout would cut off the longer endpoints):
+    - `rescue_device.sh` — race-loop wrapper for `/device/rescue`
+    - `blind_stop.sh` — race-loop wrapper for `/device/stop_monitoring_blind`
+    - `spam_stop.sh` — single-call burst hammer
+    - `slow_drip.sh` — single-call held-session drip
+    - `watch_unit.sh` — passive periodic reachability check (every N min, logs to file), useful for unattended overnight monitoring of a wedged unit
+- **`docs/runbooks/wedged_unit_recovery.md`** — symptoms, quick-reference recovery procedure, the modem-layer mechanism (Sierra Wireless serial-port mode-flipping is the real failure mode — not the device firmware), and a table of "why simpler approaches don't work" so the next incident skips the dead ends.
+
+### Added — operator event DB management
+
+Endpoints powering Terra-View's new `/admin/events` page (v0.12.0).  Designed for purging bogus events from a unit that's been forwarding them in bulk (e.g. a stuck-triggered seismograph dumping hundreds of junk events before it's recovered).
+
+- **`DELETE /db/events/{event_id}`** — hard-delete one event row.  Also unlinks the associated blastware binary (`.AB0*`), `.a5.pkl`, `.sfm.json` sidecar, and `.h5` clean-waveform files via the WaveformStore.  Returns the per-file removal status.  404 if the event doesn't exist.
+- **`POST /db/events/delete_bulk`** — filter-based or id-list-based bulk delete with safety rails:
+    - Filters (`serial`, `from_dt`, `to_dt`, `false_trigger`) combine with AND; same semantics as `GET /db/events`.  `ids` is an additional inclusion list.  Refuses to run with no filters (would wipe the whole table — raises 422).
+    - `confirm` must be `true` to actually delete.  Otherwise returns a dry-run summary (`status: "dry_run"`, `matched: N`, `sample_serials: [...]`).
+    - `max_rows` (default 10,000) caps how many rows can be deleted by-filter in one call.  If exceeded, returns `status: "too_many"` with a hint to narrow or raise the cap.  Bypassed when only `ids` is supplied.
+- **`_cleanup_event_files(row)`** helper in `sfm/server.py` — best-effort `unlink()` of all four sidecar paths derived from the row's `blastware_filename`.  Logged at WARN if a path exists but unlink fails; the DB row deletion still proceeds.
+- **`SeismoDb.delete_event(id)` and `SeismoDb.delete_events_bulk(...)`** in `sfm/database.py` — both return the deleted row dict(s) so callers can do file cleanup.  `delete_events_bulk` raises `ValueError` if no filters are supplied.
+
+### Changed
+
+- **Default protocol recv timeout dropped from 30s → 10s** in `_build_client()`.  The unit usually responds in well under a second over cellular; 10s leaves comfortable headroom for retransmits while failing reasonably fast when a unit is wedged.  The two endpoints that perform full 5A waveform downloads still pass `timeout=120.0` explicitly so multi-minute event transfers are unaffected.
+- **`_build_client()` now accepts an optional `connect_timeout`** (TCP-only) so rescue / race-loop endpoints can fail fast on busy modems without affecting the protocol-level recv timeout.
+
+### Fixed
+
+- **`GET /device/monitor/status` returned HTTP 500 + uncaught traceback when the device was unresponsive**.  The retry-on-`Exception` inner block let the second `client.poll()`'s `ProtocolError` propagate out of the handler.  Now wrapped in proper try/except — returns 502 with `{"detail": "Protocol error: No S3 frame received within 10.0s ..."}` on timeout, 502 on connection errors, 500 only for genuinely unexpected exceptions.
+
+### Migration
+
+No schema changes.  No data migration required.
+
+If you've been running a previous version against a wedged unit and accumulated bogus events, the new `/admin/events` page in Terra-View v0.12.0 (or direct `POST /db/events/delete_bulk` with `confirm: true`) is the cleanup tool.  Watcher state on the upstream DL2 PC does NOT need separate cleaning — the watcher's `sfm_forwarded.json` keys on file sha256 and won't re-forward the same files.
+
+### Pairing
+
+This release pairs with **Terra-View v0.12.0**, which adds the `/admin/events` UI that consumes the new bulk-delete endpoints, the bulk false-trigger flagging on `/unit/{id}`, and the field-deployment workflow that uses the same `series3-watcher` → SFM ingest path as before.
+
+---
+
 ## v0.16.1 — 2026-05-14

 ### Fixed
@@ -2,7 +2,7 @@

 Ground-up Python replacement for **Blastware**, Instantel's Windows-only software for
 managing MiniMate Plus seismographs. Connects over direct RS-232 or cellular modem
-(Sierra Wireless RV50 / RV55). Current version: **v0.14.3**.
+(Sierra Wireless RV50 / RV55). Current version: **v0.17.0**.

 When new information about the protocol is discovered, please update the instantel_protocol_reference.md with the findings in addition to this document

@@ -1,4 +1,4 @@
-# seismo-relay  `v0.16.1`
+# seismo-relay  `v0.17.0`

 A ground-up replacement for **Blastware** — Instantel's aging Windows-only
 software for managing MiniMate Plus seismographs.
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"

 [project]
 name = "seismo-relay"
-version = "0.16.1"
+version = "0.17.0"
 description = "Python client and REST server for MiniMate Plus seismographs"
 requires-python = ">=3.10"
 dependencies = [
@@ -89,7 +89,7 @@ app = FastAPI(
        "Implements the minimateplus RS-232 protocol library.\n"
        "Proxied by terra-view at /api/sfm/*."
    ),
-    version="0.16.1",
+    version="0.17.0",
 )

 # Allow requests from the waveform viewer opened as a local file (file://)