From 57287a2ade186c12f17e82b528820fe4d90f797f Mon Sep 17 00:00:00 2001 From: serversdown Date: Sun, 17 May 2026 23:07:12 +0000 Subject: [PATCH] chore: update to 0.17.0 --- CHANGELOG.md | 56 ++++++++++++++++++++++++++++++++++++++++++++++++++ CLAUDE.md | 2 +- README.md | 2 +- pyproject.toml | 2 +- sfm/server.py | 2 +- 5 files changed, 60 insertions(+), 4 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0bdfdf4..a1c60e2 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,62 @@ All notable changes to seismo-relay are documented here. --- +## v0.17.0 — 2026-05-17 + +The "field rescue + DB management" release. Hardened against units that are stuck in a runaway call-home loop, and added an operator-facing path for purging bogus events that those same units dump into the DB before recovery. All work in this release was driven by the BE9558H incident (full incident log + recovery procedure at `docs/runbooks/wedged_unit_recovery.md`). + +### Added — wedged-unit recovery toolkit + +A toolkit for breaking the call-home loop on a misbehaving unit whose firmware is too busy to keep up with normal request/response handshakes. Tested in production against BE9558H (16 May 2026) — a unit with a stuck-triggered Long-axis geophone that had been call-homing the office BW ACH server every 30 seconds for hours. Endpoints layered from "single attempt" to "siege mode" to suit different contention levels: + +- **`GET /device/events/storage_range`** — SUB 0x06 probe. POLL + one read; ~2s. Returns first/last event keys and an `is_empty` flag. Use to triage whether a unit has stored events without invoking the slow `count_events()` 1E/1F chain (which choked on BE9558H's corrupted event chain). +- **`GET /device/events/index`** — SUB 0x08 probe. POLL + one read; ~2s. Returns the lifetime event counter (does NOT decrement on erase — use `storage_range` for "right now" state). +- **`POST /device/events/erase`** — full erase sequence `0xA3 → 0x1C → 0x06 → 0xA2` (confirmed 2026-04-11, see the protocol reference). Resets event keys to `0x01110000`. Caller's responsibility to disable ACH first if the underlying trigger condition will re-fill the buffer. +- **`POST /device/rescue`** — one TCP session, short connect+recv timeouts: POLL → disable ACH (compliance config write) → erase events → close. Designed for race-loop usage when the device is busy in another session. 503 on connect-refused, 502 on protocol failure, 200 on full sequence success. +- **`POST /device/stop_monitoring_blind`** — fire-and-forget Stop Monitoring (SUB 0x97), TCP-only. Dumps `SESSION_RESET + POLL_PROBE + SESSION_RESET + POLL_DATA + 0x97 × repeat` and closes without reading any S3 response. The full POLL preamble is required — write commands without it are silently ignored by the device's protocol parser (false-positive surface area that bit the first version of this endpoint). Use when the device's firmware can't keep up with full request/response but might process inbound bytes at its own pace. +- **`POST /device/stop_monitoring_spam`** — server-side hammer loop, duration-bounded. Open TCP → write the same blind payload → close → repeat as fast as possible until `duration_s` elapses. Configurable `connect_timeout` (default 500ms) and `repeat` (frames per session). Reports `sent_ok`, `connect_failed`, `write_failed`, `rate_attempts_per_s`. Clamped to 5min duration. +- **`POST /device/stop_monitoring_slow_drip`** — opposite of spam. Open ONE TCP session, drip the wake handshake + stop frames at `interval_s` (default 3s) for `duration_s` (default 120s, max 10min). Each drip is ~23 bytes — well under any UART FIFO size. Opportunistically drains any inbound bytes the device sends back; `bytes_received > 0` in the response strongly suggests the device has started talking and the session is healthy. **This is the endpoint that saved BE9558H.** Spam mode had been overrunning the device's UART FIFO; slow drip stayed under it. +- **Six rescue scripts** under `scripts/` — thin bash wrappers around the endpoints, default `SFM_BASE_URL=http://localhost:8200` (direct, not via Terra-View proxy whose 60s timeout would cut off the longer endpoints): + - `rescue_device.sh` — race-loop wrapper for `/device/rescue` + - `blind_stop.sh` — race-loop wrapper for `/device/stop_monitoring_blind` + - `spam_stop.sh` — single-call burst hammer + - `slow_drip.sh` — single-call held-session drip + - `watch_unit.sh` — passive periodic reachability check (every N min, logs to file), useful for unattended overnight monitoring of a wedged unit +- **`docs/runbooks/wedged_unit_recovery.md`** — symptoms, quick-reference recovery procedure, the modem-layer mechanism (Sierra Wireless serial-port mode-flipping is the real failure mode — not the device firmware), and a table of "why simpler approaches don't work" so the next incident skips the dead ends. + +### Added — operator event DB management + +Endpoints powering Terra-View's new `/admin/events` page (v0.12.0). Designed for purging bogus events from a unit that's been forwarding them in bulk (e.g. a stuck-triggered seismograph dumping hundreds of junk events before it's recovered). + +- **`DELETE /db/events/{event_id}`** — hard-delete one event row. Also unlinks the associated blastware binary (`.AB0*`), `.a5.pkl`, `.sfm.json` sidecar, and `.h5` clean-waveform files via the WaveformStore. Returns the per-file removal status. 404 if the event doesn't exist. +- **`POST /db/events/delete_bulk`** — filter-based or id-list-based bulk delete with safety rails: + - Filters (`serial`, `from_dt`, `to_dt`, `false_trigger`) combine with AND; same semantics as `GET /db/events`. `ids` is an additional inclusion list. Refuses to run with no filters (would wipe the whole table — raises 422). + - `confirm` must be `true` to actually delete. Otherwise returns a dry-run summary (`status: "dry_run"`, `matched: N`, `sample_serials: [...]`). + - `max_rows` (default 10,000) caps how many rows can be deleted by-filter in one call. If exceeded, returns `status: "too_many"` with a hint to narrow or raise the cap. Bypassed when only `ids` is supplied. +- **`_cleanup_event_files(row)`** helper in `sfm/server.py` — best-effort `unlink()` of all four sidecar paths derived from the row's `blastware_filename`. Logged at WARN if a path exists but unlink fails; the DB row deletion still proceeds. +- **`SeismoDb.delete_event(id)` and `SeismoDb.delete_events_bulk(...)`** in `sfm/database.py` — both return the deleted row dict(s) so callers can do file cleanup. `delete_events_bulk` raises `ValueError` if no filters are supplied. + +### Changed + +- **Default protocol recv timeout dropped from 30s → 10s** in `_build_client()`. The unit usually responds in well under a second over cellular; 10s leaves comfortable headroom for retransmits while failing reasonably fast when a unit is wedged. The two endpoints that perform full 5A waveform downloads still pass `timeout=120.0` explicitly so multi-minute event transfers are unaffected. +- **`_build_client()` now accepts an optional `connect_timeout`** (TCP-only) so rescue / race-loop endpoints can fail fast on busy modems without affecting the protocol-level recv timeout. + +### Fixed + +- **`GET /device/monitor/status` returned HTTP 500 + uncaught traceback when the device was unresponsive**. The retry-on-`Exception` inner block let the second `client.poll()`'s `ProtocolError` propagate out of the handler. Now wrapped in proper try/except — returns 502 with `{"detail": "Protocol error: No S3 frame received within 10.0s ..."}` on timeout, 502 on connection errors, 500 only for genuinely unexpected exceptions. + +### Migration + +No schema changes. No data migration required. + +If you've been running a previous version against a wedged unit and accumulated bogus events, the new `/admin/events` page in Terra-View v0.12.0 (or direct `POST /db/events/delete_bulk` with `confirm: true`) is the cleanup tool. Watcher state on the upstream DL2 PC does NOT need separate cleaning — the watcher's `sfm_forwarded.json` keys on file sha256 and won't re-forward the same files. + +### Pairing + +This release pairs with **Terra-View v0.12.0**, which adds the `/admin/events` UI that consumes the new bulk-delete endpoints, the bulk false-trigger flagging on `/unit/{id}`, and the field-deployment workflow that uses the same `series3-watcher` → SFM ingest path as before. + +--- + ## v0.16.1 — 2026-05-14 ### Fixed diff --git a/CLAUDE.md b/CLAUDE.md index 56a9e69..1f8d71a 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -2,7 +2,7 @@ Ground-up Python replacement for **Blastware**, Instantel's Windows-only software for managing MiniMate Plus seismographs. Connects over direct RS-232 or cellular modem -(Sierra Wireless RV50 / RV55). Current version: **v0.14.3**. +(Sierra Wireless RV50 / RV55). Current version: **v0.17.0**. When new information about the protocol is discovered, please update the instantel_protocol_reference.md with the findings in addition to this document diff --git a/README.md b/README.md index e94c1cd..79534c8 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# seismo-relay `v0.16.1` +# seismo-relay `v0.17.0` A ground-up replacement for **Blastware** — Instantel's aging Windows-only software for managing MiniMate Plus seismographs. diff --git a/pyproject.toml b/pyproject.toml index c8abd3c..922e5f9 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta" [project] name = "seismo-relay" -version = "0.16.1" +version = "0.17.0" description = "Python client and REST server for MiniMate Plus seismographs" requires-python = ">=3.10" dependencies = [ diff --git a/sfm/server.py b/sfm/server.py index 95937c6..603e2d2 100644 --- a/sfm/server.py +++ b/sfm/server.py @@ -89,7 +89,7 @@ app = FastAPI( "Implements the minimateplus RS-232 protocol library.\n" "Proxied by terra-view at /api/sfm/*." ), - version="0.16.1", + version="0.17.0", ) # Allow requests from the waveform viewer opened as a local file (file://)