merge full s3 codec decoded #23
@@ -0,0 +1,28 @@
|
|||||||
|
.git
|
||||||
|
.gitignore
|
||||||
|
|
||||||
|
.venv
|
||||||
|
venv
|
||||||
|
env
|
||||||
|
__pycache__
|
||||||
|
*.pyc
|
||||||
|
*.pyo
|
||||||
|
*.pyd
|
||||||
|
.pytest_cache
|
||||||
|
.mypy_cache
|
||||||
|
.ruff_cache
|
||||||
|
|
||||||
|
*.db
|
||||||
|
*.db-wal
|
||||||
|
*.db-shm
|
||||||
|
*.sqlite
|
||||||
|
*.sqlite3
|
||||||
|
|
||||||
|
sfm/data
|
||||||
|
bridges/captures
|
||||||
|
example-events
|
||||||
|
captures
|
||||||
|
logs
|
||||||
|
|
||||||
|
.DS_Store
|
||||||
|
Thumbs.db
|
||||||
@@ -4,6 +4,76 @@ All notable changes to seismo-relay are documented here.
|
|||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
|
## v0.17.0 — 2026-05-17
|
||||||
|
|
||||||
|
The "field rescue + DB management" release. Hardened against units that are stuck in a runaway call-home loop, and added an operator-facing path for purging bogus events that those same units dump into the DB before recovery. All work in this release was driven by the BE9558H incident (full incident log + recovery procedure at `docs/runbooks/wedged_unit_recovery.md`).
|
||||||
|
|
||||||
|
### Added — wedged-unit recovery toolkit
|
||||||
|
|
||||||
|
A toolkit for breaking the call-home loop on a misbehaving unit whose firmware is too busy to keep up with normal request/response handshakes. Tested in production against BE9558H (16 May 2026) — a unit with a stuck-triggered Long-axis geophone that had been call-homing the office BW ACH server every 30 seconds for hours. Endpoints layered from "single attempt" to "siege mode" to suit different contention levels:
|
||||||
|
|
||||||
|
- **`GET /device/events/storage_range`** — SUB 0x06 probe. POLL + one read; ~2s. Returns first/last event keys and an `is_empty` flag. Use to triage whether a unit has stored events without invoking the slow `count_events()` 1E/1F chain (which choked on BE9558H's corrupted event chain).
|
||||||
|
- **`GET /device/events/index`** — SUB 0x08 probe. POLL + one read; ~2s. Returns the lifetime event counter (does NOT decrement on erase — use `storage_range` for "right now" state).
|
||||||
|
- **`POST /device/events/erase`** — full erase sequence `0xA3 → 0x1C → 0x06 → 0xA2` (confirmed 2026-04-11, see the protocol reference). Resets event keys to `0x01110000`. Caller's responsibility to disable ACH first if the underlying trigger condition will re-fill the buffer.
|
||||||
|
- **`POST /device/rescue`** — one TCP session, short connect+recv timeouts: POLL → disable ACH (compliance config write) → erase events → close. Designed for race-loop usage when the device is busy in another session. 503 on connect-refused, 502 on protocol failure, 200 on full sequence success.
|
||||||
|
- **`POST /device/stop_monitoring_blind`** — fire-and-forget Stop Monitoring (SUB 0x97), TCP-only. Dumps `SESSION_RESET + POLL_PROBE + SESSION_RESET + POLL_DATA + 0x97 × repeat` and closes without reading any S3 response. The full POLL preamble is required — write commands without it are silently ignored by the device's protocol parser (false-positive surface area that bit the first version of this endpoint). Use when the device's firmware can't keep up with full request/response but might process inbound bytes at its own pace.
|
||||||
|
- **`POST /device/stop_monitoring_spam`** — server-side hammer loop, duration-bounded. Open TCP → write the same blind payload → close → repeat as fast as possible until `duration_s` elapses. Configurable `connect_timeout` (default 500ms) and `repeat` (frames per session). Reports `sent_ok`, `connect_failed`, `write_failed`, `rate_attempts_per_s`. Clamped to 5min duration.
|
||||||
|
- **`POST /device/stop_monitoring_slow_drip`** — opposite of spam. Open ONE TCP session, drip the wake handshake + stop frames at `interval_s` (default 3s) for `duration_s` (default 120s, max 10min). Each drip is ~23 bytes — well under any UART FIFO size. Opportunistically drains any inbound bytes the device sends back; `bytes_received > 0` in the response strongly suggests the device has started talking and the session is healthy. **This is the endpoint that saved BE9558H.** Spam mode had been overrunning the device's UART FIFO; slow drip stayed under it.
|
||||||
|
- **Six rescue scripts** under `scripts/` — thin bash wrappers around the endpoints, default `SFM_BASE_URL=http://localhost:8200` (direct, not via Terra-View proxy whose 60s timeout would cut off the longer endpoints):
|
||||||
|
- `rescue_device.sh` — race-loop wrapper for `/device/rescue`
|
||||||
|
- `blind_stop.sh` — race-loop wrapper for `/device/stop_monitoring_blind`
|
||||||
|
- `spam_stop.sh` — single-call burst hammer
|
||||||
|
- `slow_drip.sh` — single-call held-session drip
|
||||||
|
- `watch_unit.sh` — passive periodic reachability check (every N min, logs to file), useful for unattended overnight monitoring of a wedged unit
|
||||||
|
- **`docs/runbooks/wedged_unit_recovery.md`** — symptoms, quick-reference recovery procedure, the modem-layer mechanism (Sierra Wireless serial-port mode-flipping is the real failure mode — not the device firmware), and a table of "why simpler approaches don't work" so the next incident skips the dead ends.
|
||||||
|
|
||||||
|
### Added — operator event DB management
|
||||||
|
|
||||||
|
Endpoints powering Terra-View's new `/admin/events` page (v0.12.0). Designed for purging bogus events from a unit that's been forwarding them in bulk (e.g. a stuck-triggered seismograph dumping hundreds of junk events before it's recovered).
|
||||||
|
|
||||||
|
- **`DELETE /db/events/{event_id}`** — hard-delete one event row. Also unlinks the associated blastware binary (`.AB0*`), `.a5.pkl`, `.sfm.json` sidecar, and `.h5` clean-waveform files via the WaveformStore. Returns the per-file removal status. 404 if the event doesn't exist.
|
||||||
|
- **`POST /db/events/delete_bulk`** — filter-based or id-list-based bulk delete with safety rails:
|
||||||
|
- Filters (`serial`, `from_dt`, `to_dt`, `false_trigger`) combine with AND; same semantics as `GET /db/events`. `ids` is an additional inclusion list. Refuses to run with no filters (would wipe the whole table — raises 422).
|
||||||
|
- `confirm` must be `true` to actually delete. Otherwise returns a dry-run summary (`status: "dry_run"`, `matched: N`, `sample_serials: [...]`).
|
||||||
|
- `max_rows` (default 10,000) caps how many rows can be deleted by-filter in one call. If exceeded, returns `status: "too_many"` with a hint to narrow or raise the cap. Bypassed when only `ids` is supplied.
|
||||||
|
- **`_cleanup_event_files(row)`** helper in `sfm/server.py` — best-effort `unlink()` of all four sidecar paths derived from the row's `blastware_filename`. Logged at WARN if a path exists but unlink fails; the DB row deletion still proceeds.
|
||||||
|
- **`SeismoDb.delete_event(id)` and `SeismoDb.delete_events_bulk(...)`** in `sfm/database.py` — both return the deleted row dict(s) so callers can do file cleanup. `delete_events_bulk` raises `ValueError` if no filters are supplied.
|
||||||
|
|
||||||
|
### Changed
|
||||||
|
|
||||||
|
- **Default protocol recv timeout dropped from 30s → 10s** in `_build_client()`. The unit usually responds in well under a second over cellular; 10s leaves comfortable headroom for retransmits while failing reasonably fast when a unit is wedged. The two endpoints that perform full 5A waveform downloads still pass `timeout=120.0` explicitly so multi-minute event transfers are unaffected.
|
||||||
|
- **`_build_client()` now accepts an optional `connect_timeout`** (TCP-only) so rescue / race-loop endpoints can fail fast on busy modems without affecting the protocol-level recv timeout.
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
|
||||||
|
- **`GET /device/monitor/status` returned HTTP 500 + uncaught traceback when the device was unresponsive**. The retry-on-`Exception` inner block let the second `client.poll()`'s `ProtocolError` propagate out of the handler. Now wrapped in proper try/except — returns 502 with `{"detail": "Protocol error: No S3 frame received within 10.0s ..."}` on timeout, 502 on connection errors, 500 only for genuinely unexpected exceptions.
|
||||||
|
|
||||||
|
### Migration
|
||||||
|
|
||||||
|
No schema changes. No data migration required.
|
||||||
|
|
||||||
|
If you've been running a previous version against a wedged unit and accumulated bogus events, the new `/admin/events` page in Terra-View v0.12.0 (or direct `POST /db/events/delete_bulk` with `confirm: true`) is the cleanup tool. Watcher state on the upstream DL2 PC does NOT need separate cleaning — the watcher's `sfm_forwarded.json` keys on file sha256 and won't re-forward the same files.
|
||||||
|
|
||||||
|
### Pairing
|
||||||
|
|
||||||
|
This release pairs with **Terra-View v0.12.0**, which adds the `/admin/events` UI that consumes the new bulk-delete endpoints, the bulk false-trigger flagging on `/unit/{id}`, and the field-deployment workflow that uses the same `series3-watcher` → SFM ingest path as before.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## v0.16.1 — 2026-05-14
|
||||||
|
|
||||||
|
### Fixed
|
||||||
|
|
||||||
|
- **`record_type` always "Waveform" for forwarded events.** `read_blastware_file()` hardcoded `ev.record_type = "Waveform"` regardless of the file's actual type. The watcher-forward pipeline (the main BW ACH ingest path) compounds this by parsing files from a tmp path with a `.bw` suffix, so even a filename-based fallback inside the parser still wouldn't see the original extension. Now:
|
||||||
|
|
||||||
|
1. New `derive_record_type_from_filename(filename)` helper in `minimateplus/event_file_io.py` derives the type from the LAST character of the filename's extension (V10.72+ AB0T scheme: `H`=Histogram, `W`=Waveform, `M`=Manual, `E`=Event, `C`=Combo). Falls back to `"Waveform"` for old S338 firmware (3-char extensions ending in `0`) and any unrecognized suffix.
|
||||||
|
2. `read_blastware_file()` now calls the helper with its `path.name` so direct callers (the `--dry-run` path in `scripts/import_bw.py`, tests, ad-hoc scripts) get the right value automatically.
|
||||||
|
3. `WaveformStore.save_imported_bw()` overrides `ev.record_type` with the **original** filename's derived type after parsing (the tmp file inside the parser doesn't carry the original extension). This is the path the live watcher-forwarder hits, so the DB column now reflects the actual event type going forward.
|
||||||
|
|
||||||
|
Events ingested before this fix are stuck with `record_type="Waveform"` in the DB; a one-off backfill (`UPDATE events SET record_type = ... WHERE blastware_filename LIKE '%H'`) would fix them retroactively if desired. Terra-view's event modal also derives client-side from the filename, so the UI already shows the correct type for old events even without the backfill.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
## v0.16.0 — 2026-05-11
|
## v0.16.0 — 2026-05-11
|
||||||
|
|
||||||
The "BW ACH ingestion" release. When paired with **series3-watcher v1.5.0**, every Blastware ACH event (binary + `_ASCII.TXT` report) lands in SeismoDb with device-authoritative peaks, project metadata, sensor self-check, and ZC/Time-of-Peak data — without depending on the still-undecoded waveform body codec. This is the end-to-end product win discussed in v0.15.0's "out of scope" notes: sortable / filterable monthly-summary review of historical events, populated from the BW ASCII export rather than re-decoded samples.
|
The "BW ACH ingestion" release. When paired with **series3-watcher v1.5.0**, every Blastware ACH event (binary + `_ASCII.TXT` report) lands in SeismoDb with device-authoritative peaks, project metadata, sensor self-check, and ZC/Time-of-Peak data — without depending on the still-undecoded waveform body codec. This is the end-to-end product win discussed in v0.15.0's "out of scope" notes: sortable / filterable monthly-summary review of historical events, populated from the BW ASCII export rather than re-decoded samples.
|
||||||
|
|||||||
@@ -2,7 +2,7 @@
|
|||||||
|
|
||||||
Ground-up Python replacement for **Blastware**, Instantel's Windows-only software for
|
Ground-up Python replacement for **Blastware**, Instantel's Windows-only software for
|
||||||
managing MiniMate Plus seismographs. Connects over direct RS-232 or cellular modem
|
managing MiniMate Plus seismographs. Connects over direct RS-232 or cellular modem
|
||||||
(Sierra Wireless RV50 / RV55). Current version: **v0.14.3**.
|
(Sierra Wireless RV50 / RV55). Current version: **v0.17.0**.
|
||||||
|
|
||||||
When new information about the protocol is discovered, please update the instantel_protocol_reference.md with the findings in addition to this document
|
When new information about the protocol is discovered, please update the instantel_protocol_reference.md with the findings in addition to this document
|
||||||
|
|
||||||
|
|||||||
+18
@@ -0,0 +1,18 @@
|
|||||||
|
FROM python:3.11-slim
|
||||||
|
|
||||||
|
WORKDIR /app
|
||||||
|
|
||||||
|
RUN apt-get update && \
|
||||||
|
apt-get install -y --no-install-recommends curl && \
|
||||||
|
rm -rf /var/lib/apt/lists/*
|
||||||
|
|
||||||
|
COPY pyproject.toml requirements.txt ./
|
||||||
|
COPY minimateplus ./minimateplus
|
||||||
|
COPY sfm ./sfm
|
||||||
|
COPY bridges ./bridges
|
||||||
|
|
||||||
|
RUN pip install --no-cache-dir -e .
|
||||||
|
|
||||||
|
EXPOSE 8200
|
||||||
|
|
||||||
|
CMD ["python", "-m", "uvicorn", "sfm.server:app", "--host", "0.0.0.0", "--port", "8200"]
|
||||||
@@ -1,4 +1,4 @@
|
|||||||
# seismo-relay `v0.16.0`
|
# seismo-relay `v0.17.0`
|
||||||
|
|
||||||
A ground-up replacement for **Blastware** — Instantel's aging Windows-only
|
A ground-up replacement for **Blastware** — Instantel's aging Windows-only
|
||||||
software for managing MiniMate Plus seismographs.
|
software for managing MiniMate Plus seismographs.
|
||||||
|
|||||||
@@ -0,0 +1,255 @@
|
|||||||
|
# Runbook — Recovering a wedged unit stuck in a call-home loop
|
||||||
|
|
||||||
|
**Original incident:** BE9558H at `166.246.130.1:9034`, recovered 2026-05-17.
|
||||||
|
|
||||||
|
A field unit with a stuck-triggered geophone (or any hardware fault causing
|
||||||
|
constant event triggering) will record events back-to-back, and if Auto Call
|
||||||
|
Home is set to "After Event Recorded" the device will dial the office BW
|
||||||
|
ACH server in a tight loop. Combined with a Sierra Wireless modem in
|
||||||
|
bidirectional serial-TCP mode, this makes the unit effectively unreachable
|
||||||
|
from SFM — every TCP connection we open gets killed when the modem flips
|
||||||
|
from server-mode to client-mode to honor the device's next AT dial command.
|
||||||
|
|
||||||
|
This runbook describes how to break the loop and recover control.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Symptoms
|
||||||
|
|
||||||
|
- Terra-View / SFM `/device/info` either hangs or fails on `count_events()`.
|
||||||
|
- `/device/monitor/status` and `/device/rescue` return 502 (protocol timeout
|
||||||
|
waiting for POLL response) or 503 (TCP connect refused).
|
||||||
|
- ACEmanager serial log shows repeating
|
||||||
|
`Connect to IP: <BW_IP> Port: <BW_PORT>` → `Shutdown TCP socket` cycles
|
||||||
|
every 30-60 seconds.
|
||||||
|
- Spam-mode endpoints (`/device/stop_monitoring_spam`) report many
|
||||||
|
`sent_ok` but the device's monitoring state never changes.
|
||||||
|
- `slow_drip` reports `[Errno 32] Broken pipe` after sending the preamble
|
||||||
|
but before completing the drip loop.
|
||||||
|
|
||||||
|
If you see *all* of these, the unit is in this exact failure mode.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Quick reference — how to recover
|
||||||
|
|
||||||
|
You need **ACEmanager access** to the unit's modem.
|
||||||
|
|
||||||
|
### Step 1: stop the modem's mode-flipping
|
||||||
|
|
||||||
|
In ACEmanager → **Serial → Port Configuration**:
|
||||||
|
|
||||||
|
| Field | Set to |
|
||||||
|
|---|---|
|
||||||
|
| **Destination Address** | clear (blank) |
|
||||||
|
| **Destination Port** | `0` |
|
||||||
|
|
||||||
|
Click **Apply**. This removes the modem's auto-dial-out target. The device's
|
||||||
|
AT dial commands now error back at the modem instead of triggering a
|
||||||
|
mode-flip, so the modem stays in TCP-server mode permanently and our inbound
|
||||||
|
TCP sessions stay alive.
|
||||||
|
|
||||||
|
*(Optional belt-and-suspenders: also add the BW server's port to
|
||||||
|
**Security → Port Filtering - Outbound** as a blocked port, with
|
||||||
|
Outbound Port Filtering Mode = Blocked Ports.)*
|
||||||
|
|
||||||
|
### Step 2: stop monitoring on the device (slow drip)
|
||||||
|
|
||||||
|
From the SFM host:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
/home/serversdown/seismo-relay/scripts/slow_drip.sh <DEVICE_IP> <PORT>
|
||||||
|
```
|
||||||
|
|
||||||
|
Defaults are 120s duration with a drip every 3s. Watch the response:
|
||||||
|
|
||||||
|
- `duration_s ≈ 120` and `drips_sent ≈ 40` → session held the full duration ✓
|
||||||
|
- `bytes_received > 0` → device is responding ✓ (this is the success signal)
|
||||||
|
|
||||||
|
If `duration_s` is small or `send_error: "Broken pipe"`, Step 1 didn't take
|
||||||
|
hold — re-check ACEmanager, may need to reboot the modem after Apply.
|
||||||
|
|
||||||
|
### Step 3: confirm monitoring stopped
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl 'http://localhost:8200/device/monitor/status?host=<DEVICE_IP>&tcp_port=<PORT>&force=true'
|
||||||
|
# expect: {"is_monitoring": false, ...}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Step 4: disable ACH at the device level + erase corrupted events
|
||||||
|
|
||||||
|
Either fire the rescue endpoint:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
/home/serversdown/seismo-relay/scripts/rescue_device.sh <DEVICE_IP> <PORT>
|
||||||
|
```
|
||||||
|
|
||||||
|
Or do the two steps manually:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
# Disable ACH in the device's compliance config
|
||||||
|
curl -X POST 'http://localhost:8200/device/call_home?host=<DEVICE_IP>&tcp_port=<PORT>' \
|
||||||
|
-H 'Content-Type: application/json' \
|
||||||
|
-d '{"auto_call_home_enabled": false}'
|
||||||
|
|
||||||
|
# Erase corrupted event chain
|
||||||
|
curl -X POST 'http://localhost:8200/device/events/erase?host=<DEVICE_IP>&tcp_port=<PORT>'
|
||||||
|
```
|
||||||
|
|
||||||
|
You can also do this via the SFM standalone UI → **Call Home** tab → set
|
||||||
|
`Enable Auto Call Home` to `Disabled` → **Write to Device**.
|
||||||
|
|
||||||
|
### Step 5: restore modem config (housekeeping)
|
||||||
|
|
||||||
|
Once the device-side ACH is disabled, restore the modem's Destination
|
||||||
|
Address and Port to the original values (e.g. `50.197.32.92` / `12345`) in
|
||||||
|
ACEmanager. The modem will resume normal bidirectional behavior, but the
|
||||||
|
unit won't issue any dial commands until ACH is explicitly re-enabled on
|
||||||
|
the device.
|
||||||
|
|
||||||
|
### Step 6: do NOT re-enable ACH on this unit until the underlying hardware
|
||||||
|
fault is repaired. If you do, the call-home loop starts again immediately
|
||||||
|
and you'll be running this runbook a second time.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why this works — the failure mode explained
|
||||||
|
|
||||||
|
The Sierra Wireless RV50/RV55 serial port operates in one of two TCP modes
|
||||||
|
at any moment:
|
||||||
|
|
||||||
|
- **Server mode** — listens on `Device Port` (e.g. 9034), bridges inbound
|
||||||
|
TCP to the device's serial port. This is what we need to interact with
|
||||||
|
the device.
|
||||||
|
- **Client mode** — when the device sends an AT dial command on its serial
|
||||||
|
TX line, the modem opens an outbound TCP to `Destination Address:Port`
|
||||||
|
and bridges that to serial.
|
||||||
|
|
||||||
|
A serial port in this configuration is **bidirectional**: the modem flips
|
||||||
|
between server and client modes on demand. When the device's firmware is
|
||||||
|
healthy and only dials occasionally, this works fine.
|
||||||
|
|
||||||
|
When the unit is constantly triggering events and ACH is set to "After
|
||||||
|
Event Recorded", the device sends an AT dial command every few seconds.
|
||||||
|
Each one causes the modem to:
|
||||||
|
|
||||||
|
1. Drop any active inbound TCP session
|
||||||
|
2. Flip to client mode
|
||||||
|
3. Attempt outbound TCP to `Destination Address:Port`
|
||||||
|
4. Hang for up to a minute waiting for it to succeed/fail
|
||||||
|
5. Drop back to server mode
|
||||||
|
|
||||||
|
**During the entire hang, no inbound TCP can establish.** Even between
|
||||||
|
hangs, the modem closes any existing inbound session before flipping. So
|
||||||
|
any tool that needs more than a few seconds of held TCP (e.g. POLL +
|
||||||
|
config read + write) gets repeatedly kicked off.
|
||||||
|
|
||||||
|
Clearing `Destination Address` removes step 3-4 from the cycle: the modem
|
||||||
|
has nowhere to dial, so it doesn't flip modes when it receives an AT dial
|
||||||
|
command. The serial port effectively becomes server-only, and inbound TCP
|
||||||
|
sessions can stay open as long as needed.
|
||||||
|
|
||||||
|
**This is a modem-layer issue, not a device firmware issue.** The device
|
||||||
|
is alive and responsive the whole time — confirmed in the BE9558H
|
||||||
|
recovery by 990 bytes of S3 responses received over a 120s slow-drip
|
||||||
|
session once the modem was no longer mode-flipping.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Why simpler approaches don't work
|
||||||
|
|
||||||
|
| Approach | Why it fails |
|
||||||
|
|---|---|
|
||||||
|
| Standard `/device/info` | Triggers `count_events()` 1E/1F walk, takes 90s+ and hits corrupted event chain in this scenario |
|
||||||
|
| `/device/rescue` race loop | Gets 502 (protocol timeout) because the modem closes the TCP before the POLL handshake can complete |
|
||||||
|
| `/device/stop_monitoring_blind` (single frame) | Even if the bytes leave the wire, the device's protocol parser ignores write commands without a preceding POLL handshake (early-version bug, now fixed by including POLL preamble in blind sends) |
|
||||||
|
| `/device/stop_monitoring_spam` (sub-second cadence) | Each session is killed by the modem's mode-flip before the device can drain its UART RX buffer; high-rate spam also risks UART FIFO overrun on the device side |
|
||||||
|
| Outbound port firewall block alone | Stops the outbound TCP from succeeding, but doesn't stop the modem from *trying* and mode-flipping. Reduces but doesn't eliminate the contention. |
|
||||||
|
| Modem reboot | Temporary — as soon as the device starts triggering again, the loop resumes within seconds |
|
||||||
|
|
||||||
|
The combination of `slow_drip` + cleared `Destination Address` works because:
|
||||||
|
|
||||||
|
1. The modem stops mode-flipping → TCP session stays open for the full
|
||||||
|
drip duration
|
||||||
|
2. Slow drip rate → device's UART RX FIFO never overflows even if
|
||||||
|
firmware is busy with event recording
|
||||||
|
3. The drip is `SESSION_RESET + STOP_MONITORING` every 3s → many
|
||||||
|
independent chances for the parser to land one valid frame
|
||||||
|
4. Once one Stop Monitoring is parsed, event recording halts → firmware
|
||||||
|
has CPU to spare → subsequent operations are trivially easy
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Tooling reference
|
||||||
|
|
||||||
|
All endpoints live in `seismo-relay/sfm/server.py`. All scripts live in
|
||||||
|
`seismo-relay/scripts/` and default to SFM direct (`http://localhost:8200`),
|
||||||
|
overridable via `SFM_BASE_URL`.
|
||||||
|
|
||||||
|
### Endpoints added during BE9558H recovery
|
||||||
|
|
||||||
|
| Endpoint | Purpose |
|
||||||
|
|---|---|
|
||||||
|
| `GET /device/events/storage_range` | SUB 0x06 — first/last event keys, `is_empty` flag. ~2s, no event walk. |
|
||||||
|
| `GET /device/events/index` | SUB 0x08 — lifetime event counter (does NOT decrement on erase). ~2s. |
|
||||||
|
| `POST /device/events/erase` | Full erase sequence 0xA3 → 0x1C → 0x06 → 0xA2. |
|
||||||
|
| `POST /device/rescue` | Disable ACH + erase in one TCP session. Short timeouts for race-loop usage. |
|
||||||
|
| `POST /device/stop_monitoring_blind` | Fire-and-forget Stop with full POLL preamble (single attempt). |
|
||||||
|
| `POST /device/stop_monitoring_spam` | Server-side tight retry loop, sub-second cadence, duration-bounded. |
|
||||||
|
| `POST /device/stop_monitoring_slow_drip` | One held TCP session, slow trickle of stop frames. **The endpoint that saved BE9558H.** |
|
||||||
|
|
||||||
|
Also changed: default protocol recv timeout dropped from 30s → 10s in
|
||||||
|
`_build_client`. Added `connect_timeout` knob to same. Cleaned up
|
||||||
|
unhandled-exception path in `/device/monitor/status` so it returns 502
|
||||||
|
instead of 500 on protocol timeouts.
|
||||||
|
|
||||||
|
### Scripts
|
||||||
|
|
||||||
|
| Script | Purpose |
|
||||||
|
|---|---|
|
||||||
|
| `scripts/rescue_device.sh` | Race-loop wrapper around `/device/rescue` |
|
||||||
|
| `scripts/blind_stop.sh` | Race-loop wrapper around `/device/stop_monitoring_blind` |
|
||||||
|
| `scripts/spam_stop.sh` | Single-call burst hammer (`/device/stop_monitoring_spam`) |
|
||||||
|
| `scripts/slow_drip.sh` | Single-call held-session drip (`/device/stop_monitoring_slow_drip`) |
|
||||||
|
| `scripts/watch_unit.sh` | Passive periodic reachability check, logs to file |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Incident log — BE9558H, 2026-05-16/17
|
||||||
|
|
||||||
|
What was wrong: Long-axis geophone developed an offset, constantly above
|
||||||
|
trigger threshold → constant event recording → after-event ACH set →
|
||||||
|
modem dialing office BW server (`50.197.32.92:12345`) every 30-60s.
|
||||||
|
Local event chain corrupted (`next_boundary 0x100EE exceeds uint16`).
|
||||||
|
|
||||||
|
Diagnostic path:
|
||||||
|
|
||||||
|
1. `/device/info` slow, choked on event walk
|
||||||
|
2. Built lightweight probe endpoints (`storage_range`, `index`) — useful
|
||||||
|
but didn't reach the wedged unit
|
||||||
|
3. Built `/device/rescue` with short timeouts — got 502 (POLL no response)
|
||||||
|
4. Built `/device/stop_monitoring_blind` — first version was a false
|
||||||
|
positive (no POLL preamble); fixed by including
|
||||||
|
`SESSION_RESET+POLL_PROBE+SESSION_RESET+POLL_DATA` in the dump
|
||||||
|
5. Verified blind stop works on bench unit
|
||||||
|
6. Built `/device/stop_monitoring_spam` — 420 successful sends over
|
||||||
|
5 min, zero behavior change on field unit
|
||||||
|
7. Inspected ACEmanager logs → saw outbound dial-out attempts every ~30s,
|
||||||
|
confirmed device was not fully locked up
|
||||||
|
8. Added outbound port-12345 firewall block → outbound attempts now fail
|
||||||
|
instantly but contention persisted
|
||||||
|
9. Built `/device/stop_monitoring_slow_drip` — session died at 3s with
|
||||||
|
broken pipe (modem closing on us)
|
||||||
|
10. Looked at full ACEmanager Port Configuration → **found
|
||||||
|
`Destination Address: 50.197.32.92` configured**, realized every AT
|
||||||
|
dial command was triggering a modem mode-flip that killed our inbound
|
||||||
|
11. Cleared Destination Address + Port → slow_drip held 120s, device
|
||||||
|
responded with 990 bytes, 39 stop commands acked
|
||||||
|
12. Disabled ACH at device level via `/device/call_home`, erased events
|
||||||
|
|
||||||
|
Final state: device IDLE, memory 958.1 / 960 KB free, ACH disabled at
|
||||||
|
device level, modem destination cleared (to be restored after physical
|
||||||
|
service).
|
||||||
|
|
||||||
|
Total time from "i was wondering if its possible to" first attempt to
|
||||||
|
recovery: ~7 hours of intermittent debugging across one evening.
|
||||||
@@ -47,7 +47,7 @@ SIDECAR_KIND = "sfm.event"
|
|||||||
# bumped without a `pip install` re-run — leading to confusing stale
|
# bumped without a `pip install` re-run — leading to confusing stale
|
||||||
# version stamps in sidecars. Bump this constant and CHANGELOG.md
|
# version stamps in sidecars. Bump this constant and CHANGELOG.md
|
||||||
# together at release time.
|
# together at release time.
|
||||||
TOOL_VERSION = "0.16.0"
|
TOOL_VERSION = "0.16.1"
|
||||||
|
|
||||||
try:
|
try:
|
||||||
# Best-effort: prefer the installed metadata when it's NEWER than the
|
# Best-effort: prefer the installed metadata when it's NEWER than the
|
||||||
@@ -646,6 +646,50 @@ def _peaks_from_samples(samples: dict[str, list[int]]) -> PeakValues:
|
|||||||
)
|
)
|
||||||
|
|
||||||
|
|
||||||
|
_RECORD_TYPE_BY_EXT_SUFFIX = {
|
||||||
|
'H': 'Histogram',
|
||||||
|
'W': 'Waveform',
|
||||||
|
'M': 'Manual',
|
||||||
|
'E': 'Event',
|
||||||
|
'C': 'Combo',
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def derive_record_type_from_filename(filename, default: str = "Waveform") -> str:
|
||||||
|
"""Derive a BW Event's record_type from its filename's extension suffix.
|
||||||
|
|
||||||
|
V10.72+ MiniMate Plus firmware encodes the event type as the LAST
|
||||||
|
character of the extension (the `T` in BW's `AB0T` scheme):
|
||||||
|
|
||||||
|
``M529LKIQ.G10H`` → H → ``"Histogram"``
|
||||||
|
``T350L385.VY0W`` → W → ``"Waveform"``
|
||||||
|
``...M`` → M → ``"Manual"``
|
||||||
|
``...E`` → E → ``"Event"``
|
||||||
|
``...C`` → C → ``"Combo"``
|
||||||
|
|
||||||
|
Old S338 firmware uses 3-char extensions ending in ``0`` whose
|
||||||
|
encoding is not yet known — those fall through to ``default``.
|
||||||
|
Micromate Series 4 uses a different scheme entirely (observed:
|
||||||
|
``IDFH``, ``IDFW``) but the LAST-char convention (H / W) still holds
|
||||||
|
for the type code, so it works for both families.
|
||||||
|
|
||||||
|
Returns ``default`` if filename is empty, has no extension, or the
|
||||||
|
suffix char isn't a recognized type code.
|
||||||
|
"""
|
||||||
|
if not filename:
|
||||||
|
return default
|
||||||
|
try:
|
||||||
|
name = Path(filename).name
|
||||||
|
except (TypeError, ValueError):
|
||||||
|
return default
|
||||||
|
if '.' not in name:
|
||||||
|
return default
|
||||||
|
ext = name.rsplit('.', 1)[1]
|
||||||
|
if not ext:
|
||||||
|
return default
|
||||||
|
return _RECORD_TYPE_BY_EXT_SUFFIX.get(ext[-1].upper(), default)
|
||||||
|
|
||||||
|
|
||||||
def read_blastware_file(path: Union[str, Path]) -> Event:
|
def read_blastware_file(path: Union[str, Path]) -> Event:
|
||||||
"""
|
"""
|
||||||
Parse a Blastware waveform file into an Event.
|
Parse a Blastware waveform file into an Event.
|
||||||
@@ -727,7 +771,12 @@ def read_blastware_file(path: Union[str, Path]) -> Event:
|
|||||||
ev = Event(index=-1)
|
ev = Event(index=-1)
|
||||||
if strt_fields.get("waveform_key"):
|
if strt_fields.get("waveform_key"):
|
||||||
ev._waveform_key = bytes.fromhex(strt_fields["waveform_key"])
|
ev._waveform_key = bytes.fromhex(strt_fields["waveform_key"])
|
||||||
ev.record_type = "Waveform"
|
# Derive record_type from the filename's extension suffix (H/W/M/E/C).
|
||||||
|
# When called from save_imported_bw the path here is a tmp file with a
|
||||||
|
# ".bw" suffix, so the derivation falls back to "Waveform" and the
|
||||||
|
# caller overrides ev.record_type using the original filename — see
|
||||||
|
# waveform_store.save_imported_bw.
|
||||||
|
ev.record_type = derive_record_type_from_filename(path.name)
|
||||||
ev.rectime_seconds = strt_fields.get("rectime_seconds")
|
ev.rectime_seconds = strt_fields.get("rectime_seconds")
|
||||||
ev.total_samples = strt_fields.get("total_samples")
|
ev.total_samples = strt_fields.get("total_samples")
|
||||||
ev.pretrig_samples = strt_fields.get("pretrig_samples")
|
ev.pretrig_samples = strt_fields.get("pretrig_samples")
|
||||||
|
|||||||
+1
-1
@@ -4,7 +4,7 @@ build-backend = "setuptools.build_meta"
|
|||||||
|
|
||||||
[project]
|
[project]
|
||||||
name = "seismo-relay"
|
name = "seismo-relay"
|
||||||
version = "0.16.0"
|
version = "0.17.0"
|
||||||
description = "Python client and REST server for MiniMate Plus seismographs"
|
description = "Python client and REST server for MiniMate Plus seismographs"
|
||||||
requires-python = ">=3.10"
|
requires-python = ">=3.10"
|
||||||
dependencies = [
|
dependencies = [
|
||||||
|
|||||||
@@ -0,0 +1,150 @@
|
|||||||
|
"""
|
||||||
|
scripts/backfill_record_type.py — fix `record_type` on legacy event
|
||||||
|
rows whose value was hardcoded to "Waveform" regardless of actual type.
|
||||||
|
|
||||||
|
Why this is needed
|
||||||
|
──────────────────
|
||||||
|
Pre-v0.16.1 the BW file importer (`event_file_io.read_blastware_file`)
|
||||||
|
hardcoded `ev.record_type = "Waveform"` for every imported event. Fixed
|
||||||
|
in commit aac1c8e — new ingests now derive the type from the Blastware
|
||||||
|
filename's extension last character (H=Histogram, W=Waveform, M=Manual,
|
||||||
|
E=Event, C=Combo) per the V10.72+ MiniMate Plus AB0T filename scheme.
|
||||||
|
|
||||||
|
Effect on a server that imported events under the old code: every
|
||||||
|
events row has `record_type = "Waveform"`, even for histograms,
|
||||||
|
manuals, etc. Visible in terra-view's event-detail modal under the
|
||||||
|
"Record Type" field. Terra-view also has a client-side workaround
|
||||||
|
that derives the type from the filename for display purposes, so
|
||||||
|
operators see the correct type in the UI even before this backfill.
|
||||||
|
This script makes the DB column match what the UI is already showing,
|
||||||
|
which matters for reporting and any downstream consumer that reads
|
||||||
|
events.record_type directly.
|
||||||
|
|
||||||
|
This script
|
||||||
|
───────────
|
||||||
|
Walks the `events` table and updates each row's `record_type` to the
|
||||||
|
derived value from its `blastware_filename`. Old S338 firmware files
|
||||||
|
(3-char extensions ending in `0`) and any unrecognized suffix get
|
||||||
|
left at the existing value (defaults to "Waveform").
|
||||||
|
|
||||||
|
Idempotent: re-running after a successful backfill finds zero rows
|
||||||
|
needing updates and exits cleanly (it always re-derives but only
|
||||||
|
writes when the value would change).
|
||||||
|
|
||||||
|
Usage
|
||||||
|
─────
|
||||||
|
# Dry-run (default): print what would change, don't touch the DB
|
||||||
|
python -m scripts.backfill_record_type --db bridges/captures/seismo_relay.db
|
||||||
|
|
||||||
|
# Apply the backfill
|
||||||
|
python -m scripts.backfill_record_type --db bridges/captures/seismo_relay.db --apply
|
||||||
|
"""
|
||||||
|
|
||||||
|
from __future__ import annotations
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import sqlite3
|
||||||
|
import sys
|
||||||
|
from collections import Counter
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
|
||||||
|
# Must stay in sync with minimateplus.event_file_io._RECORD_TYPE_BY_EXT_SUFFIX.
|
||||||
|
_TYPE_FROM_SUFFIX = {
|
||||||
|
"H": "Histogram",
|
||||||
|
"W": "Waveform",
|
||||||
|
"M": "Manual",
|
||||||
|
"E": "Event",
|
||||||
|
"C": "Combo",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def derive_record_type(filename: str | None, default: str = "Waveform") -> str:
|
||||||
|
"""Mirror of minimateplus.event_file_io.derive_record_type_from_filename.
|
||||||
|
|
||||||
|
Vendored here so this script runs without needing the seismo-relay
|
||||||
|
package on the Python path (useful on prod where you might be
|
||||||
|
running it via `docker exec` against a container's DB volume).
|
||||||
|
"""
|
||||||
|
if not filename:
|
||||||
|
return default
|
||||||
|
name = Path(filename).name
|
||||||
|
if "." not in name:
|
||||||
|
return default
|
||||||
|
ext = name.rsplit(".", 1)[1]
|
||||||
|
if not ext:
|
||||||
|
return default
|
||||||
|
return _TYPE_FROM_SUFFIX.get(ext[-1].upper(), default)
|
||||||
|
|
||||||
|
|
||||||
|
def main() -> int:
|
||||||
|
ap = argparse.ArgumentParser(description=__doc__)
|
||||||
|
ap.add_argument("--db", required=True, help="Path to seismo_relay.db")
|
||||||
|
ap.add_argument("--apply", action="store_true",
|
||||||
|
help="Actually write changes (default is dry-run).")
|
||||||
|
ap.add_argument("--default", default="Waveform",
|
||||||
|
help="Fallback record_type when filename doesn't encode one. "
|
||||||
|
"Default: Waveform (matches the pre-fix bug's behavior).")
|
||||||
|
args = ap.parse_args()
|
||||||
|
|
||||||
|
db_path = Path(args.db)
|
||||||
|
if not db_path.exists():
|
||||||
|
print(f"ERROR: database not found at {db_path}", file=sys.stderr)
|
||||||
|
return 1
|
||||||
|
|
||||||
|
conn = sqlite3.connect(str(db_path))
|
||||||
|
conn.row_factory = sqlite3.Row
|
||||||
|
cur = conn.cursor()
|
||||||
|
|
||||||
|
cur.execute("""
|
||||||
|
SELECT id, blastware_filename, record_type
|
||||||
|
FROM events
|
||||||
|
WHERE blastware_filename IS NOT NULL
|
||||||
|
AND blastware_filename != ''
|
||||||
|
""")
|
||||||
|
rows = cur.fetchall()
|
||||||
|
total = len(rows)
|
||||||
|
print(f"Scanning {total:,} event rows…")
|
||||||
|
print()
|
||||||
|
|
||||||
|
# Tally proposed changes.
|
||||||
|
transitions: Counter[tuple[str, str]] = Counter()
|
||||||
|
update_ids: list[tuple[str, str]] = []
|
||||||
|
unrecognized = 0
|
||||||
|
|
||||||
|
for row in rows:
|
||||||
|
derived = derive_record_type(row["blastware_filename"], default=args.default)
|
||||||
|
current = row["record_type"] or ""
|
||||||
|
if derived == current:
|
||||||
|
continue
|
||||||
|
transitions[(current, derived)] += 1
|
||||||
|
update_ids.append((row["id"], derived))
|
||||||
|
|
||||||
|
if not update_ids:
|
||||||
|
print("Nothing to update — all rows already match.")
|
||||||
|
conn.close()
|
||||||
|
return 0
|
||||||
|
|
||||||
|
print(f"{len(update_ids):,} row(s) need updating:")
|
||||||
|
for (old, new), count in sorted(transitions.items(), key=lambda x: -x[1]):
|
||||||
|
print(f" {count:>6,} {old!r:14s} → {new!r}")
|
||||||
|
print()
|
||||||
|
|
||||||
|
if not args.apply:
|
||||||
|
print("(dry-run — re-run with --apply to write changes)")
|
||||||
|
conn.close()
|
||||||
|
return 0
|
||||||
|
|
||||||
|
print("Applying changes…")
|
||||||
|
cur.executemany(
|
||||||
|
"UPDATE events SET record_type = ? WHERE id = ?",
|
||||||
|
[(new, eid) for eid, new in update_ids],
|
||||||
|
)
|
||||||
|
conn.commit()
|
||||||
|
print(f"Done. Updated {cur.rowcount:,} row(s).")
|
||||||
|
conn.close()
|
||||||
|
return 0
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
sys.exit(main())
|
||||||
Executable
+100
@@ -0,0 +1,100 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Fire-and-forget Stop Monitoring loop — for wedged or constantly-triggering units.
|
||||||
|
#
|
||||||
|
# Hammers POST /device/stop_monitoring_blind in a tight loop. The endpoint
|
||||||
|
# opens TCP, dumps SESSION_RESET + a few copies of the SUB 0x97 frame, and
|
||||||
|
# closes — without ever reading an S3 response. Each TCP-won attempt is
|
||||||
|
# ~50ms of wire activity instead of the multi-frame handshake the regular
|
||||||
|
# rescue endpoint does, so windows that are too small for the full rescue
|
||||||
|
# can still land a stop-monitoring command.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./blind_stop.sh <host> [tcp_port]
|
||||||
|
#
|
||||||
|
# Env:
|
||||||
|
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
|
||||||
|
# Set to http://localhost:8001/api/sfm to route through
|
||||||
|
# Terra-View's proxy.
|
||||||
|
# MAX_ATTEMPTS Default: 600
|
||||||
|
# SLEEP_S Default: 0 (no backoff — hammer it)
|
||||||
|
# MAX_TIME_S Default: 15
|
||||||
|
# CONNECT_TIMEOUT Default: 5
|
||||||
|
# REPEAT Frames per TCP session (default 3 — increases hit rate
|
||||||
|
# if the device is busy reading its own buffer).
|
||||||
|
# STOP_ON_OK Default: 1. Set to 0 to keep hammering indefinitely
|
||||||
|
# even after successful sends (every 503 means the device
|
||||||
|
# is in *another* session, every 200 means our bytes got
|
||||||
|
# through — but the device may not have processed them).
|
||||||
|
|
||||||
|
set -u
|
||||||
|
|
||||||
|
host="${1:-}"
|
||||||
|
tcp_port="${2:-9034}"
|
||||||
|
if [[ -z "$host" ]]; then
|
||||||
|
echo "usage: $0 <host> [tcp_port]" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||||
|
max_attempts="${MAX_ATTEMPTS:-600}"
|
||||||
|
sleep_s="${SLEEP_S:-0}"
|
||||||
|
max_time_s="${MAX_TIME_S:-15}"
|
||||||
|
connect_timeout="${CONNECT_TIMEOUT:-5}"
|
||||||
|
repeat="${REPEAT:-3}"
|
||||||
|
stop_on_ok="${STOP_ON_OK:-1}"
|
||||||
|
|
||||||
|
url="${base}/device/stop_monitoring_blind?host=${host}&tcp_port=${tcp_port}&connect_timeout=${connect_timeout}&repeat=${repeat}"
|
||||||
|
|
||||||
|
echo "blind_stop: target ${host}:${tcp_port} connect_timeout=${connect_timeout}s repeat=${repeat}"
|
||||||
|
echo "blind_stop: POST ${url}"
|
||||||
|
echo "blind_stop: up to ${max_attempts} attempts, ${sleep_s}s between, ${max_time_s}s per request"
|
||||||
|
echo "blind_stop: stop_on_ok=${stop_on_ok}"
|
||||||
|
echo
|
||||||
|
|
||||||
|
ok_count=0
|
||||||
|
busy_count=0
|
||||||
|
err_count=0
|
||||||
|
started=$(date +%s)
|
||||||
|
|
||||||
|
for ((i=1; i<=max_attempts; i++)); do
|
||||||
|
printf "[%4d] %s " "$i" "$(date +%H:%M:%S)"
|
||||||
|
http_code=$(curl -sS -o /tmp/blind_resp.$$ -w "%{http_code}" \
|
||||||
|
--max-time "$max_time_s" \
|
||||||
|
-X POST "$url" || echo "000")
|
||||||
|
body=$(cat /tmp/blind_resp.$$ 2>/dev/null || true)
|
||||||
|
rm -f /tmp/blind_resp.$$
|
||||||
|
|
||||||
|
case "$http_code" in
|
||||||
|
200|201)
|
||||||
|
ok_count=$((ok_count + 1))
|
||||||
|
echo "SENT $body"
|
||||||
|
if [[ "$stop_on_ok" == "1" ]]; then
|
||||||
|
elapsed=$(( $(date +%s) - started ))
|
||||||
|
echo
|
||||||
|
echo "blind_stop: success after ${i} attempts (${elapsed}s). ok=${ok_count} busy=${busy_count} err=${err_count}"
|
||||||
|
echo "blind_stop: NEXT — wait ~10s, then try the full rescue:"
|
||||||
|
echo " /home/serversdown/seismo-relay/scripts/rescue_device.sh ${host} ${tcp_port}"
|
||||||
|
exit 0
|
||||||
|
fi
|
||||||
|
;;
|
||||||
|
503)
|
||||||
|
busy_count=$((busy_count + 1))
|
||||||
|
echo "busy (503)"
|
||||||
|
;;
|
||||||
|
000)
|
||||||
|
err_count=$((err_count + 1))
|
||||||
|
echo "curl error"
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
err_count=$((err_count + 1))
|
||||||
|
echo "HTTP $http_code $body" | head -c 400
|
||||||
|
echo
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
[[ "$sleep_s" != "0" ]] && sleep "$sleep_s"
|
||||||
|
done
|
||||||
|
|
||||||
|
elapsed=$(( $(date +%s) - started ))
|
||||||
|
echo
|
||||||
|
echo "blind_stop: gave up after ${max_attempts} attempts (${elapsed}s). ok=${ok_count} busy=${busy_count} err=${err_count}" >&2
|
||||||
|
exit 1
|
||||||
Executable
+99
@@ -0,0 +1,99 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Rescue an uncooperative MiniMate that's busy with another ACH session.
|
||||||
|
#
|
||||||
|
# Hammers POST /device/rescue in a tight loop with a short timeout. When the
|
||||||
|
# device is in an ACH session our SYN either gets refused or silently dropped
|
||||||
|
# (5s connect timeout inside the endpoint) and we retry immediately. When the
|
||||||
|
# device is between sessions, our TCP wins, the endpoint disables Auto Call
|
||||||
|
# Home and erases events inside the same session, then returns success.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./rescue_device.sh <host> [tcp_port] [--no-erase] [--no-disable-ach]
|
||||||
|
#
|
||||||
|
# Examples:
|
||||||
|
# ./rescue_device.sh 166.246.130.1 9034
|
||||||
|
# ./rescue_device.sh 166.246.130.1 9034 --no-erase # just silence it
|
||||||
|
#
|
||||||
|
# Environment:
|
||||||
|
# SFM_BASE_URL Defaults to http://localhost:8200 (SFM direct).
|
||||||
|
# Set to http://localhost:8001/api/sfm to route through
|
||||||
|
# Terra-View's proxy. Direct mode avoids the proxy's
|
||||||
|
# 60s timeout, which matters for long-running endpoints.
|
||||||
|
# MAX_ATTEMPTS Cap on retries (default 600 ≈ 30+ min).
|
||||||
|
# SLEEP_S Backoff between attempts (default 1).
|
||||||
|
# MAX_TIME_S Per-request timeout (default 60).
|
||||||
|
# CONNECT_TIMEOUT TCP connect timeout (default 5).
|
||||||
|
# RECV_TIMEOUT Per-frame S3 recv timeout (default 5). If POLL or any
|
||||||
|
# subsequent frame doesn't respond within this window, the
|
||||||
|
# rescue endpoint bails and this script retries.
|
||||||
|
|
||||||
|
set -u
|
||||||
|
|
||||||
|
host="${1:-}"
|
||||||
|
tcp_port="${2:-9034}"
|
||||||
|
shift 2 2>/dev/null || shift $# 2>/dev/null
|
||||||
|
|
||||||
|
if [[ -z "$host" ]]; then
|
||||||
|
echo "usage: $0 <host> [tcp_port] [--no-erase] [--no-disable-ach]" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
disable_ach="true"
|
||||||
|
erase="true"
|
||||||
|
for arg in "$@"; do
|
||||||
|
case "$arg" in
|
||||||
|
--no-erase) erase="false" ;;
|
||||||
|
--no-disable-ach) disable_ach="false" ;;
|
||||||
|
*) echo "unknown flag: $arg" >&2; exit 2 ;;
|
||||||
|
esac
|
||||||
|
done
|
||||||
|
|
||||||
|
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||||
|
max_attempts="${MAX_ATTEMPTS:-600}"
|
||||||
|
sleep_s="${SLEEP_S:-1}"
|
||||||
|
max_time_s="${MAX_TIME_S:-60}"
|
||||||
|
connect_timeout="${CONNECT_TIMEOUT:-5}"
|
||||||
|
recv_timeout="${RECV_TIMEOUT:-5}"
|
||||||
|
|
||||||
|
url="${base}/device/rescue?host=${host}&tcp_port=${tcp_port}&disable_ach=${disable_ach}&erase=${erase}&connect_timeout=${connect_timeout}&recv_timeout=${recv_timeout}"
|
||||||
|
|
||||||
|
echo "rescue: target ${host}:${tcp_port} disable_ach=${disable_ach} erase=${erase}"
|
||||||
|
echo "rescue: connect_timeout=${connect_timeout}s recv_timeout=${recv_timeout}s"
|
||||||
|
echo "rescue: POST ${url}"
|
||||||
|
echo "rescue: up to ${max_attempts} attempts, ${sleep_s}s between, ${max_time_s}s per request"
|
||||||
|
echo
|
||||||
|
|
||||||
|
started=$(date +%s)
|
||||||
|
for ((i=1; i<=max_attempts; i++)); do
|
||||||
|
printf "[%3d] %s " "$i" "$(date +%H:%M:%S)"
|
||||||
|
http_code=$(curl -sS -o /tmp/rescue_resp.$$ -w "%{http_code}" \
|
||||||
|
--max-time "$max_time_s" \
|
||||||
|
-X POST "$url" || echo "000")
|
||||||
|
body=$(cat /tmp/rescue_resp.$$ 2>/dev/null || true)
|
||||||
|
rm -f /tmp/rescue_resp.$$
|
||||||
|
|
||||||
|
case "$http_code" in
|
||||||
|
200|201)
|
||||||
|
elapsed=$(( $(date +%s) - started ))
|
||||||
|
echo "OK (${elapsed}s total)"
|
||||||
|
echo "$body"
|
||||||
|
exit 0
|
||||||
|
;;
|
||||||
|
503)
|
||||||
|
# Connection refused / timeout — device busy in another session. Retry fast.
|
||||||
|
echo "busy (503)"
|
||||||
|
;;
|
||||||
|
000)
|
||||||
|
echo "curl error (network)"
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "HTTP $http_code"
|
||||||
|
echo " $body" | head -c 400
|
||||||
|
echo
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
sleep "$sleep_s"
|
||||||
|
done
|
||||||
|
|
||||||
|
echo "rescue: gave up after ${max_attempts} attempts" >&2
|
||||||
|
exit 1
|
||||||
Executable
+44
@@ -0,0 +1,44 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Hold a single TCP session open and drip stop-monitoring frames at a slow
|
||||||
|
# rate, so the device's UART RX FIFO has time to drain between sends.
|
||||||
|
#
|
||||||
|
# Use when high-rate spam isn't landing — typically because the device's
|
||||||
|
# firmware is too busy to drain its serial buffer fast enough and bytes
|
||||||
|
# are being lost to UART overrun.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./slow_drip.sh <host> [tcp_port] [duration_s]
|
||||||
|
#
|
||||||
|
# Env:
|
||||||
|
# DURATION Default: 120 (seconds; arg 3 overrides). Clamped 1..600.
|
||||||
|
# INTERVAL Seconds between drip sends (default 3). Lower = more
|
||||||
|
# aggressive, more risk of FIFO overrun. Higher = safer
|
||||||
|
# but fewer total drips per duration.
|
||||||
|
# CONNECT_TIMEOUT Default: 5
|
||||||
|
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
|
||||||
|
|
||||||
|
set -u
|
||||||
|
|
||||||
|
host="${1:-}"
|
||||||
|
tcp_port="${2:-9034}"
|
||||||
|
duration="${3:-${DURATION:-120}}"
|
||||||
|
if [[ -z "$host" ]]; then
|
||||||
|
echo "usage: $0 <host> [tcp_port] [duration_s]" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||||
|
interval="${INTERVAL:-3}"
|
||||||
|
connect_timeout="${CONNECT_TIMEOUT:-5}"
|
||||||
|
|
||||||
|
url="${base}/device/stop_monitoring_slow_drip?host=${host}&tcp_port=${tcp_port}&duration_s=${duration}&interval_s=${interval}&connect_timeout=${connect_timeout}"
|
||||||
|
|
||||||
|
echo "slow_drip: target ${host}:${tcp_port} duration=${duration}s interval=${interval}s connect_timeout=${connect_timeout}s"
|
||||||
|
echo "slow_drip: POST ${url}"
|
||||||
|
echo
|
||||||
|
|
||||||
|
# Give curl enough slack to wait out the duration plus a buffer
|
||||||
|
max_time=$(awk -v d="$duration" 'BEGIN { printf "%d", d + 30 }')
|
||||||
|
|
||||||
|
curl -sS --max-time "$max_time" -X POST "$url"
|
||||||
|
echo
|
||||||
Executable
+48
@@ -0,0 +1,48 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Hammer a device with blind stop-monitoring sessions as fast as possible.
|
||||||
|
# Single HTTP call kicks off the burst inside SFM (no per-attempt HTTP
|
||||||
|
# overhead). Default: 10 seconds, ~500 ms per attempt = ~20 attempts/sec.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./spam_stop.sh <host> [tcp_port] [duration_s]
|
||||||
|
#
|
||||||
|
# Examples:
|
||||||
|
# ./spam_stop.sh 166.246.130.1 # 10s burst
|
||||||
|
# ./spam_stop.sh 166.246.130.1 9034 30 # 30s burst
|
||||||
|
# DURATION=60 CONNECT_TIMEOUT=0.2 ./spam_stop.sh 166.246.130.1
|
||||||
|
#
|
||||||
|
# Env:
|
||||||
|
# SFM_BASE_URL Default: http://localhost:8200 (SFM direct).
|
||||||
|
# Set to http://localhost:8001/api/sfm to route through
|
||||||
|
# Terra-View's proxy — but note the proxy has a 60s
|
||||||
|
# timeout, so long bursts need direct mode.
|
||||||
|
# DURATION Default: 10 (seconds; arg 3 overrides)
|
||||||
|
# CONNECT_TIMEOUT Default: 0.5 (seconds)
|
||||||
|
# REPEAT Default: 3 (stop frames per TCP session)
|
||||||
|
|
||||||
|
set -u
|
||||||
|
|
||||||
|
host="${1:-}"
|
||||||
|
tcp_port="${2:-9034}"
|
||||||
|
duration="${3:-${DURATION:-10}}"
|
||||||
|
|
||||||
|
if [[ -z "$host" ]]; then
|
||||||
|
echo "usage: $0 <host> [tcp_port] [duration_s]" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||||
|
connect_timeout="${CONNECT_TIMEOUT:-0.5}"
|
||||||
|
repeat="${REPEAT:-3}"
|
||||||
|
|
||||||
|
url="${base}/device/stop_monitoring_spam?host=${host}&tcp_port=${tcp_port}&duration_s=${duration}&connect_timeout=${connect_timeout}&repeat=${repeat}"
|
||||||
|
|
||||||
|
echo "spam_stop: target ${host}:${tcp_port} duration=${duration}s connect_timeout=${connect_timeout}s repeat=${repeat}"
|
||||||
|
echo "spam_stop: POST ${url}"
|
||||||
|
echo
|
||||||
|
|
||||||
|
# Give curl enough slack to wait out the duration plus a buffer
|
||||||
|
max_time=$(awk -v d="$duration" 'BEGIN { printf "%d", d + 10 }')
|
||||||
|
|
||||||
|
curl -sS --max-time "$max_time" -X POST "$url"
|
||||||
|
echo
|
||||||
Executable
+58
@@ -0,0 +1,58 @@
|
|||||||
|
#!/usr/bin/env bash
|
||||||
|
# Passive monitor for a misbehaving unit. Every INTERVAL seconds, attempts
|
||||||
|
# a single short TCP probe + storage_range read and logs the result. Designed
|
||||||
|
# to run unattended for hours/days and tell you when the unit comes back.
|
||||||
|
#
|
||||||
|
# Usage:
|
||||||
|
# ./watch_unit.sh <host> [tcp_port]
|
||||||
|
#
|
||||||
|
# Env:
|
||||||
|
# INTERVAL Seconds between checks (default 300 = 5 min)
|
||||||
|
# LOG_FILE Append results here (default /tmp/watch_<host>.log)
|
||||||
|
# SFM_BASE_URL Default: http://localhost:8200
|
||||||
|
|
||||||
|
set -u
|
||||||
|
|
||||||
|
host="${1:-}"
|
||||||
|
tcp_port="${2:-9034}"
|
||||||
|
if [[ -z "$host" ]]; then
|
||||||
|
echo "usage: $0 <host> [tcp_port]" >&2
|
||||||
|
exit 2
|
||||||
|
fi
|
||||||
|
|
||||||
|
interval="${INTERVAL:-300}"
|
||||||
|
log_file="${LOG_FILE:-/tmp/watch_${host}.log}"
|
||||||
|
base="${SFM_BASE_URL:-http://localhost:8200}"
|
||||||
|
|
||||||
|
url="${base}/device/events/storage_range?host=${host}&tcp_port=${tcp_port}"
|
||||||
|
|
||||||
|
echo "watch_unit: target ${host}:${tcp_port} interval=${interval}s log=${log_file}"
|
||||||
|
echo "watch_unit: Ctrl-C to stop"
|
||||||
|
|
||||||
|
while true; do
|
||||||
|
ts=$(date '+%Y-%m-%d %H:%M:%S')
|
||||||
|
http_code=$(curl -sS -o /tmp/watch_resp.$$ -w "%{http_code}" \
|
||||||
|
--max-time 20 "$url" || echo "000")
|
||||||
|
body=$(cat /tmp/watch_resp.$$ 2>/dev/null || true)
|
||||||
|
rm -f /tmp/watch_resp.$$
|
||||||
|
|
||||||
|
case "$http_code" in
|
||||||
|
200|201)
|
||||||
|
# Strip the raw_hex for readability
|
||||||
|
summary=$(echo "$body" | sed 's/"raw_hex":"[^"]*",*//; s/,*$//' | head -c 200)
|
||||||
|
echo "$ts REACHABLE $summary" | tee -a "$log_file"
|
||||||
|
;;
|
||||||
|
502|503)
|
||||||
|
err=$(echo "$body" | head -c 150)
|
||||||
|
echo "$ts ERROR_$http_code $err" | tee -a "$log_file"
|
||||||
|
;;
|
||||||
|
000)
|
||||||
|
echo "$ts CURL_FAIL (network/timeout)" | tee -a "$log_file"
|
||||||
|
;;
|
||||||
|
*)
|
||||||
|
echo "$ts HTTP_$http_code $(echo "$body" | head -c 150)" | tee -a "$log_file"
|
||||||
|
;;
|
||||||
|
esac
|
||||||
|
|
||||||
|
sleep "$interval"
|
||||||
|
done
|
||||||
@@ -491,6 +491,75 @@ class SeismoDb:
|
|||||||
)
|
)
|
||||||
return cur.rowcount > 0
|
return cur.rowcount > 0
|
||||||
|
|
||||||
|
def delete_event(self, event_id: str) -> Optional[dict]:
|
||||||
|
"""
|
||||||
|
Hard-delete one event row by id. Returns the deleted row (so the
|
||||||
|
caller can clean up any on-disk files referenced by it) or None
|
||||||
|
if no row matched.
|
||||||
|
"""
|
||||||
|
with self._connect() as conn:
|
||||||
|
row = conn.execute(
|
||||||
|
"SELECT * FROM events WHERE id = ?", (event_id,),
|
||||||
|
).fetchone()
|
||||||
|
if row is None:
|
||||||
|
return None
|
||||||
|
conn.execute("DELETE FROM events WHERE id = ?", (event_id,))
|
||||||
|
return dict(row)
|
||||||
|
|
||||||
|
def delete_events_bulk(
|
||||||
|
self,
|
||||||
|
serial: Optional[str] = None,
|
||||||
|
from_dt: Optional[datetime.datetime] = None,
|
||||||
|
to_dt: Optional[datetime.datetime] = None,
|
||||||
|
false_trigger: Optional[bool] = None,
|
||||||
|
ids: Optional[list[str]] = None,
|
||||||
|
) -> list[dict]:
|
||||||
|
"""
|
||||||
|
Hard-delete events matching the given filters. Returns the list
|
||||||
|
of deleted row dicts. Refuses to delete with no filters at all
|
||||||
|
(would wipe the whole table) — raises ValueError.
|
||||||
|
|
||||||
|
Filter semantics match query_events: serial / from_dt / to_dt /
|
||||||
|
false_trigger combine with AND. `ids` is an additional inclusion
|
||||||
|
list (event_id IN (...)); if supplied alongside other filters,
|
||||||
|
only rows matching all conditions are deleted.
|
||||||
|
"""
|
||||||
|
clauses: list[str] = []
|
||||||
|
params: list = []
|
||||||
|
|
||||||
|
if serial:
|
||||||
|
clauses.append("serial = ?")
|
||||||
|
params.append(serial)
|
||||||
|
if from_dt:
|
||||||
|
clauses.append("timestamp >= ?")
|
||||||
|
params.append(from_dt.isoformat())
|
||||||
|
if to_dt:
|
||||||
|
clauses.append("timestamp <= ?")
|
||||||
|
params.append(to_dt.isoformat())
|
||||||
|
if false_trigger is not None:
|
||||||
|
clauses.append("false_trigger = ?")
|
||||||
|
params.append(1 if false_trigger else 0)
|
||||||
|
if ids:
|
||||||
|
placeholders = ",".join("?" * len(ids))
|
||||||
|
clauses.append(f"id IN ({placeholders})")
|
||||||
|
params.extend(ids)
|
||||||
|
|
||||||
|
if not clauses:
|
||||||
|
raise ValueError(
|
||||||
|
"delete_events_bulk refuses to delete with no filters "
|
||||||
|
"(would wipe the entire events table)"
|
||||||
|
)
|
||||||
|
|
||||||
|
where = "WHERE " + " AND ".join(clauses)
|
||||||
|
|
||||||
|
with self._connect() as conn:
|
||||||
|
rows = conn.execute(
|
||||||
|
f"SELECT * FROM events {where}", params,
|
||||||
|
).fetchall()
|
||||||
|
if rows:
|
||||||
|
conn.execute(f"DELETE FROM events {where}", params)
|
||||||
|
return [dict(r) for r in rows]
|
||||||
|
|
||||||
def update_event_review(self, event_id: str, review: dict) -> bool:
|
def update_event_review(self, event_id: str, review: dict) -> bool:
|
||||||
"""
|
"""
|
||||||
Sync derived index columns from a sidecar's `review` block.
|
Sync derived index columns from a sidecar's `review` block.
|
||||||
|
|||||||
+729
-11
@@ -36,6 +36,7 @@ from __future__ import annotations
|
|||||||
|
|
||||||
import datetime
|
import datetime
|
||||||
import logging
|
import logging
|
||||||
|
import socket
|
||||||
import sys
|
import sys
|
||||||
import tempfile
|
import tempfile
|
||||||
import threading
|
import threading
|
||||||
@@ -63,7 +64,9 @@ from minimateplus.protocol import ProtocolError
|
|||||||
from minimateplus.models import CallHomeConfig, ComplianceConfig, DeviceInfo, Event, PeakValues, ProjectInfo, Timestamp
|
from minimateplus.models import CallHomeConfig, ComplianceConfig, DeviceInfo, Event, PeakValues, ProjectInfo, Timestamp
|
||||||
from minimateplus.transport import TcpTransport, DEFAULT_TCP_PORT
|
from minimateplus.transport import TcpTransport, DEFAULT_TCP_PORT
|
||||||
from minimateplus.blastware_file import write_blastware_file, blastware_filename
|
from minimateplus.blastware_file import write_blastware_file, blastware_filename
|
||||||
from minimateplus.client import _decode_a5_metadata_into, _decode_a5_waveform
|
from minimateplus.client import _decode_a5_metadata_into, _decode_a5_waveform, _decode_event_count
|
||||||
|
from minimateplus.framing import build_bw_write_frame, SESSION_RESET, POLL_PROBE, POLL_DATA
|
||||||
|
from minimateplus.protocol import SUB_STOP_MONITORING
|
||||||
from sfm import event_hdf5
|
from sfm import event_hdf5
|
||||||
from sfm.cache import SFMCache, get_cache
|
from sfm.cache import SFMCache, get_cache
|
||||||
from sfm.database import SeismoDb
|
from sfm.database import SeismoDb
|
||||||
@@ -86,7 +89,7 @@ app = FastAPI(
|
|||||||
"Implements the minimateplus RS-232 protocol library.\n"
|
"Implements the minimateplus RS-232 protocol library.\n"
|
||||||
"Proxied by terra-view at /api/sfm/*."
|
"Proxied by terra-view at /api/sfm/*."
|
||||||
),
|
),
|
||||||
version="0.1.0",
|
version="0.17.0",
|
||||||
)
|
)
|
||||||
|
|
||||||
# Allow requests from the waveform viewer opened as a local file (file://)
|
# Allow requests from the waveform viewer opened as a local file (file://)
|
||||||
@@ -268,7 +271,8 @@ def _build_client(
|
|||||||
baud: int,
|
baud: int,
|
||||||
host: Optional[str],
|
host: Optional[str],
|
||||||
tcp_port: int,
|
tcp_port: int,
|
||||||
timeout: float = 30.0,
|
timeout: float = 10.0,
|
||||||
|
connect_timeout: Optional[float] = None,
|
||||||
) -> MiniMateClient:
|
) -> MiniMateClient:
|
||||||
"""
|
"""
|
||||||
Return a MiniMateClient configured for either serial or TCP transport.
|
Return a MiniMateClient configured for either serial or TCP transport.
|
||||||
@@ -276,12 +280,24 @@ def _build_client(
|
|||||||
TCP takes priority if *host* is supplied; otherwise *port* (serial) is used.
|
TCP takes priority if *host* is supplied; otherwise *port* (serial) is used.
|
||||||
Raises HTTPException(422) if neither is provided.
|
Raises HTTPException(422) if neither is provided.
|
||||||
|
|
||||||
|
Default *timeout* is 10s — the device usually responds in well under a
|
||||||
|
second over cellular; 10s leaves comfortable headroom for retransmits
|
||||||
|
while still failing reasonably fast when a unit is wedged.
|
||||||
|
|
||||||
Use timeout=120.0 (or higher) for endpoints that perform a full 5A waveform
|
Use timeout=120.0 (or higher) for endpoints that perform a full 5A waveform
|
||||||
download — a 70-second event at 1024 sps takes 2-3 minutes to transfer over
|
download — a 70-second event at 1024 sps takes 2-3 minutes to transfer over
|
||||||
cellular and each individual recv must complete within the timeout window.
|
cellular and each individual recv must complete within the timeout window.
|
||||||
|
|
||||||
|
*connect_timeout* (TCP only) overrides the TcpTransport default (10s) for
|
||||||
|
the initial TCP SYN/handshake. Use a small value (e.g. 5s) in rescue/race
|
||||||
|
scenarios where the device is busy in another session and you want to
|
||||||
|
fail fast and retry quickly.
|
||||||
"""
|
"""
|
||||||
if host:
|
if host:
|
||||||
transport = TcpTransport(host, port=tcp_port)
|
if connect_timeout is not None:
|
||||||
|
transport = TcpTransport(host, port=tcp_port, connect_timeout=connect_timeout)
|
||||||
|
else:
|
||||||
|
transport = TcpTransport(host, port=tcp_port)
|
||||||
log.debug("TCP transport: %s:%d timeout=%.0fs", host, tcp_port, timeout)
|
log.debug("TCP transport: %s:%d timeout=%.0fs", host, tcp_port, timeout)
|
||||||
return MiniMateClient(transport=transport, timeout=timeout)
|
return MiniMateClient(transport=transport, timeout=timeout)
|
||||||
elif port:
|
elif port:
|
||||||
@@ -1095,13 +1111,23 @@ def device_monitor_status(
|
|||||||
cached["_cached"] = True
|
cached["_cached"] = True
|
||||||
return cached
|
return cached
|
||||||
|
|
||||||
with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
|
try:
|
||||||
try:
|
with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
|
||||||
client.poll()
|
try:
|
||||||
except Exception as exc:
|
client.poll()
|
||||||
log.warning("monitor status poll retry: %s", exc)
|
except Exception as exc:
|
||||||
client.poll()
|
log.warning("monitor status poll retry: %s", exc)
|
||||||
status = client.get_monitor_status()
|
client.poll()
|
||||||
|
status = client.get_monitor_status()
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except ProtocolError as exc:
|
||||||
|
# Includes minimateplus.protocol.TimeoutError ("device unresponsive").
|
||||||
|
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
|
||||||
|
except OSError as exc:
|
||||||
|
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
|
||||||
|
except Exception as exc:
|
||||||
|
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
|
||||||
|
|
||||||
result: dict = {"is_monitoring": status.is_monitoring}
|
result: dict = {"is_monitoring": status.is_monitoring}
|
||||||
if status.battery_v is not None:
|
if status.battery_v is not None:
|
||||||
@@ -1117,6 +1143,529 @@ def device_monitor_status(
|
|||||||
return result
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/device/events/storage_range")
|
||||||
|
def device_events_storage_range(
|
||||||
|
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||||
|
baud: int = Query(38400, description="Serial baud rate"),
|
||||||
|
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
|
||||||
|
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||||
|
) -> dict:
|
||||||
|
"""
|
||||||
|
Read the device's event storage range (SUB 0x06) — first and last
|
||||||
|
stored event keys. POLL handshake + one read; no connect(), no
|
||||||
|
config reads, no event walk. Completes in ~2 seconds.
|
||||||
|
|
||||||
|
Useful for checking whether the device has any stored events
|
||||||
|
without invoking the slow count_events() 1E/1F chain. Both keys =
|
||||||
|
`01110000` means the device is empty.
|
||||||
|
"""
|
||||||
|
log.info("GET /device/events/storage_range host=%s tcp_port=%s", host, tcp_port)
|
||||||
|
try:
|
||||||
|
def _do():
|
||||||
|
with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
|
||||||
|
try:
|
||||||
|
client.poll()
|
||||||
|
except Exception as exc:
|
||||||
|
log.warning("storage_range poll retry: %s", exc)
|
||||||
|
client.poll()
|
||||||
|
proto = client._require_proto()
|
||||||
|
return proto.read_event_storage_range()
|
||||||
|
rng = _run_with_retry(_do, is_tcp=_is_tcp(host))
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except ProtocolError as exc:
|
||||||
|
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
|
||||||
|
except OSError as exc:
|
||||||
|
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
|
||||||
|
except Exception as exc:
|
||||||
|
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
|
||||||
|
|
||||||
|
data = bytes(rng.data)
|
||||||
|
result: dict = {"raw_len": len(data), "raw_hex": data.hex()}
|
||||||
|
if len(data) >= 8:
|
||||||
|
first_key = data[-8:-4].hex()
|
||||||
|
last_key = data[-4:].hex()
|
||||||
|
result["first_key"] = first_key
|
||||||
|
result["last_key"] = last_key
|
||||||
|
result["is_empty"] = (first_key == "01110000" and last_key == "01110000")
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
@app.get("/device/events/index")
|
||||||
|
def device_events_index(
|
||||||
|
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||||
|
baud: int = Query(38400, description="Serial baud rate"),
|
||||||
|
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
|
||||||
|
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||||
|
) -> dict:
|
||||||
|
"""
|
||||||
|
Read the device's event index (SUB 0x08) — returns the lifetime
|
||||||
|
event counter at data[10:12] (uint16 BE). POLL handshake + one
|
||||||
|
read; no connect(), no config reads, no event walk. ~2 seconds.
|
||||||
|
|
||||||
|
Note: this is a LIFETIME counter (events ever recorded) — it does
|
||||||
|
NOT decrement when events are erased. After an erase, the device
|
||||||
|
counter resets to 0 only on the next recorded event. For "are
|
||||||
|
there stored events right now?" use /device/events/storage_range
|
||||||
|
instead.
|
||||||
|
"""
|
||||||
|
log.info("GET /device/events/index host=%s tcp_port=%s", host, tcp_port)
|
||||||
|
try:
|
||||||
|
def _do():
|
||||||
|
with _build_client(port=port, baud=baud, host=host, tcp_port=tcp_port) as client:
|
||||||
|
try:
|
||||||
|
client.poll()
|
||||||
|
except Exception as exc:
|
||||||
|
log.warning("event_index poll retry: %s", exc)
|
||||||
|
client.poll()
|
||||||
|
proto = client._require_proto()
|
||||||
|
return proto.read_event_index()
|
||||||
|
idx_raw = _run_with_retry(_do, is_tcp=_is_tcp(host))
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except ProtocolError as exc:
|
||||||
|
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
|
||||||
|
except OSError as exc:
|
||||||
|
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
|
||||||
|
except Exception as exc:
|
||||||
|
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
|
||||||
|
|
||||||
|
raw = bytes(idx_raw)
|
||||||
|
result: dict = {"raw_len": len(raw), "raw_hex": raw.hex()}
|
||||||
|
try:
|
||||||
|
result["lifetime_count"] = _decode_event_count(raw)
|
||||||
|
except Exception as exc:
|
||||||
|
result["decode_error"] = str(exc)
|
||||||
|
return result
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/device/events/erase")
|
||||||
|
def device_events_erase(
|
||||||
|
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||||
|
baud: int = Query(38400, description="Serial baud rate"),
|
||||||
|
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
|
||||||
|
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||||
|
) -> dict:
|
||||||
|
"""
|
||||||
|
Erase ALL stored events from the device memory.
|
||||||
|
|
||||||
|
Sequence: SUB 0xA3 → 0x1C → 0x06 → 0xA2 (confirmed 2026-04-11).
|
||||||
|
After this call the unit's event memory is empty and event keys reset
|
||||||
|
to 0x01110000. The device returns to its normal operating state
|
||||||
|
automatically — no restart-monitoring call is needed.
|
||||||
|
|
||||||
|
Note: this endpoint does NOT touch the ACH server's `ach_state.json`.
|
||||||
|
If a call-home subsequently lands on the ACH server, its post-erase
|
||||||
|
detection logic (max(device_keys) vs max_downloaded_key) handles the
|
||||||
|
key-counter rollback.
|
||||||
|
"""
|
||||||
|
log.info("POST /device/events/erase port=%s host=%s tcp_port=%s", port, host, tcp_port)
|
||||||
|
|
||||||
|
try:
|
||||||
|
def _do():
|
||||||
|
with _build_client(port, baud, host, tcp_port) as client:
|
||||||
|
client.connect()
|
||||||
|
client.delete_all_events()
|
||||||
|
_run_with_retry(_do, is_tcp=_is_tcp(host))
|
||||||
|
except HTTPException:
|
||||||
|
raise
|
||||||
|
except ProtocolError as exc:
|
||||||
|
raise HTTPException(status_code=502, detail=f"Protocol error: {exc}") from exc
|
||||||
|
except OSError as exc:
|
||||||
|
raise HTTPException(status_code=502, detail=f"Connection error: {exc}") from exc
|
||||||
|
except Exception as exc:
|
||||||
|
raise HTTPException(status_code=500, detail=f"Device error: {exc}") from exc
|
||||||
|
|
||||||
|
conn_key = SFMCache.make_conn_key(host, tcp_port, port, baud)
|
||||||
|
cleared = get_cache().clear_device(conn_key)
|
||||||
|
return {
|
||||||
|
"status": "ok",
|
||||||
|
"message": "Device event memory cleared",
|
||||||
|
"cache_cleared": cleared,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/device/stop_monitoring_blind")
|
||||||
|
def device_stop_monitoring_blind(
|
||||||
|
host: str = Query(..., description="TCP host — modem IP"),
|
||||||
|
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||||
|
connect_timeout: float = Query(5.0, description="TCP connect timeout in seconds (default 5)"),
|
||||||
|
repeat: int = Query(3, description="How many times to send the frame within one TCP session (default 3)"),
|
||||||
|
) -> dict:
|
||||||
|
"""
|
||||||
|
Fire-and-forget Stop Monitoring (SUB 0x97). TCP-only.
|
||||||
|
|
||||||
|
Opens a TCP session, dumps the FULL handshake the device's protocol
|
||||||
|
state machine expects — `SESSION_RESET + POLL_PROBE + SESSION_RESET +
|
||||||
|
POLL_DATA` — and then N back-to-back copies of the stop-monitoring
|
||||||
|
frame. Does NOT read any S3 response. Succeeds as long as the bytes
|
||||||
|
left the socket.
|
||||||
|
|
||||||
|
The POLL handshake bytes are required: monitoring units ignore command
|
||||||
|
frames received without a preceding POLL exchange. Sending the POLL
|
||||||
|
bytes "blind" (without reading the responses) still works because the
|
||||||
|
device processes inbound bytes in order regardless of whether we drain
|
||||||
|
its outbound buffer.
|
||||||
|
|
||||||
|
Idempotent: the device processes extra copies of SUB 0x97 the same as
|
||||||
|
one (already-stopped is a no-op).
|
||||||
|
|
||||||
|
Returns the number of bytes sent. A 503 means the TCP connect failed
|
||||||
|
(device busy in another session — caller should retry).
|
||||||
|
"""
|
||||||
|
log.info(
|
||||||
|
"POST /device/stop_monitoring_blind host=%s tcp_port=%s connect_timeout=%.1fs repeat=%d",
|
||||||
|
host, tcp_port, connect_timeout, repeat,
|
||||||
|
)
|
||||||
|
if repeat < 1:
|
||||||
|
repeat = 1
|
||||||
|
|
||||||
|
frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
|
||||||
|
payload = (
|
||||||
|
SESSION_RESET + POLL_PROBE
|
||||||
|
+ SESSION_RESET + POLL_DATA
|
||||||
|
+ (frame * repeat)
|
||||||
|
)
|
||||||
|
t0 = time.monotonic()
|
||||||
|
|
||||||
|
transport = TcpTransport(host, port=tcp_port, connect_timeout=connect_timeout)
|
||||||
|
try:
|
||||||
|
transport.connect()
|
||||||
|
except OSError as exc:
|
||||||
|
raise HTTPException(status_code=503, detail=f"Connection error: {exc}") from exc
|
||||||
|
|
||||||
|
try:
|
||||||
|
transport.write(payload)
|
||||||
|
except OSError as exc:
|
||||||
|
transport.disconnect()
|
||||||
|
raise HTTPException(status_code=502, detail=f"Send error: {exc}") from exc
|
||||||
|
finally:
|
||||||
|
transport.disconnect()
|
||||||
|
|
||||||
|
return {
|
||||||
|
"status": "sent",
|
||||||
|
"bytes_sent": len(payload),
|
||||||
|
"frame_size": len(frame),
|
||||||
|
"repeat": repeat,
|
||||||
|
"elapsed_s": round(time.monotonic() - t0, 3),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/device/stop_monitoring_slow_drip")
|
||||||
|
def device_stop_monitoring_slow_drip(
|
||||||
|
host: str = Query(..., description="TCP host — modem IP"),
|
||||||
|
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||||
|
duration_s: float = Query(120.0, description="Total time to hold the session open (seconds)"),
|
||||||
|
interval_s: float = Query(3.0, description="Seconds between drip sends"),
|
||||||
|
connect_timeout: float = Query(5.0, description="TCP connect timeout"),
|
||||||
|
) -> dict:
|
||||||
|
"""
|
||||||
|
Hold a single TCP session open for *duration_s* seconds and drip
|
||||||
|
stop-monitoring frames into the device at a slow rate so its UART
|
||||||
|
RX FIFO has time to drain between sends.
|
||||||
|
|
||||||
|
Sequence:
|
||||||
|
1. Open TCP session.
|
||||||
|
2. Send the wake preamble: SESSION_RESET + POLL_PROBE +
|
||||||
|
SESSION_RESET + POLL_DATA (so the device's protocol parser
|
||||||
|
is primed for a write command).
|
||||||
|
3. Wait interval_s for the device to drain.
|
||||||
|
4. Drip-send (SESSION_RESET + stop_monitoring_frame) every
|
||||||
|
interval_s until duration_s elapses.
|
||||||
|
5. Opportunistically drain any bytes the device sends back (so
|
||||||
|
the modem's TX queue doesn't fill up). Successful drains are
|
||||||
|
counted in `bytes_received` — non-zero strongly suggests the
|
||||||
|
device has started responding to us.
|
||||||
|
6. Close.
|
||||||
|
|
||||||
|
Designed for units whose firmware is too busy with event-recording
|
||||||
|
to keep up with high-rate spam. Heavy spam overruns the UART FIFO;
|
||||||
|
slow drip stays under it.
|
||||||
|
|
||||||
|
Compared to spam mode: ~40× fewer bytes/sec on the wire, but each
|
||||||
|
byte has a much higher chance of actually being parsed.
|
||||||
|
"""
|
||||||
|
log.info(
|
||||||
|
"POST /device/stop_monitoring_slow_drip host=%s tcp_port=%s duration=%.1fs interval=%.2fs connect_timeout=%.1fs",
|
||||||
|
host, tcp_port, duration_s, interval_s, connect_timeout,
|
||||||
|
)
|
||||||
|
duration_s = max(1.0, min(duration_s, 600.0)) # clamp 1s..10min
|
||||||
|
interval_s = max(0.1, min(interval_s, 30.0))
|
||||||
|
connect_timeout = max(0.1, connect_timeout)
|
||||||
|
|
||||||
|
stop_frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
|
||||||
|
preamble = (
|
||||||
|
SESSION_RESET + POLL_PROBE
|
||||||
|
+ SESSION_RESET + POLL_DATA
|
||||||
|
)
|
||||||
|
|
||||||
|
t0 = time.monotonic()
|
||||||
|
drips_sent = 0
|
||||||
|
bytes_sent = 0
|
||||||
|
bytes_received = 0
|
||||||
|
|
||||||
|
try:
|
||||||
|
sock = socket.create_connection((host, tcp_port), timeout=connect_timeout)
|
||||||
|
except OSError as exc:
|
||||||
|
raise HTTPException(status_code=503, detail=f"Connection error: {exc}") from exc
|
||||||
|
|
||||||
|
# Short read timeout so opportunistic drains don't block.
|
||||||
|
sock.settimeout(0.1)
|
||||||
|
|
||||||
|
try:
|
||||||
|
# Initial wake preamble.
|
||||||
|
try:
|
||||||
|
sock.sendall(preamble)
|
||||||
|
bytes_sent += len(preamble)
|
||||||
|
except OSError as exc:
|
||||||
|
raise HTTPException(status_code=502, detail=f"Preamble send failed: {exc}") from exc
|
||||||
|
|
||||||
|
# Initial settle.
|
||||||
|
time.sleep(interval_s)
|
||||||
|
|
||||||
|
# Try a non-blocking drain of any response to the wake.
|
||||||
|
try:
|
||||||
|
data = sock.recv(4096)
|
||||||
|
if data:
|
||||||
|
bytes_received += len(data)
|
||||||
|
log.info("slow_drip: device responded to wake preamble (%d bytes)", len(data))
|
||||||
|
except socket.timeout:
|
||||||
|
pass
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
deadline = t0 + duration_s
|
||||||
|
drip = SESSION_RESET + stop_frame # 2 + 21 = 23 bytes per drip
|
||||||
|
send_error: Optional[str] = None
|
||||||
|
|
||||||
|
while time.monotonic() < deadline:
|
||||||
|
try:
|
||||||
|
sock.sendall(drip)
|
||||||
|
bytes_sent += len(drip)
|
||||||
|
drips_sent += 1
|
||||||
|
except OSError as exc:
|
||||||
|
send_error = f"{exc}"
|
||||||
|
log.warning("slow_drip: send failed after %d drips: %s", drips_sent, exc)
|
||||||
|
break
|
||||||
|
|
||||||
|
# Drain any inbound bytes; ignore timeouts.
|
||||||
|
try:
|
||||||
|
data = sock.recv(4096)
|
||||||
|
if data:
|
||||||
|
bytes_received += len(data)
|
||||||
|
except socket.timeout:
|
||||||
|
pass
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
# Sleep the interval, but don't oversleep past the deadline.
|
||||||
|
remaining = deadline - time.monotonic()
|
||||||
|
if remaining <= 0:
|
||||||
|
break
|
||||||
|
time.sleep(min(interval_s, remaining))
|
||||||
|
finally:
|
||||||
|
try:
|
||||||
|
sock.shutdown(socket.SHUT_RDWR)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
try:
|
||||||
|
sock.close()
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
elapsed = time.monotonic() - t0
|
||||||
|
log.info(
|
||||||
|
"slow_drip done — drips=%d bytes_sent=%d bytes_received=%d in %.1fs",
|
||||||
|
drips_sent, bytes_sent, bytes_received, elapsed,
|
||||||
|
)
|
||||||
|
return {
|
||||||
|
"status": "done",
|
||||||
|
"duration_s": round(elapsed, 2),
|
||||||
|
"drips_sent": drips_sent,
|
||||||
|
"bytes_sent": bytes_sent,
|
||||||
|
"bytes_received": bytes_received,
|
||||||
|
"preamble_bytes": len(preamble),
|
||||||
|
"drip_bytes": len(drip),
|
||||||
|
"send_error": send_error,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/device/stop_monitoring_spam")
|
||||||
|
def device_stop_monitoring_spam(
|
||||||
|
host: str = Query(..., description="TCP host — modem IP"),
|
||||||
|
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||||
|
duration_s: float = Query(10.0, description="How long to hammer the device for (seconds)"),
|
||||||
|
connect_timeout: float = Query(0.5, description="Per-attempt TCP connect timeout (default 0.5s)"),
|
||||||
|
repeat: int = Query(3, description="Stop frames per TCP session (default 3)"),
|
||||||
|
) -> dict:
|
||||||
|
"""
|
||||||
|
Hammer the device with blind stop-monitoring sessions as fast as
|
||||||
|
possible for `duration_s` seconds. Each attempt: open TCP → write
|
||||||
|
SESSION_RESET + POLL handshake + STOP frames × repeat → close. No
|
||||||
|
response is read.
|
||||||
|
|
||||||
|
Designed for units that are aggressively calling home — short
|
||||||
|
connect_timeout (default 500 ms) means every failed attempt loses
|
||||||
|
only that much time before retrying, so we can fit several attempts
|
||||||
|
per second even when the modem is mostly busy with its own outbound
|
||||||
|
sessions.
|
||||||
|
|
||||||
|
Single HTTP call kicks off the whole burst; counters are returned
|
||||||
|
when it finishes. No streaming; if you want live progress, watch
|
||||||
|
SFM logs.
|
||||||
|
"""
|
||||||
|
log.info(
|
||||||
|
"POST /device/stop_monitoring_spam host=%s tcp_port=%s duration=%.1fs connect_timeout=%.3fs repeat=%d",
|
||||||
|
host, tcp_port, duration_s, connect_timeout, repeat,
|
||||||
|
)
|
||||||
|
if repeat < 1:
|
||||||
|
repeat = 1
|
||||||
|
duration_s = max(0.1, min(duration_s, 300.0)) # clamp 0.1s..5min
|
||||||
|
connect_timeout = max(0.05, connect_timeout)
|
||||||
|
|
||||||
|
frame = build_bw_write_frame(SUB_STOP_MONITORING, b"")
|
||||||
|
payload = (
|
||||||
|
SESSION_RESET + POLL_PROBE
|
||||||
|
+ SESSION_RESET + POLL_DATA
|
||||||
|
+ (frame * repeat)
|
||||||
|
)
|
||||||
|
|
||||||
|
t0 = time.monotonic()
|
||||||
|
deadline = t0 + duration_s
|
||||||
|
sent_ok = 0
|
||||||
|
connect_failed = 0
|
||||||
|
write_failed = 0
|
||||||
|
|
||||||
|
while time.monotonic() < deadline:
|
||||||
|
try:
|
||||||
|
sock = socket.create_connection((host, tcp_port), timeout=connect_timeout)
|
||||||
|
except OSError:
|
||||||
|
connect_failed += 1
|
||||||
|
continue
|
||||||
|
try:
|
||||||
|
sock.sendall(payload)
|
||||||
|
sent_ok += 1
|
||||||
|
except OSError:
|
||||||
|
write_failed += 1
|
||||||
|
finally:
|
||||||
|
try:
|
||||||
|
sock.shutdown(socket.SHUT_RDWR)
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
try:
|
||||||
|
sock.close()
|
||||||
|
except OSError:
|
||||||
|
pass
|
||||||
|
|
||||||
|
elapsed = time.monotonic() - t0
|
||||||
|
total = sent_ok + connect_failed + write_failed
|
||||||
|
log.info(
|
||||||
|
"stop_monitoring_spam done — sent=%d connect_failed=%d write_failed=%d in %.2fs",
|
||||||
|
sent_ok, connect_failed, write_failed, elapsed,
|
||||||
|
)
|
||||||
|
return {
|
||||||
|
"status": "done",
|
||||||
|
"duration_s": round(elapsed, 2),
|
||||||
|
"sent_ok": sent_ok,
|
||||||
|
"connect_failed": connect_failed,
|
||||||
|
"write_failed": write_failed,
|
||||||
|
"total_attempts": total,
|
||||||
|
"rate_attempts_per_s": round(total / elapsed, 1) if elapsed > 0 else 0,
|
||||||
|
"payload_bytes": len(payload),
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/device/rescue")
|
||||||
|
def device_rescue(
|
||||||
|
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||||
|
baud: int = Query(38400, description="Serial baud rate"),
|
||||||
|
host: Optional[str] = Query(None, description="TCP host — modem IP or ACH relay"),
|
||||||
|
tcp_port: int = Query(DEFAULT_TCP_PORT, description=f"TCP port (default {DEFAULT_TCP_PORT})"),
|
||||||
|
connect_timeout: float = Query(5.0, description="TCP connect timeout in seconds (default 5)"),
|
||||||
|
recv_timeout: float = Query(5.0, description="Per-frame S3 recv timeout in seconds (default 5)"),
|
||||||
|
disable_ach: bool = Query(True, description="Disable Auto Call Home on the device before erasing"),
|
||||||
|
erase: bool = Query(True, description="Erase all stored events after disabling ACH"),
|
||||||
|
) -> dict:
|
||||||
|
"""
|
||||||
|
Rescue an uncooperative unit by squeezing all maintenance work into a
|
||||||
|
single TCP session.
|
||||||
|
|
||||||
|
Designed for devices that are actively calling home to a separate ACH
|
||||||
|
server (BW or otherwise). While we hold this TCP session open the
|
||||||
|
modem cannot accept an inbound ACH call, so the order matters:
|
||||||
|
|
||||||
|
1. Short-timeout TCP connect (fails fast if the device is busy in
|
||||||
|
another session — the caller should retry in a tight loop).
|
||||||
|
2. POLL handshake.
|
||||||
|
3. (optional) Write call_home config with auto_call_home_enabled=false
|
||||||
|
so the device stops calling out even after we drop the session.
|
||||||
|
4. (optional) Erase all stored events (0xA3 → 0x1C → 0x06 → 0xA2).
|
||||||
|
5. Close the TCP session.
|
||||||
|
|
||||||
|
Both `disable_ach` and `erase` default to true. Pass `?erase=false` if
|
||||||
|
you only want to silence the unit without wiping its events.
|
||||||
|
|
||||||
|
Caller pattern (bash):
|
||||||
|
|
||||||
|
until curl -sS --max-time 30 -X POST \\
|
||||||
|
"http://localhost:8001/api/sfm/device/rescue?host=$IP&tcp_port=$P"; do
|
||||||
|
sleep 1
|
||||||
|
done
|
||||||
|
"""
|
||||||
|
log.info(
|
||||||
|
"POST /device/rescue host=%s tcp_port=%s connect_timeout=%.1fs recv_timeout=%.1fs disable_ach=%s erase=%s",
|
||||||
|
host, tcp_port, connect_timeout, recv_timeout, disable_ach, erase,
|
||||||
|
)
|
||||||
|
|
||||||
|
steps: list[dict] = []
|
||||||
|
t0 = time.monotonic()
|
||||||
|
|
||||||
|
try:
|
||||||
|
with _build_client(
|
||||||
|
port, baud, host, tcp_port,
|
||||||
|
timeout=recv_timeout,
|
||||||
|
connect_timeout=connect_timeout,
|
||||||
|
) as client:
|
||||||
|
steps.append({"step": "tcp_connect", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
|
||||||
|
|
||||||
|
try:
|
||||||
|
client.poll()
|
||||||
|
except Exception as exc:
|
||||||
|
log.warning("rescue: poll retry: %s", exc)
|
||||||
|
client.poll()
|
||||||
|
steps.append({"step": "poll", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
|
||||||
|
|
||||||
|
if disable_ach:
|
||||||
|
client.set_call_home_config(auto_call_home_enabled=False)
|
||||||
|
steps.append({"step": "disable_ach", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
|
||||||
|
|
||||||
|
if erase:
|
||||||
|
client.delete_all_events()
|
||||||
|
steps.append({"step": "erase", "ok": True, "elapsed_s": round(time.monotonic() - t0, 2)})
|
||||||
|
|
||||||
|
except ProtocolError as exc:
|
||||||
|
steps.append({"step": "error", "ok": False, "detail": f"protocol: {exc}"})
|
||||||
|
raise HTTPException(status_code=502, detail={"message": f"Protocol error: {exc}", "steps": steps}) from exc
|
||||||
|
except OSError as exc:
|
||||||
|
steps.append({"step": "error", "ok": False, "detail": f"socket: {exc}"})
|
||||||
|
# Connection refused / timed out → device busy in another session. Caller should retry.
|
||||||
|
raise HTTPException(status_code=503, detail={"message": f"Connection error: {exc}", "steps": steps}) from exc
|
||||||
|
except Exception as exc:
|
||||||
|
steps.append({"step": "error", "ok": False, "detail": str(exc)})
|
||||||
|
raise HTTPException(status_code=500, detail={"message": f"Device error: {exc}", "steps": steps}) from exc
|
||||||
|
|
||||||
|
conn_key = SFMCache.make_conn_key(host, tcp_port, port, baud)
|
||||||
|
cleared = get_cache().clear_device(conn_key)
|
||||||
|
return {
|
||||||
|
"status": "ok",
|
||||||
|
"elapsed_s": round(time.monotonic() - t0, 2),
|
||||||
|
"disable_ach": disable_ach,
|
||||||
|
"erase": erase,
|
||||||
|
"steps": steps,
|
||||||
|
"cache_cleared": cleared,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
@app.post("/device/monitor/start")
|
@app.post("/device/monitor/start")
|
||||||
def device_monitor_start(
|
def device_monitor_start(
|
||||||
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
port: Optional[str] = Query(None, description="Serial port (e.g. COM5)"),
|
||||||
@@ -1403,6 +1952,175 @@ def db_set_false_trigger(
|
|||||||
return {"status": "ok", "event_id": event_id, "false_trigger": value}
|
return {"status": "ok", "event_id": event_id, "false_trigger": value}
|
||||||
|
|
||||||
|
|
||||||
|
def _cleanup_event_files(row: dict) -> dict:
|
||||||
|
"""
|
||||||
|
Best-effort cleanup of on-disk waveform / sidecar / pickle / hdf5 files
|
||||||
|
associated with a deleted event row. Returns a dict of {kind: bool} for
|
||||||
|
what was actually removed (true) vs not found / failed (false).
|
||||||
|
"""
|
||||||
|
serial = row.get("serial")
|
||||||
|
bw_name = row.get("blastware_filename")
|
||||||
|
a5_name = row.get("a5_pickle_filename")
|
||||||
|
sc_name = row.get("sidecar_filename")
|
||||||
|
removed: dict = {}
|
||||||
|
if not serial:
|
||||||
|
return removed
|
||||||
|
store = _get_store()
|
||||||
|
# blastware_filename is the "base" — other files derive their paths from it
|
||||||
|
# via WaveformStore helpers. Sidecar and a5 may also be stored under their
|
||||||
|
# own column values if they ever diverged historically.
|
||||||
|
base_name = bw_name or a5_name or sc_name
|
||||||
|
if base_name:
|
||||||
|
bw_path, a5_path = store.paths_for(serial, base_name)
|
||||||
|
sc_path = store.sidecar_path_for(serial, base_name)
|
||||||
|
h5_path = store.hdf5_path_for(serial, base_name)
|
||||||
|
for kind, p in [("blastware", bw_path), ("a5_pickle", a5_path),
|
||||||
|
("sidecar", sc_path), ("hdf5", h5_path)]:
|
||||||
|
try:
|
||||||
|
if p.exists():
|
||||||
|
p.unlink()
|
||||||
|
removed[kind] = True
|
||||||
|
except OSError as exc:
|
||||||
|
log.warning("file cleanup failed for %s (%s): %s", p, kind, exc)
|
||||||
|
removed[kind] = False
|
||||||
|
return removed
|
||||||
|
|
||||||
|
|
||||||
|
@app.delete("/db/events/{event_id}")
|
||||||
|
def db_delete_event(event_id: str) -> dict:
|
||||||
|
"""
|
||||||
|
Hard-delete a single event from the SFM events table and remove any
|
||||||
|
associated on-disk waveform/sidecar/pickle/hdf5 files.
|
||||||
|
|
||||||
|
Returns 404 if the event_id is not found.
|
||||||
|
"""
|
||||||
|
log.info("DELETE /db/events/%s", event_id)
|
||||||
|
deleted = _get_db().delete_event(event_id)
|
||||||
|
if deleted is None:
|
||||||
|
raise HTTPException(status_code=404, detail=f"Event {event_id} not found")
|
||||||
|
files_removed = _cleanup_event_files(deleted)
|
||||||
|
return {
|
||||||
|
"status": "ok",
|
||||||
|
"event_id": event_id,
|
||||||
|
"files_removed": files_removed,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
class BulkDeleteBody(BaseModel):
|
||||||
|
"""Body for POST /db/events/delete_bulk."""
|
||||||
|
serial: Optional[str] = None
|
||||||
|
from_dt: Optional[str] = None # ISO-8601
|
||||||
|
to_dt: Optional[str] = None # ISO-8601
|
||||||
|
false_trigger: Optional[bool] = None
|
||||||
|
ids: Optional[list[str]] = None
|
||||||
|
confirm: bool = False
|
||||||
|
# Safety: when no `ids` are supplied, require this many max rows to
|
||||||
|
# actually be deleted; if the matched count exceeds it, the endpoint
|
||||||
|
# returns a dry-run-style summary instead. Pass None to disable.
|
||||||
|
max_rows: Optional[int] = 10000
|
||||||
|
|
||||||
|
|
||||||
|
@app.post("/db/events/delete_bulk")
|
||||||
|
def db_delete_events_bulk(body: BulkDeleteBody) -> dict:
|
||||||
|
"""
|
||||||
|
Hard-delete multiple events at once, by filter and/or by id list.
|
||||||
|
|
||||||
|
Filters (`serial`, `from_dt`, `to_dt`, `false_trigger`) combine with AND,
|
||||||
|
matching the same semantics as `GET /db/events`. `ids` is an additional
|
||||||
|
inclusion list. At least one filter or non-empty `ids` MUST be supplied
|
||||||
|
— refusing to wipe the whole table.
|
||||||
|
|
||||||
|
Safety knobs:
|
||||||
|
- `confirm` MUST be `true` to actually delete. When false (default),
|
||||||
|
returns the match count without deleting (dry-run).
|
||||||
|
- `max_rows` (default 10,000) caps how many rows can be deleted in one
|
||||||
|
call by-filter; if the match count exceeds it, the endpoint returns
|
||||||
|
a count summary without deleting. Ignored when only `ids` is used.
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
{
|
||||||
|
"status": "ok" | "dry_run" | "too_many",
|
||||||
|
"matched": <int>,
|
||||||
|
"deleted": <int>, # 0 unless status == "ok"
|
||||||
|
"files_removed": <int>, # total file unlink successes
|
||||||
|
"sample_serials": [...], # up to 5 distinct serials touched
|
||||||
|
}
|
||||||
|
"""
|
||||||
|
log.info(
|
||||||
|
"POST /db/events/delete_bulk serial=%s from=%s to=%s ft=%s ids=%d confirm=%s max=%s",
|
||||||
|
body.serial, body.from_dt, body.to_dt, body.false_trigger,
|
||||||
|
len(body.ids or []), body.confirm, body.max_rows,
|
||||||
|
)
|
||||||
|
|
||||||
|
from_parsed = datetime.datetime.fromisoformat(body.from_dt) if body.from_dt else None
|
||||||
|
to_parsed = datetime.datetime.fromisoformat(body.to_dt) if body.to_dt else None
|
||||||
|
|
||||||
|
db = _get_db()
|
||||||
|
|
||||||
|
# Dry-run path: count matches without deleting.
|
||||||
|
rows = db.query_events(
|
||||||
|
serial=body.serial,
|
||||||
|
from_dt=from_parsed,
|
||||||
|
to_dt=to_parsed,
|
||||||
|
false_trigger=body.false_trigger,
|
||||||
|
limit=1_000_000, # we want a true count, not a page
|
||||||
|
offset=0,
|
||||||
|
)
|
||||||
|
if body.ids:
|
||||||
|
id_set = set(body.ids)
|
||||||
|
rows = [r for r in rows if r["id"] in id_set]
|
||||||
|
matched = len(rows)
|
||||||
|
sample_serials = sorted({r.get("serial") for r in rows[:50] if r.get("serial")})[:5]
|
||||||
|
|
||||||
|
if not body.confirm:
|
||||||
|
return {
|
||||||
|
"status": "dry_run",
|
||||||
|
"matched": matched,
|
||||||
|
"deleted": 0,
|
||||||
|
"files_removed": 0,
|
||||||
|
"sample_serials": sample_serials,
|
||||||
|
"hint": "Set confirm=true in the request body to actually delete.",
|
||||||
|
}
|
||||||
|
|
||||||
|
if body.max_rows is not None and not body.ids and matched > body.max_rows:
|
||||||
|
return {
|
||||||
|
"status": "too_many",
|
||||||
|
"matched": matched,
|
||||||
|
"deleted": 0,
|
||||||
|
"files_removed": 0,
|
||||||
|
"sample_serials": sample_serials,
|
||||||
|
"hint": (
|
||||||
|
f"Matched {matched} > max_rows={body.max_rows}. Either raise "
|
||||||
|
f"max_rows in the body, narrow the filter, or supply an "
|
||||||
|
f"explicit `ids` list."
|
||||||
|
),
|
||||||
|
}
|
||||||
|
|
||||||
|
try:
|
||||||
|
deleted_rows = db.delete_events_bulk(
|
||||||
|
serial=body.serial,
|
||||||
|
from_dt=from_parsed,
|
||||||
|
to_dt=to_parsed,
|
||||||
|
false_trigger=body.false_trigger,
|
||||||
|
ids=body.ids,
|
||||||
|
)
|
||||||
|
except ValueError as exc:
|
||||||
|
raise HTTPException(status_code=422, detail=str(exc)) from exc
|
||||||
|
|
||||||
|
files_removed = 0
|
||||||
|
for row in deleted_rows:
|
||||||
|
result = _cleanup_event_files(row)
|
||||||
|
files_removed += sum(1 for ok in result.values() if ok)
|
||||||
|
|
||||||
|
return {
|
||||||
|
"status": "ok",
|
||||||
|
"matched": matched,
|
||||||
|
"deleted": len(deleted_rows),
|
||||||
|
"files_removed": files_removed,
|
||||||
|
"sample_serials": sample_serials,
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
# ── /db/events/{id} — waveform file accessors ─────────────────────────────────
|
# ── /db/events/{id} — waveform file accessors ─────────────────────────────────
|
||||||
#
|
#
|
||||||
# These endpoints serve files from the persistent WaveformStore, so a Blastware
|
# These endpoints serve files from the persistent WaveformStore, so a Blastware
|
||||||
|
|||||||
@@ -300,6 +300,16 @@ class WaveformStore:
|
|||||||
except FileNotFoundError:
|
except FileNotFoundError:
|
||||||
pass
|
pass
|
||||||
|
|
||||||
|
# read_blastware_file derives record_type from its path arg, but
|
||||||
|
# that arg is the tmp file (suffix ".bw") — so override with the
|
||||||
|
# original filename's encoded type (H/W/M/E/C in the BW AB0T
|
||||||
|
# scheme). Without this override every BW-imported event lands
|
||||||
|
# in the DB with record_type="Waveform" regardless of the actual
|
||||||
|
# type (Histogram, Manual, etc.).
|
||||||
|
ev.record_type = event_file_io.derive_record_type_from_filename(
|
||||||
|
source_path.name
|
||||||
|
)
|
||||||
|
|
||||||
# Parse the BW ASCII report if one was supplied. Failures here
|
# Parse the BW ASCII report if one was supplied. Failures here
|
||||||
# are non-fatal: we still write the binary + sidecar without the
|
# are non-fatal: we still write the binary + sidecar without the
|
||||||
# rich derived fields.
|
# rich derived fields.
|
||||||
|
|||||||
Reference in New Issue
Block a user