Update to v 0.4.0 #6
[0.4.0] - 2026-06-22
--
|
| ### Added
|
| #### Live Monitor (fan-out feed)
| - Per-device fan-out monitor - one shared, cached live feed per device. Multiple clients (dashboards, portal, charts) subscribe to the same stream instead of each fighting for the NL-43's single TCP connection: one poller reads the device, all subscribers get the same frames.
| - WebSocket monitor - `WS /api/nl43/{unit_id}/monitor` delivers an instant first frame from cache, then live updates.
| - Monitor control - `POST /api/nl43/{unit_id}/monitor/{start|stop}`, `GET /api/nl43/_monitor/status`. A persistent `monitor_enabled` flag auto-starts the keepalive on boot.
| - Adaptive polling - poll rate adapts to demand; unreachable devices back off; a device-offline alert fires when a monitored unit drops.
| - De-duplication - the background poller skips units already covered by an active monitor (no double-polling); a heartbeat keeps the feed warm.
| - Lower latency - the monitor caches run state, roughly halving live-feed latency; fan-out emits an instant first frame + offline status to new clients.
|
| #### Alert Engine
| - Threshold rules - per-device alert rules (metric + threshold + cooldown) with full CRUD: `POST/GET/PUT/DELETE /api/nl43/{unit_id}/alerts/rules[/{rule_id}]`.
| - Events + state machine - onset/clear tracking via `GET /api/nl43/{unit_id}/alerts/events`; acknowledge with `POST .../events/{event_id}/ack`. A `cooldown_s` is enforced between onsets.
| - 24/7 evaluation - enabled rules pin the monitor on, so rules evaluate continuously even with no UI client connected.
| - Resilience - editing or deleting a rule resets its state and closes any open event; device-offline events are raised when a monitored unit goes unreachable.
|
| #### Data & History
| - Live-chart backfill - a downsampled DOD trail is persisted to a new `nl43_readings` table, exposed via `GET /api/nl43/{unit_id}/history` so charts can backfill recent history on load.
| - LN1/LN2 percentiles - L1/L10 (configurable percentiles) surfaced through SLMM in the status and live-feed payloads.
| - `measurement_start_time` included in the cached `/status` response.
|
| #### Device control
| - Per-device disconnect - `POST /api/nl43/{unit_id}/disconnect` drops a device's pooled connection.
| - Deactivate / standby - `POST /api/nl43/{unit_id}/deactivate` and global `POST /api/nl43/_system/standby` to quiesce polling/monitoring.
|
| ### Changed
| - DRD streaming reuses the pooled connection rather than opening a separate socket, avoiding contention with the persistent pool on a single-connection device.
| - Connection pool - idle-TTL / max-age checks can now be disabled; pool status is logged periodically.
|
| ### Fixed
| - Measurement-start confirmation - `/start` now recognizes the device's `Start` state. It previously waited for `Measure`, which never matched, so the start cycle ran the full retry loop and Terra-View's proxy timed out with a misleading "Unknown error" even though the device had started.
| - Garbled reads - corrupted measurement-state reads that produced phantom STOPPED/STARTED transitions are now ignored.
| - DOD parsing - corrected field parsing and stopped spurious measurement-time resets.
| - Monitor WebSocket - quieted a send-after-close race on client disconnect.
|
| ### Database
| - New tables (auto-created on startup via `Base.metadata.create_all`): `alert_rules`, `alert_events`, `nl43_readings`.
| - Migrations for existing tables (run once per database): `migrate_add_ln_percentiles.py` (LN1/LN2 on `nl43_status`), `migrate_add_monitor_enabled.py` (`monitor_enabled` on `nl43_config`).
|
| ### Notes
| - Pairs with the matching Terra-View `dev` build, which reads SLMM's `/monitor` fan-out feed for live SLM dashboards (L1/L10 lines, live-chart backfill). Ship the two together.
|
| ---
Two device-data bugs surfaced while scoping the live-feed work:

1. DOD parser misalignment. DOD's response has no leading counter and includes LE + LN1-LN5, but the parser reused the DRD field map (parts[0]=counter). That shifted everything: Lp was stored as the counter, Leq as Lp, LE as Leq, and LN1 as Lpeak (visible because "Lpeak" came out below Lmax, which is impossible). Parse DOD with its own map: Lp=0, Leq=1, Lmax=3, Lmin=4, Lpeak=10 (channel 1 = main).

2. measurement_start_time reset on every live-stream open/close. The DOD path tags state "Start"; the DRD stream path tags "Measure". The transition detector treated only "Start" as measuring, so opening the stream ("Start"->"Measure") read as a stop (cleared start time) and closing it ("Measure"->"Start") read as a start (reset to now). Every viewer reset the elapsed measurement time. Treat {"Start","Measure"} both as measuring.

LN1/LN2 (L1/L10) parsing + model/serialization is the next step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

POST /api/nl43/{unit_id}/disconnect cleanly closes (TCP FIN + wait_closed) and drops the pooled connection for a single device, freeing the NL43's one connection slot. Previously only /_connections/flush existed, which tears down every device at once.

Idempotent; no-op if nothing is cached. Releases the idle pooled connection only: an active DRD stream/command has the socket checked out of the pool, so close the stream WebSocket to end a live stream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Lets an instance stop occupying a device's single TCP connection slot so another instance (e.g. prod) can take over.

Per-unit:
- POST /api/nl43/{unit_id}/deactivate - poll_enabled=False (persisted) + drop the connection (waits up to 10s for in-flight ops via the device lock, then discards). Unit stays dormant across restarts.
- POST /api/nl43/{unit_id}/activate - re-enable polling.
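
For reviewers, the "wait up to 10s for in-flight ops via the device lock, then discard" behaviour of /deactivate might look roughly like the sketch below. This is an illustration only, not the code in this PR: `drop_connection`, `FakeConnection`, and the dict-based `pool` are hypothetical names, and the real pool and lock objects may differ.

```python
import asyncio

LOCK_WAIT_S = 10.0  # "up to 10s" per the commit message


class FakeConnection:
    """Hypothetical stand-in for a pooled device connection."""

    def __init__(self):
        self.closed = False

    async def close(self):
        self.closed = True


async def drop_connection(lock, pool, unit_id, wait_s=LOCK_WAIT_S):
    """Wait up to wait_s for in-flight ops, then discard the pooled connection.

    Returns True if the device lock was acquired in time, False if the wait
    timed out and the connection was discarded anyway.
    """
    acquired = False
    try:
        await asyncio.wait_for(lock.acquire(), timeout=wait_s)
        acquired = True
    except asyncio.TimeoutError:
        pass  # an operation is still in flight; discard regardless
    try:
        conn = pool.pop(unit_id, None)  # no-op if nothing is cached
        if conn is not None:
            await conn.close()
    finally:
        if acquired:
            lock.release()
    return acquired
```

Bounding the wait keeps a stuck command from blocking the hand-off to another instance indefinitely, at the cost of possibly cutting off an operation that runs past the timeout.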
Global standby:
- POST /api/nl43/_system/standby - poller idles and releases ALL connections; the loop keeps re-releasing so the instance holds no slots.
- POST /api/nl43/_system/resume - resume polling.
- GET /api/nl43/_system/status - active vs standby + active_connections.
- SLMM_POLLING_ENABLED=false starts an instance in standby (persistent way to keep a dev box from latching onto a prod-owned device).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A buffer desync on the shared persistent connection (commonly right after a DRD/DOD test) can make a Measure? read return a stray value. The state classifier treated anything not in {"Start","Measure"} as "not measuring", so a garbled read logged a phantom STOPPED, the next clean read logged STARTED, and that reset measurement_start_time, producing constant STOPPED/STARTED device-log pairs and a drifting elapsed timer.

Now only recognized states drive transitions: {"Start","Measure"} = measuring, {"Stop"} = stopped, anything else = no change. Garbled reads are also not persisted as the cached state, so they can't poison the next transition check.

Builds on the earlier Start<->Measure normalization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The piece the live-view + alerting work was building toward.

monitor.py - one DOD poll loop per device, broadcast to many subscribers:
- browser WebSockets (fixes the single-connection "second viewer sees nothing" contention; browsers no longer each open a device stream)
- the alert evaluator (can keep a feed running with no browser via /monitor/start, so alerting runs continuously)
- persistence (each snapshot written like the poller)

DOD-sourced, so the broadcast carries ln1/ln2 (which DRD cannot). All polls go through the existing per-device lock + pool, so it serializes safely with the background poller and on-demand commands.
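
As a rough picture of the "one poll loop, many subscribers" shape described above, here is a minimal asyncio sketch. It is not monitor.py: `FanOutFeed`, `poll_loop`, and the drop-oldest policy for slow subscribers are assumptions made for illustration.

```python
import asyncio


class FanOutFeed:
    """One producer, many subscribers; each subscriber gets its own queue."""

    def __init__(self, maxsize=10):
        self._subscribers = set()
        self._maxsize = maxsize

    def subscribe(self):
        queue = asyncio.Queue(maxsize=self._maxsize)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._subscribers.discard(queue)

    def broadcast(self, frame):
        # Never block the poll loop on a slow client: drop its oldest frame.
        for queue in list(self._subscribers):
            if queue.full():
                queue.get_nowait()
            queue.put_nowait(frame)


async def poll_loop(feed, read_device, interval_s, cycles):
    """Read the device once per cycle and hand the same frame to everyone."""
    for _ in range(cycles):
        feed.broadcast(await read_device())
        await asyncio.sleep(interval_s)
```

The point of the shape is that the device is read once per cycle no matter how many clients are attached, which is what removes the contention for the single TCP connection.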
alerts.py - pluggable POC evaluator: fires (logs) when ALERT_METRIC exceeds ALERT_THRESHOLD_DB with an ALERT_COOLDOWN_SECONDS cooldown. The rule (instantaneous vs sustained vs L10) is the single swap point; dispatch is a server log for now (email/SMS later).

Endpoints:
- WS   /api/nl43/{unit_id}/monitor        subscribe to the shared feed
- POST /api/nl43/{unit_id}/monitor/start  keep feed alive w/o a browser
- POST /api/nl43/{unit_id}/monitor/stop   drop the keep-alive
- GET  /api/nl43/_monitor/status          running/subscribers/keepalive

WS endpoint races queue.get() against a disconnect watcher so an idle feed still detects client drop and doesn't leak a subscription.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replaces the POC single-threshold check with a real per-rule engine over the live monitor feed.

- AlertRule / AlertEvent tables (auto-created via create_all; no migration). Rule = {metric, comparison, threshold_db, duration_s, clear_margin_db, schedule, channels, recipients}.
- alerts.py: per-(unit,rule) state machine IDLE->ACTIVE->IDLE with duration debounce (both edges) + clear_margin hysteresis; onset/clear are distinct events; optional nighttime schedule; rule cache w/ invalidation. The state-machine core (_evaluate_step) is pure (no DB/clock) for testing.
- Dispatch is a server log (POC); _dispatch() is the seam for a Terra-View webhook (email/SMS) later.
- CRUD: POST/GET/PUT/DELETE /{unit}/alerts/rules, GET /{unit}/alerts/events, POST /{unit}/alerts/events/{id}/ack.
- test_alert_evaluator.py: synthetic level series proves onset debounce, spike rejection, hysteresis hold, and below-comparison (4/4 pass, no device).

Source-agnostic: the same rules transfer unchanged if a unit's feed is later sourced from FTP intervals instead of the DOD monitor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

For multiple clients connecting to a live feed (e.g. the client portal):
- cache the last broadcast frame and replay it to a new subscriber on connect, so a client sees data immediately instead of waiting a full poll cycle.
- broadcast a {"feed_status":"unreachable"} frame once on transition (after 3 consecutive poll failures) so clients can render an offline state instead of a frozen chart; data frames now carry "feed_status":"ok".

The cached frame reflects current state, so a client connecting while offline gets "unreachable" right away too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

So a viewer sees recent trend on open instead of a blank chart. Viewing only: reports still use the device's FTP .rnd data.

- NL43Reading table (auto-creates; no migration): unit_id, timestamp, lp/leq/lmax/ln1/ln2.
- Monitor stores one downsampled reading per MONITOR_TRAIL_SAMPLE_S (default 60s) from its keepalive poll loop, pruning rows older than MONITOR_TRAIL_RETENTION_HOURS (default 24h). ~1440 rows/unit max.
- GET /api/nl43/{unit}/history?hours=N -> the trail for the last N hours (clamped 0.1-48h), oldest-first.

Because keepalive runs 24/7, the trail fills continuously, so the history is there whenever someone opens the live view.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
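
The trail arithmetic in the last commit message (one stored reading per 60 s, 24 h retention, hours clamped to 0.1-48) can be checked with a small in-memory sketch. This assumes a plain Python list in place of the NL43Reading table; `Trail`, `offer`, `history`, and `clamp_hours` are hypothetical names, not the PR's API.

```python
from datetime import datetime, timedelta

SAMPLE_S = 60         # default of MONITOR_TRAIL_SAMPLE_S per the commit message
RETENTION_HOURS = 24  # default of MONITOR_TRAIL_RETENTION_HOURS


def clamp_hours(hours, lo=0.1, hi=48.0):
    """Clamp the ?hours=N query value to the documented 0.1-48h range."""
    return max(lo, min(hi, hours))


class Trail:
    """In-memory stand-in for the downsampled readings table."""

    def __init__(self, sample_s=SAMPLE_S, retention_hours=RETENTION_HOURS):
        self.sample_s = sample_s
        self.retention = timedelta(hours=retention_hours)
        self.rows = []  # (timestamp, reading) tuples, oldest-first

    def offer(self, ts, reading):
        """Store at most one reading per sample_s, then prune old rows."""
        if self.rows and (ts - self.rows[-1][0]).total_seconds() < self.sample_s:
            return False  # too soon after the last stored sample
        self.rows.append((ts, reading))
        cutoff = ts - self.retention
        self.rows = [row for row in self.rows if row[0] > cutoff]
        return True

    def history(self, now, hours):
        """Rows from the last `hours` (clamped), oldest-first."""
        since = now - timedelta(hours=clamp_hours(hours))
        return [row for row in self.rows if row[0] >= since]
```

With the defaults, 24 h x 60 samples/h gives the ~1440 rows per unit quoted above, regardless of how fast the underlying poll loop runs.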