Update to v 0.4.0 #6

2026-06-22T18:06:35-04:00

serversdown commented

2026-06-22 18:06:35 -04:00

## [0.4.0] - 2026-06-22

### Added

#### Live Monitor (fan-out feed)
- **Per-device fan-out monitor** - one shared, cached live feed per device. Multiple clients (dashboards, portal, charts) subscribe to the same stream instead of each fighting for the NL-43's single TCP connection: one poller reads the device, all subscribers get the same frames.
- **WebSocket monitor** - `WS /api/nl43/{unit_id}/monitor` delivers an instant first frame from cache, then live updates.
- **Monitor control** - `POST /api/nl43/{unit_id}/monitor/{start|stop}`, `GET /api/nl43/_monitor/status`. A persistent `monitor_enabled` flag auto-starts the keepalive on boot.
- **Adaptive polling** - poll rate adapts to demand; unreachable devices back off; a device-offline alert fires when a monitored unit drops.
- **De-duplication** - the background poller skips units already covered by an active monitor (no double-polling); a heartbeat keeps the feed warm.
- **Lower latency** - the monitor caches run state, roughly halving live-feed latency; fan-out emits an instant first frame + offline status to new clients.

#### Alert Engine
- **Threshold rules** - per-device alert rules (metric + threshold + cooldown) with full CRUD: `POST/GET/PUT/DELETE /api/nl43/{unit_id}/alerts/rules[/{rule_id}]`.
- **Events + state machine** - onset/clear tracking via `GET /api/nl43/{unit_id}/alerts/events`; acknowledge with `POST .../events/{event_id}/ack`. A `cooldown_s` is enforced between onsets.
- **24/7 evaluation** - enabled rules pin the monitor on, so rules evaluate continuously even with no UI client connected.
- **Resilience** - editing or deleting a rule resets its state and closes any open event; device-offline events are raised when a monitored unit goes unreachable.

#### Data & History
- **Live-chart backfill** - a downsampled DOD trail is persisted to a new `nl43_readings` table, exposed via `GET /api/nl43/{unit_id}/history` so charts can backfill recent history on load.
- **LN1/LN2 percentiles** - L1/L10 (configurable percentiles) surfaced through SLMM in the status and live-feed payloads.
- **measurement_start_time** included in the cached `/status` response.

#### Device control
- **Per-device disconnect** - `POST /api/nl43/{unit_id}/disconnect` drops a device's pooled connection.
- **Deactivate / standby** - `POST /api/nl43/{unit_id}/deactivate` and global `POST /api/nl43/_system/standby` to quiesce polling/monitoring.

### Changed
- **DRD streaming reuses the pooled connection** rather than opening a separate socket, avoiding contention with the persistent pool on a single-connection device.
- **Connection pool** - idle-TTL / max-age checks can now be disabled; pool status is logged periodically.

### Fixed
- **Measurement-start confirmation** - `/start` now recognizes the device's `Start` state. It previously waited for `Measure`, which never matched, so the start cycle ran the full retry loop and Terra-View's proxy timed out with a misleading "Unknown error" even though the device had started.
- **Garbled reads** - corrupted measurement-state reads that produced phantom STOPPED/STARTED transitions are now ignored.
- **DOD parsing** - corrected field parsing and stopped spurious measurement-time resets.
- **Monitor WebSocket** - quieted a send-after-close race on client disconnect.

### Database
- **New tables** (auto-created on startup via `Base.metadata.create_all`): `alert_rules`, `alert_events`, `nl43_readings`.
- **Migrations for existing tables** (run once per database): `migrate_add_ln_percentiles.py` (LN1/LN2 on `nl43_status`), `migrate_add_monitor_enabled.py` (`monitor_enabled` on `nl43_config`).

### Notes
- Pairs with the matching Terra-View `dev` build, which reads SLMM's `/monitor` fan-out feed for live SLM dashboards (L1/L10 lines, live-chart backfill). Ship the two together.

---


serversdown added 21 commits 2026-06-22 18:06:35 -04:00

fix: correct DOD field parsing and stop measurement-time resets a7983d2958

Two device-data bugs surfaced while scoping the live-feed work:

1. DOD parser misalignment. DOD's response has no leading counter and
   includes LE + LN1-LN5, but the parser reused the DRD field map
   (parts[0]=counter). That shifted everything: Lp was stored as the
   counter, Leq as Lp, LE as Leq, and LN1 as Lpeak (visible because
   "Lpeak" came out below Lmax, which is impossible). Parse DOD with its
   own map: Lp=0, Leq=1, Lmax=3, Lmin=4, Lpeak=10 (channel 1 = main).

2. measurement_start_time reset on every live-stream open/close. The DOD
   path tags state "Start"; the DRD stream path tags "Measure". The
   transition detector treated only "Start" as measuring, so opening the
   stream ("Start"->"Measure") read as a stop (cleared start time) and
   closing it ("Measure"->"Start") read as a start (reset to now). Every
   viewer reset the elapsed measurement time. Treat {"Start","Measure"}
   both as measuring.

LN1/LN2 (L1/L10) parsing + model/serialization is the next step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
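
The DOD field map from the commit above can be sketched as follows. This is an illustration only: it assumes the DOD response is a single comma-separated line (the commit does not state the delimiter), and `DOD_FIELDS` and `parse_dod` are hypothetical names, not SLMM's actual parser.

```python
from typing import Dict, Optional

# Field positions taken from the commit message (channel 1 = main).
# LE and the percentile slots are not mapped in this sketch.
DOD_FIELDS = {"lp": 0, "leq": 1, "lmax": 3, "lmin": 4, "lpeak": 10}


def parse_dod(line: str) -> Dict[str, Optional[float]]:
    """Parse one DOD response line (assumed comma-separated) into named levels."""
    parts = [p.strip() for p in line.split(",")]
    values: Dict[str, Optional[float]] = {}
    for name, index in DOD_FIELDS.items():
        try:
            values[name] = float(parts[index])
        except (IndexError, ValueError):
            # Missing or non-numeric field: report None rather than guessing.
            values[name] = None
    return values


# Made-up sample line with 11 fields, for illustration only.
sample = "60.1,58.2,70.0,75.3,40.2,72.1,65.0,55.0,50.0,45.0,90.4"
reading = parse_dod(sample)
```

With a dedicated map, `reading["lpeak"]` comes from index 10 and can no longer fall below `reading["lmax"]` because of a shifted column.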

feat: surface LN1/LN2 (L1/L10) percentiles through SLMM 51dd6b682d

Completes the SLMM side of the L1/L10 live-display contract. The NL-43's
DOD response carries percentile slots LN1-LN5 (channel 1, parts[5]/[6]);
parse the first two and expose them as ln1/ln2 end to end:

- NL43Snapshot dataclass: ln1/ln2 fields
- NL43Status model: ln1/ln2 columns (+ migrate_add_ln_percentiles.py)
- DOD parser: snap.ln1=parts[5], snap.ln2=parts[6]
- persist_snapshot writes them
- all /status data dicts, StatusPayload, and the DRD stream payload emit
  ln1/ln2 (null on the DRD stream itself, which doesn't carry percentiles)

Labels: the device's LN1 slot defaults to L5, not L1, while Terra-View
labels the two lines L1/L10 by default. The device's LN1/LN2 slots must
therefore be set to 1%/10% for the labels to be accurate (dynamic label
emission is a follow-up).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat: add per-device disconnect endpoint 0793e7df01

POST /api/nl43/{unit_id}/disconnect cleanly closes (TCP FIN + wait_closed)
and drops the pooled connection for a single device, freeing the NL43's
one connection slot. Previously only /_connections/flush existed, which
tears down every device at once.

Idempotent; no-op if nothing is cached. Releases the idle pooled
connection only — an active DRD stream/command has the socket checked out
of the pool, so close the stream WebSocket to end a live stream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat: per-unit deactivate and global SLMM standby b954eb8c89

Lets an instance stop occupying a device's single TCP connection slot so
another instance (e.g. prod) can take over.

Per-unit:
- POST /api/nl43/{unit_id}/deactivate — poll_enabled=False (persisted) +
  drop the connection (waits up to 10s for in-flight ops via the device
  lock, then discards). Unit stays dormant across restarts.
- POST /api/nl43/{unit_id}/activate — re-enable polling.

Global standby:
- POST /api/nl43/_system/standby — poller idles and releases ALL
  connections; the loop keeps re-releasing so the instance holds no slots.
- POST /api/nl43/_system/resume — resume polling.
- GET  /api/nl43/_system/status — active vs standby + active_connections.
- SLMM_POLLING_ENABLED=false starts an instance in standby (persistent
  way to keep a dev box from latching onto a prod-owned device).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix: ignore garbled measurement-state reads (phantom STOPPED/STARTED) 8c17af4849

A buffer desync on the shared persistent connection (commonly right after
a DRD/DOD test) can make a Measure? read return a stray value. The state
classifier treated anything not in {"Start","Measure"} as "not measuring",
so a garbled read logged a phantom STOPPED, the next clean read logged
STARTED, and that reset measurement_start_time — producing constant
STOPPED/STARTED device-log pairs and a drifting elapsed timer.

Now only recognized states drive transitions: {"Start","Measure"} =
measuring, {"Stop"} = stopped, anything else = no change. Garbled reads
are also not persisted as the cached state, so they can't poison the next
transition check. Builds on the earlier Start<->Measure normalization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
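
A minimal sketch of the three-way classification described in the commit above: recognized measuring states, the recognized stopped state, and "anything else = no change". The function names and the tuple return shape are assumptions for illustration, not the actual SLMM code.

```python
from typing import Optional, Tuple

MEASURING_STATES = {"Start", "Measure"}
STOPPED_STATES = {"Stop"}


def classify_state(raw: str) -> Optional[bool]:
    """Return True if measuring, False if stopped, None if unrecognized."""
    state = raw.strip()
    if state in MEASURING_STATES:
        return True
    if state in STOPPED_STATES:
        return False
    return None


def detect_transition(cached: Optional[bool], raw: str) -> Tuple[Optional[str], Optional[bool]]:
    """Return (event, new_cached); event is 'STARTED', 'STOPPED', or None."""
    current = classify_state(raw)
    if current is None:
        # Garbled read: emit nothing and leave the cached state untouched.
        return None, cached
    if cached is not None and current != cached:
        return ("STARTED" if current else "STOPPED"), current
    return None, current


# Replay a read sequence containing one garbled value.
cached: Optional[bool] = None
events = []
for raw in ["Start", "Measure", "x9?", "Start", "Stop"]:
    event, cached = detect_transition(cached, raw)
    events.append(event)
```

In this replay the garbled `"x9?"` read produces no event and does not overwrite the cached state, so the only transition logged is the real stop at the end.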

feat: per-device live monitor (fan-out) + alert evaluator (POC) aa3e088b64

The piece the live-view + alerting work was building toward.

monitor.py — one DOD poll loop per device, broadcast to many subscribers:
- browser WebSockets (fixes the single-connection "second viewer sees
  nothing" contention — browsers no longer each open a device stream)
- the alert evaluator (can keep a feed running with no browser via
  /monitor/start, so alerting runs continuously)
- persistence (each snapshot written like the poller)
DOD-sourced, so the broadcast carries ln1/ln2 (which DRD cannot). All polls
go through the existing per-device lock + pool, so it serializes safely with
the background poller and on-demand commands.

alerts.py — pluggable POC evaluator: fires (logs) when ALERT_METRIC exceeds
ALERT_THRESHOLD_DB with an ALERT_COOLDOWN_SECONDS cooldown. The rule
(instantaneous vs sustained vs L10) is the single swap point; dispatch is a
server log for now (email/SMS later).

Endpoints:
- WS   /api/nl43/{unit_id}/monitor          subscribe to the shared feed
- POST /api/nl43/{unit_id}/monitor/start    keep feed alive w/o a browser
- POST /api/nl43/{unit_id}/monitor/stop     drop the keep-alive
- GET  /api/nl43/_monitor/status            running/subscribers/keepalive

WS endpoint races queue.get() against a disconnect watcher so an idle feed
still detects client drop and doesn't leak a subscription.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
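
The fan-out idea in the commit above (one poll loop, many subscribers) can be sketched with one `asyncio.Queue` per subscriber. The `FanOut` class, its bounded queue size, and the drop-oldest policy are assumptions made for this example; the commit does not describe how monitor.py handles slow clients.

```python
import asyncio
from typing import Set


class FanOut:
    """One producer broadcasts frames to any number of subscriber queues."""

    def __init__(self, maxsize: int = 10) -> None:
        self._subscribers: Set[asyncio.Queue] = set()
        self._maxsize = maxsize

    def subscribe(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue(maxsize=self._maxsize)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)

    def broadcast(self, frame: dict) -> None:
        for queue in self._subscribers:
            if queue.full():
                # Assumed policy: drop the oldest frame so a slow client
                # never blocks the single device poll loop.
                queue.get_nowait()
            queue.put_nowait(frame)


async def demo():
    hub = FanOut()
    first, second = hub.subscribe(), hub.subscribe()
    hub.broadcast({"lp": 61.2, "ln1": 70.4})  # one device read...
    return await first.get(), await second.get()  # ...seen by both clients


frames = asyncio.run(demo())
```

Both subscribers receive the same frame from a single broadcast, which is the property that removes the "second viewer sees nothing" contention.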

feat: alert engine stage 1 — rules, events, state machine, CRUD 9c43e68534

Replaces the POC single-threshold check with a real per-rule engine over
the live monitor feed.

- AlertRule / AlertEvent tables (auto-created via create_all; no migration).
  Rule = {metric, comparison, threshold_db, duration_s, clear_margin_db,
  schedule, channels, recipients}.
- alerts.py: per-(unit,rule) state machine IDLE->ACTIVE->IDLE with duration
  debounce (both edges) + clear_margin hysteresis; onset/clear are distinct
  events; optional nighttime schedule; rule cache w/ invalidation. The
  state-machine core (_evaluate_step) is pure (no DB/clock) for testing.
- Dispatch is a server log (POC); _dispatch() is the seam for a Terra-View
  webhook (email/SMS) later.
- CRUD: POST/GET/PUT/DELETE /{unit}/alerts/rules, GET /{unit}/alerts/events,
  POST /{unit}/alerts/events/{id}/ack.
- test_alert_evaluator.py: synthetic level series proves onset debounce,
  spike rejection, hysteresis hold, and below-comparison (4/4 pass, no device).

Source-agnostic: the same rules transfer unchanged if a unit's feed is later
sourced from FTP intervals instead of the DOD monitor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
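
The commit above describes a pure state-machine step with duration debounce on both edges and clear-margin hysteresis. The sketch below shows one way such a step could look for an "above threshold" rule only; the `Rule` and `State` shapes and the `evaluate_step` signature are assumptions, not the real `_evaluate_step`.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Rule:
    threshold_db: float
    duration_s: float
    clear_margin_db: float


@dataclass(frozen=True)
class State:
    phase: str = "idle"                    # "idle" or "active"
    pending_since: Optional[float] = None  # when the current edge condition began


def evaluate_step(rule: Rule, state: State, level: float,
                  now: float) -> Tuple[State, Optional[str]]:
    """Pure step: no DB, no clock. Returns (new_state, 'onset' | 'clear' | None)."""
    if state.phase == "idle":
        condition = level > rule.threshold_db
        next_phase, event = "active", "onset"
    else:
        # Hysteresis: the level must drop below threshold minus the margin.
        condition = level < rule.threshold_db - rule.clear_margin_db
        next_phase, event = "idle", "clear"

    if not condition:
        return State(state.phase, None), None  # edge condition broken: reset debounce
    since = state.pending_since if state.pending_since is not None else now
    if now - since >= rule.duration_s:
        return State(next_phase, None), event
    return State(state.phase, since), None


# Synthetic series, one sample per second: a rejected spike, then a real breach.
rule = Rule(threshold_db=70.0, duration_s=3.0, clear_margin_db=2.0)
state = State()
events = []
for t, level in enumerate([65, 75, 65, 75, 75, 75, 75, 69, 67, 67, 67, 67]):
    state, event = evaluate_step(rule, state, level, float(t))
    if event:
        events.append((t, event))
```

The single 75 dB spike at t=1 is rejected by the debounce, the 69 dB sample at t=7 is held by the 2 dB margin, and the alert clears only after the level stays below 68 dB for the full duration.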

feat: harden fan-out for live clients — instant first frame + offline status 6b1ec75396

For multiple clients connecting to a live feed (e.g. the client portal):
- cache the last broadcast frame and replay it to a new subscriber on
  connect, so a client sees data immediately instead of waiting a full
  poll cycle.
- broadcast a {"feed_status":"unreachable"} frame once on transition (after
  3 consecutive poll failures) so clients can render an offline state
  instead of a frozen chart; data frames now carry "feed_status":"ok".
  The cached frame reflects current state, so a client connecting while
  offline gets "unreachable" right away too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
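
A small sketch of the behavior in the commit above: cache the last frame for replay, and emit a single `"unreachable"` frame on the third consecutive failure. `FeedState` and its method names are hypothetical; only the frame keys and the failure count of 3 come from the commit message.

```python
from typing import Optional

UNREACHABLE_AFTER = 3  # consecutive poll failures, per the commit message


class FeedState:
    """Tracks poll outcomes and the frame a new subscriber should see first."""

    def __init__(self) -> None:
        self.failures = 0
        self.last_frame: Optional[dict] = None  # replayed on connect

    def on_success(self, data: dict) -> dict:
        self.failures = 0
        self.last_frame = {**data, "feed_status": "ok"}
        return self.last_frame

    def on_failure(self) -> Optional[dict]:
        """Return an 'unreachable' frame once, on the transition only."""
        self.failures += 1
        if self.failures == UNREACHABLE_AFTER:
            self.last_frame = {"feed_status": "unreachable"}
            return self.last_frame
        return None


feed = FeedState()
ok_frame = feed.on_success({"lp": 58.3})
outcomes = [feed.on_failure() for _ in range(4)]
```

Because the cached frame is overwritten on the transition, a client that connects during the outage is handed the `"unreachable"` frame immediately instead of stale level data.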

feat: monitor heartbeat + background poller skips active-monitored units ba622c67d8

- Heartbeat: if nothing has been broadcast in MONITOR_HEARTBEAT_S (default
  25s) — e.g. device offline and silent — send a non-cached keepalive frame
  so a reverse proxy (NPM) doesn't drop the idle WS. New subscribers still
  get the last real frame, not a heartbeat.
- Poller-skip: the 60s background poller now skips any unit with a running
  monitor (MonitorManager.is_active). The monitor already polls it ~1Hz and
  keeps the status cache fresh, so the background poll was redundant and just
  added load/lock-contention on the device's single connection (and churn,
  which matters for the cellular wedge). Trade-off: the FTP start-time sync
  (only in the poller) doesn't run while a unit is actively monitored — fine,
  since reports take the authoritative start time from the FTP .rnd data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Merge pull request 'merge drd-fix into dev' (#5) from feat/drd-fix into dev 87c06f1519

Reviewed-on: #5

perf: monitor caches run state, ~halving live-feed latency 9d34779171

Each monitor poll was sending DOD? + Measure? (two commands), and the NL43
enforces >=1s between commands, so updates were ~2.5s apart. The run state
changes rarely, so cache it and refresh via Measure? only every
MONITOR_STATE_REFRESH_S (default 30s); most polls now send just DOD? (one
rate-limited command) -> ~1.3s/update. Also trim MONITOR_POLL_INTERVAL to
0.25s since the device rate-limit is the real pacer.

request_dod() gains an optional measurement_state arg: when supplied it
reuses that state and skips the Measure? round-trip; None preserves the old
query-every-time behavior.

~1Hz is the device floor for DOD (the >=1s command spacing); DRD's 10Hz
push isn't reachable via polling, but ~1s is a normal cadence for SLM levels.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
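
The run-state caching described in the commit above amounts to a time-bounded cache in front of the `Measure?` query. The sketch below uses an injectable clock so it can be exercised without a device; `RunStateCache` and `fake_query` are illustrative names, and the 1 s poll spacing in the simulation is a simplification.

```python
import time
from typing import Callable, Optional


class RunStateCache:
    """Cache the device run state; refresh at most every refresh_s seconds."""

    def __init__(self, refresh_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._refresh_s = refresh_s
        self._clock = clock
        self._state: Optional[str] = None
        self._fetched_at = 0.0

    def get(self, query_state: Callable[[], str]) -> str:
        now = self._clock()
        if self._state is None or now - self._fetched_at >= self._refresh_s:
            self._state = query_state()  # the extra Measure? round-trip
            self._fetched_at = now
        return self._state


# Simulate 60 polls, one per second, against a fake clock.
fake_now = [0.0]
queries = []


def fake_query() -> str:
    queries.append(fake_now[0])
    return "Start"


cache = RunStateCache(refresh_s=30.0, clock=lambda: fake_now[0])
for second in range(60):
    fake_now[0] = float(second)
    cache.get(fake_query)
```

In this simulation only two of the 60 polls issue the state query, so the remaining polls cost a single rate-limited command each.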

feat: persistent monitor_enabled flag + auto-start keepalive on boot 43e72ae3c3

Makes live monitoring (and therefore alerting) genuinely 24/7 and
restart-surviving, instead of runtime-only keepalive.

- NL43Config.monitor_enabled (default True) + migrate_add_monitor_enabled.py.
- On startup, auto-start keepalive monitors for every monitor_enabled +
  tcp_enabled unit — so feeds/alerts resume after a restart with no manual step.
- /monitor/start and /monitor/stop now PERSIST monitor_enabled (start=True,
  stop=False) in addition to applying keepalive at runtime, so the toggle
  sticks. Roster output includes monitor_enabled for the admin UI to read.

On by default: configure a unit -> it's monitored 24/7 unless toggled off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat: downsampled DOD trail + history endpoint for live-chart backfill d1d694302c

So a viewer sees recent trend on open instead of a blank chart. Viewing
only — reports still use the device's FTP .rnd data.

- NL43Reading table (auto-creates; no migration): unit_id, timestamp,
  lp/leq/lmax/ln1/ln2.
- Monitor stores one downsampled reading per MONITOR_TRAIL_SAMPLE_S
  (default 60s) from its keepalive poll loop, pruning rows older than
  MONITOR_TRAIL_RETENTION_HOURS (default 24h). ~1440 rows/unit max.
- GET /api/nl43/{unit}/history?hours=N -> the trail for the last N hours
  (clamped 0.1-48h), oldest-first.

Because keepalive runs 24/7, the trail fills continuously, so the history
is there whenever someone opens the live view.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
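
The sizing and clamping arithmetic in the commit above can be checked with a few helper functions. These are hypothetical helpers for illustration (the real endpoint and monitor code are not shown in this PR page), and rows are modeled as plain `(timestamp_seconds, lp)` tuples.

```python
from typing import List, Tuple


def clamp_hours(hours: float, low: float = 0.1, high: float = 48.0) -> float:
    """Clamp the ?hours=N query parameter to the range given in the commit."""
    return max(low, min(high, hours))


def max_trail_rows(sample_s: int = 60, retention_hours: int = 24) -> int:
    """Upper bound on stored rows per unit for a sample period and retention."""
    return (retention_hours * 3600) // sample_s


def prune(rows: List[Tuple[float, float]], now: float,
          retention_hours: int = 24) -> List[Tuple[float, float]]:
    """Keep only rows newer than the retention cutoff."""
    cutoff = now - retention_hours * 3600
    return [row for row in rows if row[0] >= cutoff]


rows_per_unit = max_trail_rows()  # 24 h * 3600 s / 60 s
kept = prune([(0.0, 60.0), (90000.0, 61.0)], now=100000.0)
```

With the defaults (60 s sampling, 24 h retention) the bound works out to the 1440 rows per unit quoted in the commit.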

feat: include measurement_start_time in cached /status response b4cea2f287

So consumers (e.g. the command center) can read the elapsed-time clock from
the cached status instead of a fresh device /live read. Added to both the
GET and POST /status data dicts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(monitor): adaptive poll rate, unreachable backoff, device-offline alert 1f5f1fb1f6

Three changes to cut wasted device/cellular load and surface outages:

- Adaptive interval: full-rate (~1.25s) while a browser is subscribed for a
  smooth chart; relaxed cadence (MONITOR_IDLE_POLL_INTERVAL, default 10s) when
  the feed is keepalive-only (alerting). ~8x fewer polls with no viewer ->
  ~8x less cellular traffic on a metered SIM. Note: idle interval also sets the
  alert sampling resolution when nobody is watching.
- Exponential backoff when the device is unreachable (1->2->...->60s cap),
  reset on the first good poll, so a dead/asleep device stops churning
  reconnects (log spam + wasted SYN traffic). Capped at 5s while a browser is
  watching so a recovery still surfaces quickly.
- Device-offline alert: the reachable->unreachable transition raises a
  connectivity AlertEvent (sentinel rule_id=0, metric="connectivity") through
  the existing evaluator/dispatch seam; recovery clears it. Deduped in memory
  and via the DB (so a restart mid-outage doesn't duplicate the event).

MonitorManager.status() now reports reachable + current mode (watched/idle/
backoff) for observability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
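
One way to express the interval selection from the commit above is a single pure function. The defaults mirror the numbers quoted in the commit messages (0.25 s watched sleep, 10 s idle, 60 s backoff cap, 5 s cap while watched); the function name, its parameters, and the exact doubling formula are assumptions.

```python
def next_poll_delay(watched: bool, consecutive_failures: int,
                    watched_s: float = 0.25, idle_s: float = 10.0,
                    backoff_cap_s: float = 60.0,
                    watched_backoff_cap_s: float = 5.0) -> float:
    """Seconds to sleep before the next poll.

    watched: True while at least one browser is subscribed.
    consecutive_failures: 0 when the last poll succeeded.
    """
    if consecutive_failures <= 0:
        # Reachable: full rate for viewers, relaxed cadence for keepalive-only.
        return watched_s if watched else idle_s
    # Unreachable: 1 -> 2 -> 4 -> ... seconds, doubling per failure.
    backoff = float(2 ** (consecutive_failures - 1))
    cap = watched_backoff_cap_s if watched else backoff_cap_s
    return min(backoff, cap)


delays_idle = [next_poll_delay(False, n) for n in range(1, 9)]
delays_watched = [next_poll_delay(True, n) for n in range(1, 5)]
```

The ~8x figure in the commit follows from the two cadences: a 10 s idle interval against a ~1.25 s effective watched interval gives 10 / 1.25 = 8.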

fix(monitor): quiet send-after-close race on WS disconnect 5bc542e92f

When a monitor subscriber disconnects mid-frame (the client portal closes its
stream on every tab switch via the Page Visibility guard), the loop could pull a
queued payload during the 1s wait and then send_json into an already-closing
socket -> "Unexpected ASGI message 'websocket.send' after ... websocket.close",
logged as a WARNING on every disconnect.

Re-check gone.done() after the queue wait and break before sending; treat the
residual send-after-close as expected (debug, not warning). No behavior change —
the connection was already closing as intended; this just stops the log spam.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
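
The fix in the commit above is to re-check the disconnect flag after the queue wait and before sending. A simplified sketch, assuming the disconnect watcher is modeled as a plain `asyncio.Future` named `gone` and the 1 s wait as `asyncio.wait_for`; the `pump` function and the `send` callback are stand-ins for the real WebSocket handler.

```python
import asyncio
from typing import Awaitable, Callable, List


async def pump(queue: asyncio.Queue, gone: asyncio.Future,
               send: Callable[[dict], Awaitable[None]]) -> None:
    """Forward queued frames until the client is gone."""
    while not gone.done():
        try:
            payload = await asyncio.wait_for(queue.get(), timeout=1.0)
        except asyncio.TimeoutError:
            continue  # idle feed: loop around and re-check the disconnect flag
        if gone.done():
            break  # client left during the wait: do not send into a closing socket
        await send(payload)


async def demo() -> List[dict]:
    queue: asyncio.Queue = asyncio.Queue()
    gone = asyncio.get_running_loop().create_future()
    delivered: List[dict] = []

    async def send(payload: dict) -> None:
        delivered.append(payload)

    await queue.put({"n": 1})
    task = asyncio.create_task(pump(queue, gone, send))
    await asyncio.sleep(0.05)   # first frame is delivered
    gone.set_result(None)       # client disconnects...
    await queue.put({"n": 2})   # ...and a frame arrives during the wait
    await task
    return delivered


delivered = asyncio.run(demo())
```

The second frame is dequeued but never sent, which is the case that previously triggered the send-after-close warning.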

feat(alerts): enabled rules pin the monitor on (24/7 evaluation) b51fefca2b

The evaluator only runs inside the monitor loop, so a rule on an idle device
never fired. Now creating/updating/deleting an alert rule calls
_sync_keepalive_to_rules: if the unit has any enabled rule, persist
NL43Config.monitor_enabled=True (so the boot auto-start re-enables it after a
restart) and turn on runtime keepalive. Never auto-OFF — a device may be kept
alive for other reasons; operators control that on /admin/slmm. Alert CRUD
endpoints are now async to await the monitor manager.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fix(alerts): enforce cooldown_s between onsets cfdeada9d6

cooldown_s was stored + shown in the UI but never read, so a repeatedly-breaching
signal (e.g. intermittent traffic noise) would flood the alert history with an
event per spike. The evaluator now suppresses a new onset within cooldown_s of the
last, holding the edge so it fires the moment the window lapses if still breaching.
Hysteresis still gates clears. getattr-guarded so partial rule fixtures don't crash.

Verified: existing 4 evaluator tests pass; cooldown scenario (onset → clear →
suppressed re-breach → onset after window) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
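
The cooldown gate from the commit above can be sketched as a pure predicate plus a replay loop. Here each timestamp stands for a candidate onset that has already passed the duration debounce; `onset_allowed` and `replay` are illustrative names, not the evaluator's real functions.

```python
from typing import List, Optional


def onset_allowed(now: float, last_onset: Optional[float],
                  cooldown_s: float) -> bool:
    """True if a new onset may fire: no prior onset, or the window has lapsed."""
    if last_onset is None or cooldown_s <= 0:
        return True
    return now - last_onset >= cooldown_s


def replay(candidate_onsets: List[float], cooldown_s: float) -> List[float]:
    """Return the candidate onset times that actually fire."""
    fired: List[float] = []
    last: Optional[float] = None
    for t in candidate_onsets:
        if onset_allowed(t, last, cooldown_s):
            fired.append(t)
            last = t
    return fired


# Intermittent breaches every 20 s with a 60 s cooldown.
fired = replay([0.0, 20.0, 40.0, 60.0, 80.0], cooldown_s=60.0)
```

Three of the five breaches are suppressed; the one at t=60 fires because it lands exactly when the window lapses.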

fix(alerts): reset rule state + close open event on rule edit/delete ad6071b790

invalidate() only dropped the rule cache, not the per-(unit,rule) state machine —
so editing a rule's metric/threshold left a stale 'active' phase that mis-evaluated
against the new config (spurious clear, or suppressed onset), and deleting an
in-alarm rule left an open AlertEvent that kept the client portal stuck "in alarm"
forever. update/delete now call _reset_rule_runtime: forget_rule() drops the state
machine and any open event for that rule is closed.

Verified: existing evaluator tests + cooldown scenario still pass; compiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

fix: recognize 'Start' state when confirming measurement start 6d1c426ee4

The /start handler waited for measurement_state == "Measure", but the
device reports "Start" while measuring. The confirmation check therefore
never matched, so the post-start status loop always ran its full 3x DOD
retry cycle over cellular, pushing the call past ~10s. That blew past the
Terra-View proxy's request timeout and surfaced to users as a misleading
"Unknown error" even though the unit had already started recording.

Match the device's actual reported state (and stay consistent with
persist_snapshot's MEASURING_STATES handling) so /start confirms on the
first attempt and returns promptly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chore: version bump 43b8e53d2d

serversdown merged commit 89b6892656 into main

2026-06-22 18:07:37 -04:00

Reference: serversdown/slmm#6