Feat: add SLM live monitoring improvements #60

Merged
serversdown merged 13 commits from feat/slm-live-monitor into dev 2026-06-10 16:33:26 -04:00

13 Commits

Author SHA1 Message Date
serversdown 2e832708f3 docs: changelog [Unreleased] — SLM live monitoring (fan-out feed + cache-first reads)
Captures the whole feat/slm-live-monitor effort (12 commits): fan-out /monitor
feed consumption, L1/L10 chart lines, live-chart backfill, cache-populated live
panel with measuring/freshness, per-unit + panel refresh, admin keepalive toggle,
dashboard/command-center cache reads, and the dispatchEvent/CancelledError/
freshness fixes. Targets 0.14.0; Upgrade Notes flag the paired SLMM `dev` build +
its migrations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 20:17:10 +00:00
serversdown 5e3645e229 fix(slm): show real cache freshness in device list + draw L1/L10 on card chart
1. "No recent check-in" was always shown because the row's last-check text read
   unit.slm_last_check (a Terra-View roster field the monitor never updates),
   while the live freshness lives in SLMM's cached NL43Status.last_seen. Carry
   that last_seen onto the unit (unit.cache_last_seen) and display it (falling
   back to slm_last_check). Also treat "Measure" as Measuring in the badge, to
   match the panel and the cache's MEASURING_STATES.

2. The dashboard card chart only had Lp + Leq datasets, so L1/L10 never drew even
   though the cards showed them. Add L1 (purple) and L10 (orange) datasets and
   feed ln1/ln2 in both the /history backfill and the live /monitor frames.
   Percentiles parse via numOrNull so a missing "-.-" leaves a gap (spanGaps)
   instead of dropping the line to 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 19:52:57 +00:00
serversdown 88f258d1c7 chore: rename terra-view container for clarity. 2026-06-10 19:45:55 +00:00
serversdown 711ef41e5f feat(slm): auto-populate live panel from cache, per-unit refresh, fix badge overlap
Live Measurements panel no longer sits blank until you click Start Live Stream:
- On open it fills the KPI cards from the cached /status snapshot (lp/leq/lmax/
  L1/L10) and backfills the chart from the /history DOD trail — both pure cache
  reads, no device hit.
- Shows measuring state (● Measuring / ■ Stopped) and a freshness stamp
  ("as of 2:14 PM (12m ago)") that turns amber + "cached" when stale, so a cached
  value is never mistaken for a live reading.
- Polls the cache every 15s while open so the cards stay current without opening
  a device stream; Start Live Stream takes over (and no longer wipes the
  backfilled trail). Chart cap raised 60 -> 600 so the 2h backfill isn't truncated.

Refresh buttons (on-demand, user-initiated single device read via GET /live,
which also updates the cache):
- one per device row in the list, and one in the panel header. Spinner while in
  flight; toast on success/failure; reloads the list so badges + last-check update.

Layout fix: the status badge (Measuring/Active/Idle/Benched) was rendered at the
top-right of the card, colliding with the absolutely-positioned chart/gear icons.
Moved it to the bottom meta row next to "Last check", padded the card content
clear of the action icons, and added the refresh icon to that group.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 18:59:42 +00:00
serversdown e27aef33ac fix(slm): guard htmx.trigger so deploy/bench doesn't throw on pages without #slm-list
toggleSLMDeployed() and the save-config success path both called
htmx.trigger('#slm-list', 'load') guarded only by `typeof htmx !== 'undefined'`.
No page actually has a #slm-list element, so htmx resolved the selector to null
and called null.dispatchEvent(...) -> "can't access property dispatchEvent, e is
null". The deploy POST had already succeeded and the green success message had
already rendered, so the user saw both "Unit marked as deployed." and a red
error. Guard the trigger on the element existing so it's a harmless no-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 18:34:15 +00:00
serversdown 170dedb138 perf(slm): command center loads from cached status, no device pings
get_live_view fired two device calls on every command-center load:
/measurement-state (sends Measure?) and /live (fresh DOD read) — competing
with the monitor's DOD polling. Both are now redundant: the keepalive monitor
keeps NL43Status fresh (~1.3s) and the live-stream WS handles ongoing updates.

Read the cached /status once instead (no device call); derive is_measuring
from measurement_state. Command center opens instantly without poking the
device. (Relies on monitor_start_time now being in /status.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 22:57:45 +00:00
serversdown d92d01dc56 fix(slm): dashboard status from SLMM's cached roster, not a device call
"No recent check-in" read a roster field (slm_last_check) that nothing
stamps, and the live-status fetch hit /measurement-state — which sends
Measure? to the DEVICE every refresh, competing with DOD polling.

Now read SLMM's /roster once: it carries each unit's cached NL43Status
(last_seen, measurement_state) — a cache read, no device call. is_recent is
derived from last_seen (advances only on a successful monitor poll, so
staleness == not being reached) within 5 min, for all non-retired units
(benched units can still be monitored). Net: fewer device calls AND the
dashboard reflects the live monitor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 22:27:01 +00:00
serversdown 17a1a83bdf feat(slm): point dashboard live tile at /monitor too
Finishes the live-view pivot: the SLM dashboard's live-chart tile now uses
the fan-out /monitor feed (multi-viewer, L1/L10) instead of the DRD /stream,
and skips heartbeat / unreachable frames so they don't blank the metrics or
spike the chart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 21:29:50 +00:00
serversdown f5e93d5612 feat(slm): backfill the live chart from the DOD trail on open
On opening the live view, fetch GET /api/slmm/{unit}/history?hours=2 and
seed the chart with the recent trend BEFORE connecting the live socket, so
it opens with context instead of blank. Live frames then append in order.

- backfillChart() populates all four series (Lp/Leq/L1/L10) from the trail.
- initLiveDataStream is async and awaits the backfill before opening the WS.
- Chart rolling window raised 60 -> 600 points so the ~2h backfill (1/min)
  isn't immediately shifted out.
- Trail timestamps are naive UTC -> append 'Z' so they localize consistently
  with the live frames.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 20:00:41 +00:00
serversdown bdc91177e2 feat(admin): per-unit live-monitoring (keepalive) toggle on /admin/slmm
Adds a "Live Monitoring (keepalive)" card listing each SLMM device with its
monitor_enabled state and an Enable/Disable toggle. Reads from /api/slmm/roster
(now includes monitor_enabled) and POSTs to /api/slmm/{unit}/monitor/{start,stop},
which persist the flag in SLMM (survives restarts; auto-started on boot). Shows a
reachability dot + 24/7 ON/OFF badge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 19:28:55 +00:00
serversdown 3b818dcd97 fix(slm): stop monitor proxy leaking CancelledError on stream stop
The /monitor WS proxy cancelled its sibling task on disconnect but then
`except Exception` failed to swallow the resulting CancelledError (a
BaseException), so stopping the stream raised "Exception in ASGI
application". It also only awaited the pending task, leaving the done
task's WebSocketDisconnect unretrieved ("Task exception was never
retrieved"). Await all tasks and catch (CancelledError, Exception).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 19:12:34 +00:00
serversdown 61b144efd2 feat(slm): plot L1/L10 lines on the live chart
The L1/L10 cards populated, but the chart only had Lp + Leq datasets, so
the percentiles weren't drawn. Add L1 (violet) and L10 (amber) lines —
pushed/shifted/cleared alongside Lp/Leq — so the chart shows all four.

(Legend labels are hardcoded L1/L10, matching the default percentile slots;
dynamic ln1_label/ln2_label on the chart is a follow-up if a job reconfigures
the device's Ln slots.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 19:07:42 +00:00
serversdown c56b7f6c99 feat(slm): wire unit live view to the /monitor fan-out feed
The SLM live view now consumes SLMM's shared DOD /monitor feed instead of
the per-client DRD /stream. This fixes the single-connection contention
(many viewers share one device feed) and finally puts L1/L10 in the live
chart (DRD couldn't carry percentiles).

- New WS proxy handler /api/slmm/{unit}/monitor -> SLMM /api/nl43/{unit}/monitor.
  Uses asyncio.wait(FIRST_COMPLETED) + cancel-sibling instead of gather(), so
  it doesn't leave a task sending into a closed socket ("Unexpected ASGI
  message after close").
- Live view JS points at /monitor; onmessage reflects feed_status and ignores
  heartbeat / unreachable frames so they don't blank the cards or zero-spike
  the chart. Adds a small Live/Device-offline badge.

Still on the old /live (DRD): the dashboard live tile (sound_level_meters.html)
— next slice.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 18:13:17 +00:00