Feat: add SLM live monitoring improvements #60

Merged
serversdown merged 13 commits from feat/slm-live-monitor into dev 2026-06-10 16:33:26 -04:00
serversdown added 13 commits 2026-06-10 16:32:54 -04:00
The SLM live view now consumes SLMM's shared DOD /monitor feed instead of
the per-client DRD /stream. This fixes the single-connection contention
(many viewers share one device feed) and finally puts L1/L10 in the live
chart (DRD couldn't carry percentiles).

- New WS proxy handler /api/slmm/{unit}/monitor -> SLMM /api/nl43/{unit}/monitor.
  Uses asyncio.wait(FIRST_COMPLETED) + cancel-sibling instead of gather(), so
  it doesn't leave a task sending into a closed socket ("Unexpected ASGI
  message after close").
- Live view JS points at /monitor; onmessage reflects feed_status and ignores
  heartbeat / unreachable frames so they don't blank the cards or zero-spike
  the chart. Adds a small Live/Device-offline badge.
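
The frame handling described above can be sketched as follows. This is a Python illustration of logic the commit implements in the browser's onmessage handler, not the actual code: the "heartbeat" and "unreachable" values of feed_status, the lp/leq keys, and the handle_frame name are assumptions made for the example.

```python
import json


def handle_frame(raw, chart_points, badge):
    """Apply one /monitor frame; return True only if it carried a reading.

    Assumed frame shape (hypothetical): a JSON object with a "feed_status"
    key plus "lp" / "leq" readings on data frames.
    """
    frame = json.loads(raw)
    status = frame.get("feed_status")

    if status == "unreachable":
        # Reflect the outage in the badge, but leave cards and chart alone.
        badge["text"] = "Device offline"
        return False
    if status == "heartbeat":
        # Keepalive only: nothing to draw, so no zero-spike on the chart.
        return False

    badge["text"] = "Live"
    chart_points.append((frame.get("lp"), frame.get("leq")))
    return True
```

Returning early for non-data frames keeps the last good values on screen instead of overwriting them with empty fields.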

The dashboard live tile (sound_level_meters.html) is still on the old /live
(DRD) feed; it moves over in the next slice.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The L1/L10 cards populated, but the chart only had Lp + Leq datasets, so
the percentiles weren't drawn. Add L1 (violet) and L10 (amber) lines —
pushed/shifted/cleared alongside Lp/Leq — so the chart shows all four.

(Legend labels are hardcoded L1/L10, matching the default percentile slots;
dynamic ln1_label/ln2_label on the chart is a follow-up if a job reconfigures
the device's Ln slots.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The /monitor WS proxy cancelled its sibling task on disconnect but then
`except Exception` failed to swallow the resulting CancelledError (a
BaseException), so stopping the stream raised "Exception in ASGI
application". It also only awaited the pending task, leaving the done
task's WebSocketDisconnect unretrieved ("Task exception was never
retrieved"). Await all tasks and catch (CancelledError, Exception).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a "Live Monitoring (keepalive)" card listing each SLMM device with its
monitor_enabled state and an Enable/Disable toggle. Reads from /api/slmm/roster
(now includes monitor_enabled) and POSTs to /api/slmm/{unit}/monitor/{start,stop},
which persist the flag in SLMM (survives restarts; auto-started on boot). Shows a
reachability dot + 24/7 ON/OFF badge.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On opening the live view, fetch GET /api/slmm/{unit}/history?hours=2 and
seed the chart with the recent trend BEFORE connecting the live socket, so
it opens with context instead of blank. Live frames then append in order.

- backfillChart() populates all four series (Lp/Leq/L1/L10) from the trail.
- initLiveDataStream is async and awaits the backfill before opening the WS.
- Chart rolling window raised 60 -> 600 points so the ~2h backfill (1/min)
  isn't immediately shifted out.
- Trail timestamps are naive UTC -> append 'Z' so they localize consistently
  with the live frames.
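
The backfill and window sizing above can be illustrated in Python. This is a sketch rather than the page's JavaScript: the row keys ("timestamp", "lp") and function names are assumptions, and tagging the parsed value as UTC stands in for the 'Z' suffix the browser code appends.

```python
from collections import deque
from datetime import datetime, timezone

MAX_POINTS = 600  # rolling window size named in the commit message


def parse_trail_timestamp(naive_utc_iso):
    """Interpret a timezone-less ISO timestamp as UTC."""
    return datetime.fromisoformat(naive_utc_iso).replace(tzinfo=timezone.utc)


def backfill(window, trail):
    """Seed the rolling window from history rows, oldest first."""
    for row in trail:
        window.append((parse_trail_timestamp(row["timestamp"]), row["lp"]))
    return window


# A bounded deque drops the oldest point once it is full, like the chart's
# push/shift pair. About 120 points (2 h at 1/min) fit in 600 but not in 60.
chart_window = deque(maxlen=MAX_POINTS)
```

With the old cap of 60, the first half of a two-hour backfill would be shifted out before the first live frame arrived.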

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Finishes the live-view pivot: the SLM dashboard's live-chart tile now uses
the fan-out /monitor feed (multi-viewer, L1/L10) instead of the DRD /stream,
and skips heartbeat / unreachable frames so they don't blank the metrics or
spike the chart.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
"No recent check-in" read a roster field (slm_last_check) that nothing
stamps, and the live-status fetch hit /measurement-state — which sends
Measure? to the DEVICE every refresh, competing with DOD polling.

Now read SLMM's /roster once: it carries each unit's cached NL43Status
(last_seen, measurement_state); this is a cache read, with no device call.
is_recent is true when last_seen falls within the last 5 min. last_seen
advances only on a successful monitor poll, so a stale value means the device
is not being reached. This applies to all non-retired units (benched units can
still be monitored). Net: fewer device calls AND the dashboard reflects the
live monitor.
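
The recency rule can be sketched as below. The is_recent name and the 5-minute window come from the message above; the signature, the timezone-aware datetime input, and the handling of a missing last_seen are assumptions.

```python
from datetime import datetime, timedelta, timezone

RECENT_WINDOW = timedelta(minutes=5)


def is_recent(last_seen, now=None):
    """True if the cached last_seen is within the last 5 minutes.

    last_seen is assumed to be a timezone-aware datetime, or None when the
    monitor has never reached the unit.
    """
    if last_seen is None:
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_seen <= RECENT_WINDOW
```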

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
get_live_view fired two device calls on every command-center load:
/measurement-state (sends Measure?) and /live (fresh DOD read) — competing
with the monitor's DOD polling. Both are now redundant: the keepalive monitor
keeps NL43Status fresh (~1.3s) and the live-stream WS handles ongoing updates.

Read the cached /status once instead (no device call); derive is_measuring
from measurement_state. Command center opens instantly without poking the
device. (Relies on monitor_start_time now being in /status.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
toggleSLMDeployed() and the save-config success path both called
htmx.trigger('#slm-list', 'load') guarded only by `typeof htmx !== 'undefined'`.
No page actually has a #slm-list element, so htmx resolved the selector to null
and called null.dispatchEvent(...) -> "can't access property dispatchEvent, e is
null". The deploy POST had already succeeded and the green success message had
already rendered, so the user saw both "Unit marked as deployed." and a red
error. Guard the trigger on the element existing so it's a harmless no-op.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Live Measurements panel no longer sits blank until you click Start Live Stream:
- On open it fills the KPI cards from the cached /status snapshot (lp/leq/lmax/
  L1/L10) and backfills the chart from the /history DOD trail — both pure cache
  reads, no device hit.
- Shows measuring state (● Measuring / ■ Stopped) and a freshness stamp
  ("as of 2:14 PM (12m ago)") that turns amber + "cached" when stale, so a cached
  value is never mistaken for a live reading.
- Polls the cache every 15s while open so the cards stay current without opening
  a device stream; Start Live Stream takes over (and no longer wipes the
  backfilled trail). Chart cap raised 60 -> 600 so the 2h backfill isn't truncated.

Refresh buttons (on-demand, user-initiated single device read via GET /live,
which also updates the cache): one per device row in the list and one in the
panel header. Each shows a spinner while in flight and a toast on
success/failure, then reloads the list so badges + last-check update.

Layout fix: the status badge (Measuring/Active/Idle/Benched) was rendered at the
top-right of the card, colliding with the absolutely-positioned chart/gear icons.
Moved it to the bottom meta row next to "Last check", padded the card content
clear of the action icons, and added the refresh icon to that group.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
1. "No recent check-in" was always shown because the row's last-check text read
   unit.slm_last_check (a Terra-View roster field the monitor never updates),
   while the live freshness lives in SLMM's cached NL43Status.last_seen. Carry
   that last_seen onto the unit (unit.cache_last_seen) and display it (falling
   back to slm_last_check). Also treat "Measure" as Measuring in the badge, to
   match the panel and the cache's MEASURING_STATES.

2. The dashboard card chart only had Lp + Leq datasets, so L1/L10 never drew even
   though the cards showed them. Add L1 (purple) and L10 (orange) datasets and
   feed ln1/ln2 in both the /history backfill and the live /monitor frames.
   Percentiles parse via numOrNull so a missing "-.-" leaves a gap (spanGaps)
   instead of dropping the line to 0.
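
A Python analogue of the numOrNull parsing in item 2, shown only to illustrate the gap-versus-zero behaviour; the frontend helper itself is JavaScript and its exact rules are not shown in this PR.

```python
import math


def num_or_none(value):
    """Parse a reading; return None for placeholders such as "-.-"."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    # Treat NaN / infinity as missing too, so they never reach the chart.
    return number if math.isfinite(number) else None


# None marks a gap in the series; coercing to 0 would draw a false dip.
l1_series = [num_or_none(v) for v in ["70.1", "-.-", "71.0"]]
```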

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Captures the whole feat/slm-live-monitor effort (12 commits): fan-out /monitor
feed consumption, L1/L10 chart lines, live-chart backfill, cache-populated live
panel with measuring/freshness, per-unit + panel refresh, admin keepalive toggle,
dashboard/command-center cache reads, and the dispatchEvent/CancelledError/
freshness fixes. Targets 0.14.0; Upgrade Notes flag the paired SLMM `dev` build +
its migrations.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
serversdown merged commit 182e224f3c into dev 2026-06-10 16:33:26 -04:00