• serversdown released this 2026-06-17 16:43:32 -04:00

    SLM live monitoring — fan-out feed + cache-first reads. Targets 0.14.0. The throughline: the NL-43 allows exactly one TCP connection at a time, so every page that opened its own device stream (or sent its own Measure?/DOD on load) was competing for that single connection — a second viewer saw nothing, and dashboard loads stole polling resolution from the live feed. This release moves Terra-View entirely onto SLMM's shared, cached monitoring: one DOD poll loop per device, fanned out to all viewers; dashboards read SLMM's cache (a DB read on SLMM's side) instead of touching the device; and the live panels populate instantly from cache on open, upgrading to the live WS only on demand. Paired with the SLMM-side work (adaptive poll rate, unreachable backoff, device-offline alert) on SLMM branch dev.

    Added

    • Fan-out /monitor feed consumption. The unit live view (partials/slm_live_view.html) and the dashboard live tile (sound_level_meters.html) now subscribe to SLMM's shared per-device monitor over WS /api/slmm/{unit}/monitor instead of each opening its own device stream. Any number of clients attach without each consuming the NL-43's single connection — the "second viewer sees nothing" contention is gone. A WS proxy handler for /monitor was added to backend/routers/slmm.py.
    • L1/L10 percentile lines + cards. Both the per-unit live chart and the dashboard card chart now plot L1 (purple) and L10 (orange) alongside Lp/Leq, and the KPI cards show L1/L10. Sourced from the DOD feed's ln1/ln2 (DRD streaming can't carry percentiles, DOD can). Missing/-.- values leave a gap rather than dropping the line to 0.
    • Live-chart backfill on open. Charts seed from SLMM's downsampled DOD trail (GET /api/slmm/{unit}/history?hours=2) so a viewer sees recent trend immediately instead of a blank chart that fills one point per second.
    • Live Measurements panel auto-populates from cache. Opening the dashboard panel fills the KPI cards from cached /status and backfills the chart from /history — pure cache reads, no device hit. Shows a measuring badge (● Measuring / ■ Stopped) and a freshness stamp ("as of 3:48 PM (10s ago)", amber + "cached" when stale). Re-polls the cache every 15s while open; Start Live Stream upgrades to the live WS and no longer wipes the backfilled trail (chart point cap raised 60 → 600).
    • Refresh buttons — one per device-list row, one in the panel header. On-demand, user-initiated single device read via GET /api/slmm/{unit}/live (which also refreshes SLMM's cache), with a spinner + success/error toast, then reloads the device list.
    • Per-unit live-monitoring (keepalive) toggle on /admin/slmm — turns a device's server-side keepalive feed on/off (POST /monitor/start|stop), so alerting can keep a device's feed running with no browser attached.
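The fan-out idea behind the shared /monitor feed can be sketched as follows. This is a minimal illustration, not SLMM's actual code: `MonitorHub`, `read_device`, and the queue size are hypothetical names and values chosen for the example. It shows one poll per device being delivered to every attached viewer, so that adding viewers never adds device reads.

```python
import asyncio


class MonitorHub:
    """Hypothetical per-device hub: one poll loop, many subscribers."""

    def __init__(self, read_device, interval=1.0):
        self._read_device = read_device  # coroutine doing the single device read
        self._interval = interval
        self._subscribers = set()
        self.latest = None  # last sample, usable for cache-first reads

    def subscribe(self):
        queue = asyncio.Queue(maxsize=10)  # assumed per-viewer buffer size
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._subscribers.discard(queue)

    async def poll_once(self):
        # Exactly one device read per tick, regardless of viewer count.
        sample = await self._read_device()
        self.latest = sample
        for queue in list(self._subscribers):
            if queue.full():
                queue.get_nowait()  # slow viewer: drop its oldest sample
            queue.put_nowait(sample)
        return sample

    async def run(self):
        while True:
            await self.poll_once()
            await asyncio.sleep(self._interval)


async def demo():
    reads = 0

    async def fake_read():  # stands in for the real DOD poll
        nonlocal reads
        reads += 1
        return {"lp": 50.0 + reads}

    hub = MonitorHub(fake_read, interval=0)
    first, second = hub.subscribe(), hub.subscribe()
    await hub.poll_once()
    return reads, await first.get(), await second.get()
```

Running `asyncio.run(demo())` performs a single read and hands the same sample to both subscribers, which is the property that removes the "second viewer sees nothing" contention.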

    Changed

    • Dashboard device list + command center read SLMM's cache, not the device. slm_dashboard.py's get_slm_units pulls each unit's cached status from SLMM's /roster (one call, a SLMM DB read) for the badge + freshness; the command-center get_live_view reads cached /status instead of sending Measure? + a fresh DOD on every load. This stops dashboard loads from stealing the device's single connection from the live monitor. The elapsed-measurement timer still works because measurement_start_time is now included in the cached /status response.
    • Device-list freshness reflects real monitoring. The "Last check" line now uses SLMM's cached last_seen (which the monitor advances on every successful poll) via unit.cache_last_seen, instead of the slm_last_check roster field the monitor never updates. The status badge also treats Measure as Measuring, matching the panel and SLMM's cache.
    • Status badge relocated to the card's bottom meta row (next to "Last check"), off the top-right corner where it collided with the chart/gear/refresh action icons.
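The badge and freshness logic described above could look roughly like this. It is a sketch under stated assumptions: the 60-second stale threshold, the function names, and the fallback to "Stopped" for any unrecognized state are all hypothetical and not taken from the Terra-View source.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(seconds=60)  # assumed threshold, not from the source


def normalize_state(raw):
    """Treat the device's 'Measure' the same as 'Measuring' (per the notes)."""
    if raw in ("Measure", "Measuring"):
        return "Measuring"
    return "Stopped"  # assumption: anything else is shown as stopped


def describe_freshness(last_seen, now, stale_after=STALE_AFTER):
    """Build the 'Last check' text from the cached last_seen timestamp."""
    if last_seen is None:
        return {"label": "No recent check-in", "stale": True}
    age = now - last_seen
    seconds = max(0, int(age.total_seconds()))
    if seconds < 60:
        label = f"{seconds}s ago"
    elif seconds < 3600:
        label = f"{seconds // 60}m ago"
    else:
        label = f"{seconds // 3600}h ago"
    return {"label": label, "stale": age > stale_after}


now = datetime(2026, 6, 17, 15, 48, 10, tzinfo=timezone.utc)
fresh = describe_freshness(now - timedelta(seconds=10), now)
old = describe_freshness(now - timedelta(minutes=5), now)
```

Here `fresh` is labeled "10s ago" and not stale, while `old` is labeled "5m ago" and flagged stale, which is the case the UI would render in amber with a "cached" tag.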

    Fixed

    • Deploy/bench threw can't access property "dispatchEvent", e is null. toggleSLMDeployed() and the save-config path called htmx.trigger('#slm-list', 'load') guarded only by typeof htmx !== 'undefined'; no page has a #slm-list, so htmx resolved null and called null.dispatchEvent(...). The deploy POST had already succeeded, so the operator saw both the green success and a red error. Both call sites now guard on the element existing (slm_settings_modal.html).
    • Monitor WS proxy leaked CancelledError / "task exception never retrieved" on stream stop — the cleanup awaited pending tasks but only caught Exception, missing CancelledError (a BaseException).
    • "No recent check-in" shown even on an actively-monitored device — the row read the stale slm_last_check roster field instead of SLMM's live cache (see Changed).
    • L1/L10 KPI cards populated but the chart drew no L1/L10 lines — the card chart only had Lp + Leq datasets.
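The CancelledError fix above follows a general asyncio pattern, sketched here. The helper name `cancel_and_drain` is hypothetical and this is not the proxy's actual cleanup code; it only illustrates why `except Exception` alone misses cancellation, since `asyncio.CancelledError` has derived from `BaseException` since Python 3.8.

```python
import asyncio


async def cancel_and_drain(tasks):
    """Cancel pending tasks and retrieve their outcomes so nothing leaks."""
    for task in tasks:
        task.cancel()
    for task in tasks:
        try:
            await task
        except asyncio.CancelledError:
            pass  # expected: we just cancelled it
        except Exception as exc:
            # Real failures are still surfaced instead of silently dropped.
            print(f"task failed during shutdown: {exc!r}")


async def demo():
    async def pump():  # stands in for a WS forwarding loop
        await asyncio.sleep(3600)

    tasks = [asyncio.create_task(pump()) for _ in range(2)]
    await asyncio.sleep(0)  # let the tasks start
    await cancel_and_drain(tasks)
    return [task.cancelled() for task in tasks]
```

Awaiting each cancelled task is what retrieves its exception; without the explicit `CancelledError` branch the cancellation would propagate out of the cleanup code instead of being absorbed there.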

    Upgrade Notes

    Requires the matching SLMM build (branch dev) — Terra-View now depends on SLMM's fan-out /monitor feed, /history trail, /status carrying ln1/ln2 + measurement_start_time, cached /roster status, and the monitor_enabled keepalive flag.

    # SLMM (branch dev) — REBUILD + MIGRATE (or you'll get `no such column: nl43_status.ln1` 500s)
    cd /home/serversdown/slmm && docker compose build slmm && docker compose up -d slmm
    docker exec terra-view-slmm-1 python3 migrate_add_ln_percentiles.py
    docker exec terra-view-slmm-1 python3 migrate_add_monitor_enabled.py
    
    # Terra-View — NO migration; templates are baked into the image, so rebuild (don't just restart)
    cd /home/serversdown/terra-view && docker compose build terra-view && docker compose up -d terra-view
    

    The two builds must ship together. Note that the container defined in docker-compose.yml was renamed for clarity (now terra-view-terra-view-1); adjust any docker exec scripts that referenced the old name.


    Client portal (new — read-only client-facing view)

    A scoped, read-only portal at /portal/* where a client sees only their
    locations, live. Built inside Terra-View (no new service), reusing the cached
    SLMM feed; every route resolves the client through one swappable
    get_current_client gate, so the interim magic/open-link auth can be replaced
    (M4) without touching routes or templates. Strictly read-only — no device control.

    Added

    • Per-client scoping + interim auth. New Client, ClientAccessToken, and a
      Project.client_id FK. A signed (HMAC) session cookie carries the access-token
      id, re-validated against the DB each request (revoke kills live sessions, with
      server-side expiry). Entry via a magic link (/portal/enter/{token}) or a
      dev-only plain link (/portal/open/{id}, PORTAL_OPEN_LINKS, default off).
    • Live location view. KPI cards (Lp/Leq/Lmax/L1/L10) + chart populate
      instantly from cache, then upgrade to a real ~1 Hz WebSocket stream scoped to
      the client's unit (a scrubbed bridge to the SLMM fan-out feed). The stream
      auto-closes when the tab is hidden (Page Visibility) and after a 15-min idle
      cap, so an abandoned tab can't pin the device at 1 Hz / burn cellular.
    • Locations overview. Live status map (level-colored dots, dark/light CARTO
      tiles) + a status rollup (live/offline counts, "loudest now"). Leq is the
      headline metric.
    • Alerts (config → surface → 24/7). Threshold-rule config on the SLM detail
      page (proxying SLMM's alert CRUD). Internally: breach history + ack. In the
      portal: a read-only, scrubbed history + current-alarm banner + "your alert
      limits" panel. Enabling a rule pins that device's monitor on so alerts
      evaluate round-the-clock.
    • Operator sharing tools. A "View client portal" preview button and a
      "Copy client link" modal (mint / list / revoke magic links) on the project
      page, plus a backend/portal_admin.py CLI.
    • Field-instrument design. Distinctive themed portal — Hanken Grotesk UI +
      IBM Plex Mono readouts, panel system, pulsing live dot, staggered reveal — with a
      light/dark toggle (light default, persisted, no-flash).
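The signed-cookie scheme in the scoping bullet can be sketched with the standard library. This is an illustration under assumptions, not the portal's implementation: the cookie layout (`id.mac`), the function names, and the in-memory `tokens` dict standing in for the ClientAccessToken table are all hypothetical.

```python
import hashlib
import hmac


def sign_session(token_id, secret):
    """Return 'id.mac', where mac is an HMAC-SHA256 over the token id."""
    payload = str(token_id)
    mac = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}.{mac}"


def verify_session(cookie, secret):
    """Return the token id if the signature checks out, else None."""
    payload, sep, mac = cookie.partition(".")
    if not sep or not (payload.isascii() and payload.isdigit()):
        return None
    expected = hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison; bytes avoid errors on non-ASCII input.
    if not hmac.compare_digest(mac.encode(), expected.encode()):
        return None
    return int(payload)


def resolve_client(cookie, secret, tokens, now):
    """Re-validate on every request: signature, then revocation and expiry."""
    token_id = verify_session(cookie, secret)
    if token_id is None:
        return None
    record = tokens.get(token_id)  # stands in for the DB lookup
    if record is None or record["revoked"] or record["expires_at"] <= now:
        return None
    return record["client_id"]


SECRET = b"example-only-secret"  # placeholder; a real SECRET_KEY is required
tokens = {7: {"client_id": "acme", "revoked": False, "expires_at": 2_000}}
cookie = sign_session(7, SECRET)
```

Because the cookie only carries the token id, flipping `revoked` or passing `expires_at` on the server ends the session on the next request, without needing to touch the cookie itself.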

    Security

    • All scoping enforced server-side (404-not-403, no existence leak); client
      endpoints return scrubbed projections (no device-health/internal ids); WS
      frames whitelisted; operator-set strings HTML-escaped before injection (XSS).
      Pre-merge code review hardened cookie expiry, open-links default, and the slug
      collision. Remaining hardening (reverse proxy, TLS, SECRET_KEY, M4 auth) is
      tracked in docs/CLIENT_PORTAL.md → "Security hardening backlog".
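A rough sketch of the three server-side measures named above (404 scoping, whitelisted projections, HTML escaping). The field list, the `NotFound` class, and the function names are hypothetical; the actual portal fields and error handling may differ.

```python
import html

# Assumed whitelist of client-visible fields; anything else is dropped.
PUBLIC_FIELDS = ("lp", "leq", "lmax", "ln1", "ln2", "measuring")


class NotFound(Exception):
    """Maps to HTTP 404 in both the missing and the not-yours case."""

    status_code = 404


def scoped_location(client_id, location_id, locations):
    location = locations.get(location_id)
    # Same error whether the location is absent or belongs to someone else,
    # so a client cannot probe for the existence of other clients' data.
    if location is None or location["client_id"] != client_id:
        raise NotFound()
    return location


def scrub_frame(frame):
    """Whitelist projection: only known-safe keys reach the client."""
    return {key: frame[key] for key in PUBLIC_FIELDS if key in frame}


def safe_label(name):
    """Escape operator-set strings before they are injected into HTML."""
    return html.escape(name)


frame = {"lp": 61.2, "leq": 58.9, "battery": "low", "device_id": 12}
public = scrub_frame(frame)
```

A whitelist is used rather than a blacklist so that any new internal field added to the feed later stays hidden from clients by default.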

    Upgrade Notes

    • Migration: docker compose exec web-app python3 backend/migrate_add_client_portal.py
      (adds projects.client_id; the clients / client_access_tokens tables
      auto-create).
    • Set a real SECRET_KEY in any internet-facing env (signs session cookies),
      and keep PORTAL_OPEN_LINKS=false there.
    • Portal alerts depend on the SLMM dev alert engine (rules/events/evaluator +
      cooldown + keepalive coupling) — same build pairing as above.

    Portal authentication (Phase 1)

    • Each project's client portal is now gated by a secure per-project link + shared password (argon2-hashed). Operators manage it from the project page's Portal access panel (enable, generate password, copy link).
    • Per-project session isolation (a session for one project can't read another's data); brute-force lockout (5 tries / 15 min) on the password gate.
    • Retired the interim magic-link / PORTAL_OPEN_LINKS open links and the portal_admin.py mint-link command.
    • Upgrade: new argon2-cffi dependency → rebuild the image, then run python3 backend/migrate_add_project_portal_auth.py per DB (adds the projects.portal_* columns). SECRET_KEY and COOKIE_SECURE are now passed through in docker-compose.yml (settable via a .env file) — set a real SECRET_KEY (and COOKIE_SECURE=true once on HTTPS) before the portal faces the internet.
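One way the brute-force lockout could work is a sliding window of failed attempts, sketched below. This is an assumption about the mechanism: the notes only state "5 tries / 15 min", and the class name, the keying by project, and the sliding-window interpretation are hypothetical. Password hashing itself (argon2) is left out so the example stays dependency-free.

```python
from collections import defaultdict, deque

MAX_ATTEMPTS = 5           # from the release notes
WINDOW_SECONDS = 15 * 60   # from the release notes


class LoginThrottle:
    """Hypothetical sliding-window lockout for the password gate."""

    def __init__(self):
        self._failures = defaultdict(deque)  # key -> failure timestamps

    def _prune(self, key, now):
        attempts = self._failures[key]
        while attempts and now - attempts[0] >= WINDOW_SECONDS:
            attempts.popleft()  # forget failures older than the window
        return attempts

    def is_locked(self, key, now):
        return len(self._prune(key, now)) >= MAX_ATTEMPTS

    def record_failure(self, key, now):
        self._prune(key, now).append(now)

    def reset(self, key):
        self._failures.pop(key, None)  # e.g. after a successful login


throttle = LoginThrottle()
for second in range(5):
    throttle.record_failure("project-42", now=second)
```

After those five failures the key is locked until the oldest failure ages out of the 15-minute window; whether the real gate keys on project, IP address, or both is not stated in the notes.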