diff --git a/CHANGELOG.md b/CHANGELOG.md
index edba477..908f277 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -5,6 +5,50 @@ All notable changes to Terra-View will be documented in this file.
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## [Unreleased]
+
+SLM live monitoring — fan-out feed + cache-first reads. Targets **0.14.0**. The throughline: the NL-43 allows exactly **one** TCP connection at a time, so every page that opened its own device stream (or sent its own `Measure?`/DOD on load) was competing for that single connection — a second viewer saw nothing, and dashboard loads stole polling resolution from the live feed. This release moves Terra-View entirely onto SLMM's shared, cached monitoring: one DOD poll loop per device, fanned out to all viewers; dashboards read SLMM's cache (a DB read on SLMM's side) instead of touching the device; and the live panels populate instantly from cache on open, upgrading to the live WS only on demand. Paired with the SLMM-side work (adaptive poll rate, unreachable backoff, device-offline alert) on SLMM branch `dev`.
+
+### Added
+
+- **Fan-out `/monitor` feed consumption.** The unit live view (`partials/slm_live_view.html`) and the dashboard live tile (`sound_level_meters.html`) now subscribe to SLMM's shared per-device monitor over `WS /api/slmm/{unit}/monitor` instead of each opening its own device stream. Any number of clients attach without each consuming the NL-43's single connection — the "second viewer sees nothing" contention is gone. A WS proxy handler for `/monitor` was added to `backend/routers/slmm.py`.
+- **L1/L10 percentile lines + cards.** Both the per-unit live chart and the dashboard card chart now plot L1 (purple) and L10 (orange) alongside Lp/Leq, and the KPI cards show L1/L10. Sourced from the DOD feed's `ln1`/`ln2` (DRD streaming can't carry percentiles, DOD can). Missing/`-.-` values leave a gap rather than dropping the line to 0.
+- **Live-chart backfill on open.** Charts seed from SLMM's downsampled DOD trail (`GET /api/slmm/{unit}/history?hours=2`) so a viewer sees recent trend immediately instead of a blank chart that fills one point per second.
+- **Live Measurements panel auto-populates from cache.** Opening the dashboard panel fills the KPI cards from cached `/status` and backfills the chart from `/history` — pure cache reads, no device hit. Shows a measuring badge (● Measuring / ■ Stopped) and a freshness stamp ("as of 3:48 PM (10s ago)", amber + "cached" when stale). Re-polls the cache every 15s while open; **Start Live Stream** upgrades to the live WS and no longer wipes the backfilled trail (chart point cap raised 60 → 600).
+- **Refresh buttons** — one per device-list row, one in the panel header. On-demand, user-initiated single device read via `GET /api/slmm/{unit}/live` (which also refreshes SLMM's cache), with a spinner + success/error toast, then reloads the device list.
+- **Per-unit live-monitoring (keepalive) toggle on `/admin/slmm`** — turns a device's server-side keepalive feed on/off (`POST /monitor/start|stop`), so alerting can keep a device's feed running with no browser attached.
+
+### Changed
+
+- **Dashboard device list + command center read SLMM's cache, not the device.** `slm_dashboard.py`'s `get_slm_units` pulls each unit's cached status from SLMM's `/roster` (one call, a SLMM DB read) for the badge + freshness; the command-center `get_live_view` reads cached `/status` instead of sending `Measure?` + a fresh DOD on every load. This stops dashboard loads from stealing the device's single connection from the live monitor. The elapsed-measurement timer still works because `measurement_start_time` is now included in the cached `/status` response.
+- **Device-list freshness reflects real monitoring.** The "Last check" line now uses SLMM's cached `last_seen` (which the monitor advances on every successful poll) via `unit.cache_last_seen`, instead of the `slm_last_check` roster field the monitor never updates. The status badge also treats `Measure` as Measuring, matching the panel and SLMM's cache.
+- **Status badge relocated** to the card's bottom meta row (next to "Last check"), off the top-right corner where it collided with the chart/gear/refresh action icons.
+
+### Fixed
+
+- **Deploy/bench threw `can't access property "dispatchEvent", e is null`.** `toggleSLMDeployed()` and the save-config path called `htmx.trigger('#slm-list', 'load')` guarded only by `typeof htmx !== 'undefined'`; no page has a `#slm-list`, so htmx resolved null and called `null.dispatchEvent(...)`. The deploy POST had already succeeded, so the operator saw both the green success **and** a red error. Both call sites now guard on the element existing (`slm_settings_modal.html`).
+- **Monitor WS proxy leaked `CancelledError` / "task exception never retrieved"** on stream stop — the cleanup awaited pending tasks but only caught `Exception`, missing `CancelledError` (a `BaseException`).
+- **"No recent check-in" shown even on an actively-monitored device** — the row read the stale `slm_last_check` roster field instead of SLMM's live cache (see Changed).
+- **L1/L10 KPI cards populated but the chart drew no L1/L10 lines** — the card chart only had Lp + Leq datasets.
+
+### Upgrade Notes
+
+Requires the **matching SLMM build (branch `dev`)** — Terra-View now depends on SLMM's fan-out `/monitor` feed, `/history` trail, `/status` carrying `ln1`/`ln2` + `measurement_start_time`, cached `/roster` status, and the `monitor_enabled` keepalive flag.
+
+```bash
+# SLMM (branch dev) — REBUILD + MIGRATE (or you'll get `no such column: nl43_status.ln1` 500s)
+cd /home/serversdown/slmm && docker compose build slmm && docker compose up -d slmm
+docker exec terra-view-slmm-1 python3 migrate_add_ln_percentiles.py
+docker exec terra-view-slmm-1 python3 migrate_add_monitor_enabled.py
+
+# Terra-View — NO migration; templates are baked into the image, so rebuild (don't just restart)
+cd /home/serversdown/terra-view && docker compose build terra-view && docker compose up -d terra-view
+```
+
+The two builds must ship **together**. Note the `docker-compose.yml` container was renamed for clarity (now `terra-view-terra-view-1`) — adjust any `docker exec` scripts that referenced the old name.
+
+---
+
 ## [0.13.3] - 2026-06-05
 
 Calibration sync from SFM events. Closes the manual data-entry loop on calibration dates — Terra-View now pulls `device.calibration_date` from each seismograph's most recent event sidecar once a day and updates `RosterUnit.last_calibrated` when the device reports something fresher than what's stored. Manual edits still win when they're newer than the latest event; a fresh event arriving later supersedes the manual edit. Adds a "Sync now" button under Settings → Advanced → Calibration Defaults for on-demand runs, and a `docs/ROADMAP.md` to track in-flight + deferred work.
diff --git a/backend/routers/slm_dashboard.py b/backend/routers/slm_dashboard.py
index 3b93488..d35746c 100644
--- a/backend/routers/slm_dashboard.py
+++ b/backend/routers/slm_dashboard.py
@@ -91,29 +91,43 @@ async def get_slm_units(
     one_hour_ago = datetime.utcnow() - timedelta(hours=1)
 
     for unit in units:
+        # Legacy default from the roster field; refined from SLMM's cached status below.
         unit.is_recent = bool(unit.slm_last_check and unit.slm_last_check > one_hour_ago)
+        unit.measurement_state = None
+        unit.cache_last_seen = None  # SLMM cache last_seen (real monitoring freshness)
 
     if include_measurement:
-        async def fetch_measurement_state(client: httpx.AsyncClient, unit_id: str) -> str | None:
-            try:
-                response = await client.get(f"{SLMM_BASE_URL}/api/nl43/{unit_id}/measurement-state")
-                if response.status_code == 200:
-                    return response.json().get("measurement_state")
-            except Exception:
-                return None
-            return None
-
-        deployed_units = [unit for unit in units if unit.deployed and not unit.retired]
-        if deployed_units:
+        # SLMM's /roster carries each unit's CACHED status (last_seen,
+        # measurement_state) from NL43Status — a DB read on SLMM's side, NOT a device
+        # call. The live monitor refreshes that cache ~every 1.3s, so this reflects
+        # real monitoring without sending Measure? to the device (which the old
+        # /measurement-state did) and competing with DOD polling. One call covers all.
+        slmm_status = {}
+        try:
             async with httpx.AsyncClient(timeout=3.0) as client:
-                tasks = [fetch_measurement_state(client, unit.id) for unit in deployed_units]
-                results = await asyncio.gather(*tasks, return_exceptions=True)
+                r = await client.get(f"{SLMM_BASE_URL}/api/nl43/roster")
+                if r.status_code == 200:
+                    for dev in (r.json().get("devices") or []):
+                        slmm_status[dev.get("unit_id")] = dev.get("status") or {}
+        except Exception:
+            slmm_status = {}
 
-            for unit, state in zip(deployed_units, results):
-                if isinstance(state, Exception):
-                    unit.measurement_state = None
-                else:
-                    unit.measurement_state = state
+        # "Recent" = the monitor has a fresh successful read. last_seen only advances
+        # on a successful poll, so staleness == the device isn't being reached.
+        recent_cutoff = datetime.utcnow() - timedelta(minutes=5)
+        for unit in units:
+            st = slmm_status.get(unit.id)
+            if not st:
+                continue
+            unit.measurement_state = st.get("measurement_state")
+            last_seen = st.get("last_seen")
+            if last_seen:
+                try:
+                    ls = datetime.fromisoformat(last_seen.replace("Z", ""))
+                    unit.is_recent = ls > recent_cutoff
+                    unit.cache_last_seen = ls  # the real freshness the monitor updates
+                except Exception:
+                    pass
 
     return templates.TemplateResponse("partials/slm_device_list.html", {
         "request": request,
@@ -157,25 +171,18 @@ async def get_live_view(request: Request, unit_id: str, db: Session = Depends(ge
     is_measuring = False
 
     try:
-        async with httpx.AsyncClient(timeout=10.0) as client:
-            # Get measurement state
-            state_response = await client.get(
-                f"{SLMM_BASE_URL}/api/nl43/{unit_id}/measurement-state"
-            )
-            if state_response.status_code == 200:
-                state_data = state_response.json()
-                measurement_state = state_data.get("measurement_state", "Unknown")
-                is_measuring = state_data.get("is_measuring", False)
-
-            # Get live status (measurement_start_time is already stored in SLMM database)
-            status_response = await client.get(
-                f"{SLMM_BASE_URL}/api/nl43/{unit_id}/live"
-            )
-            if status_response.status_code == 200:
-                status_data = status_response.json()
-                current_status = status_data.get("data", {})
+        # Read SLMM's CACHED status (NL43Status) — no device call. The live monitor
+        # keeps it fresh (~1.3s) and the live-stream WS provides ongoing updates, so we
+        # no longer fire Measure? + a fresh DOD read at the device on every command-
+        # center load (which competed with DOD polling for the single connection).
+        async with httpx.AsyncClient(timeout=5.0) as client:
+            r = await client.get(f"{SLMM_BASE_URL}/api/nl43/{unit_id}/status")
+            if r.status_code == 200:
+                current_status = r.json().get("data", {})
+                measurement_state = current_status.get("measurement_state")
+                is_measuring = measurement_state in ("Start", "Measure")
     except Exception as e:
-        logger.error(f"Failed to get status for {unit_id}: {e}")
+        logger.error(f"Failed to get cached status for {unit_id}: {e}")
 
     return templates.TemplateResponse("partials/slm_live_view.html", {
         "request": request,
diff --git a/backend/routers/slmm.py b/backend/routers/slmm.py
index 1c73f5e..62a0385 100644
--- a/backend/routers/slmm.py
+++ b/backend/routers/slmm.py
@@ -231,6 +231,76 @@ async def proxy_websocket_live(websocket: WebSocket, unit_id: str):
         logger.info(f"WebSocket proxy closed for {unit_id} (live)")
 
 
+@router.websocket("/{unit_id}/monitor")
+async def proxy_websocket_monitor(websocket: WebSocket, unit_id: str):
+    """
+    Proxy WebSocket connections to SLMM's /monitor (fan-out DOD feed).
+
+    This is the shared ~1Hz DOD feed: many clients subscribe to one device feed
+    (no single-connection contention) and it carries L1/L10 (which the DRD
+    /stream cannot). Preferred over /stream for the live view.
+    """
+    await websocket.accept()
+    logger.info(f"WebSocket accepted for SLMM unit {unit_id} (monitor)")
+
+    target_ws_url = f"{SLMM_WS_BASE_URL}/api/nl43/{unit_id}/monitor"
+    backend_ws = None
+
+    try:
+        backend_ws = await websockets.connect(target_ws_url)
+        logger.info(f"Connected to SLMM monitor feed for {unit_id}")
+
+        async def forward_to_client():
+            """Backend monitor frames -> browser."""
+            async for message in backend_ws:
+                await websocket.send_text(message)
+
+        async def watch_client():
+            """Drain client frames; raises WebSocketDisconnect on close so we can
+            tear the pair down (the monitor feed is server->client only)."""
+            while True:
+                await websocket.receive_text()
+
+        # When EITHER side ends (browser disconnects or backend closes), cancel the
+        # other immediately — avoids sending into a closed socket (the
+        # "Unexpected ASGI message after close" race that asyncio.gather leaves open).
+        tasks = [asyncio.ensure_future(forward_to_client()),
+                 asyncio.ensure_future(watch_client())]
+        done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
+        for t in pending:
+            t.cancel()
+        # Await ALL tasks (the done one AND the cancelled one) and swallow both
+        # the expected WebSocketDisconnect and CancelledError. CancelledError is a
+        # BaseException, so a bare `except Exception` misses it — that's what leaked
+        # the traceback on stop; and awaiting only `pending` left the done task's
+        # exception unretrieved.
+        for t in tasks:
+            try:
+                await t
+            except (asyncio.CancelledError, Exception):
+                pass
+
+    except websockets.exceptions.WebSocketException as e:
+        logger.error(f"WebSocket error connecting to SLMM monitor for {unit_id}: {e}")
+        try:
+            await websocket.send_json({"error": "Failed to connect to SLMM monitor", "detail": str(e)})
+        except Exception:
+            pass
+    except Exception as e:
+        logger.error(f"Unexpected error in monitor proxy for {unit_id}: {e}")
+    finally:
+        if backend_ws:
+            try:
+                await backend_ws.close()
+            except Exception:
+                pass
+        try:
+            await websocket.close()
+        except Exception:
+            pass
+        logger.info(f"WebSocket monitor proxy closed for {unit_id}")
+
+
 # HTTP catch-all route MUST come after specific routes (including WebSocket routes)
 @router.api_route("/{path:path}", methods=["GET", "POST", "PUT", "DELETE", "PATCH"])
 async def proxy_to_slmm(path: str, request: Request):
diff --git a/docker-compose.yml b/docker-compose.yml
index dddde41..74c045c 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -1,6 +1,6 @@
 services:
 
-  terra-view:
+  web-app:
     build: .
     ports:
       - "8001:8001"
diff --git a/templates/admin_slmm.html b/templates/admin_slmm.html
index c9056b1..2b88dcd 100644
--- a/templates/admin_slmm.html
+++ b/templates/admin_slmm.html
@@ -42,6 +42,18 @@ + +
+ Keepalive runs the 1 Hz DOD feed 24/7 (even with no viewer), which powers the live-chart + trail and continuous threshold alerts. Toggling persists and survives restarts. +
+Loading…
+No devices configured.
'; + return; + } + el.innerHTML = devices.map(dev => { + const on = !!dev.monitor_enabled; + const reach = dev.status ? dev.status.is_reachable : null; + const reachDot = reach === false + ? '' + : ''; + return ` +Failed to load devices: ${_esc(e.message)}
`; + } +} + +async function toggleMonitor(unitId, enable) { + const action = enable ? 'start' : 'stop'; + try { + const r = await fetch(`/api/slmm/${encodeURIComponent(unitId)}/monitor/${action}`, { method: 'POST' }); + if (!r.ok) throw new Error('HTTP ' + r.status); + await loadMonitors(); + } catch (e) { + alert('Toggle failed: ' + e.message); + } +} + loadSlmmOverview(); -setInterval(loadSlmmOverview, 30000); +loadMonitors(); +setInterval(() => { loadSlmmOverview(); loadMonitors(); }, 30000); {% endblock %} diff --git a/templates/partials/slm_device_list.html b/templates/partials/slm_device_list.html index 117decb..56e47a7 100644 --- a/templates/partials/slm_device_list.html +++ b/templates/partials/slm_device_list.html @@ -2,7 +2,14 @@ {% if units %} {% for unit in units %}{{ unit.address }}
- {% elif unit.location %} -{{ unit.location }}
+ +{{ unit.address }}
+ {% elif unit.location %} +{{ unit.location }}
{% endif %}