feat: per-device live monitor (fan-out) + alert evaluator (POC)

The piece the live-view + alerting work was building toward.

monitor.py — one DOD poll loop per device, broadcast to many subscribers:
- browser WebSockets (fixes the single-connection "second viewer sees
  nothing" contention — browsers no longer each open a device stream)
- the alert evaluator (a feed can be kept running with no browser attached
  via /monitor/start, so alerting runs continuously)
- persistence (each snapshot is written the same way the poller writes it)
The feed is DOD-sourced, so the broadcast carries ln1/ln2, which DRD does not
provide. All polls go through the existing per-device lock + pool, so the
monitor serializes safely with the background poller and on-demand commands.
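A minimal sketch of the fan-out pattern described above, assuming one `asyncio.Queue` per subscriber and one `asyncio.Lock` per device. `FanOutMonitor`, `device_lock`, the injected `poll` coroutine and the drop-oldest policy for slow subscribers are hypothetical stand-ins, not the actual `app.monitor` API; the real loop also feeds the alert evaluator and persistence and honors keep-alive, all omitted here.

```python
from __future__ import annotations

import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical: one lock per device, shared with the background poller and
# on-demand commands so only one of them talks to a device at a time.
_device_locks: dict[str, asyncio.Lock] = {}


def device_lock(unit_id: str) -> asyncio.Lock:
    lock = _device_locks.get(unit_id)
    if lock is None:
        lock = _device_locks[unit_id] = asyncio.Lock()
    return lock


class FanOutMonitor:
    """One poll loop per device; every snapshot is copied to every subscriber."""

    def __init__(self, unit_id: str, poll: Callable[[str], Awaitable[dict]],
                 interval: float = 1.0, maxsize: int = 10) -> None:
        self.unit_id = unit_id
        self._poll = poll          # hypothetical: reads one DOD snapshot
        self._interval = interval  # 1 Hz by default
        self._maxsize = maxsize
        self._subscribers: set[asyncio.Queue] = set()
        self._task: Optional[asyncio.Task] = None

    def subscriber_count(self) -> int:
        return len(self._subscribers)

    async def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue(maxsize=self._maxsize)
        self._subscribers.add(q)
        if self._task is None:  # first subscriber starts the single poll loop
            self._task = asyncio.create_task(self._run())
        return q

    async def unsubscribe(self, q: asyncio.Queue) -> None:
        self._subscribers.discard(q)
        if not self._subscribers and self._task is not None:
            task, self._task = self._task, None  # last subscriber stops the loop
            task.cancel()
            try:
                await task
            except asyncio.CancelledError:
                pass

    def _broadcast(self, payload: dict) -> None:
        for q in self._subscribers:
            if q.full():
                q.get_nowait()  # drop the oldest so a slow client can't stall others
            q.put_nowait(payload)

    async def _run(self) -> None:
        while True:
            async with device_lock(self.unit_id):  # serialize with other device users
                payload = await self._poll(self.unit_id)
            self._broadcast(payload)  # one device read, N deliveries
            await asyncio.sleep(self._interval)
```

The point of the design is that the device sees exactly one reader no matter how many browsers attach; each extra viewer only costs a queue.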

alerts.py — pluggable POC evaluator: fires (logs) when ALERT_METRIC exceeds
ALERT_THRESHOLD_DB with an ALERT_COOLDOWN_SECONDS cooldown. The rule
(instantaneous vs sustained vs L10) is the single swap point; dispatch is a
server log for now (email/SMS later).
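A rough sketch of the instantaneous threshold-plus-cooldown rule, assuming snapshots are dicts keyed by metric name and that the cooldown is tracked per device. `ThresholdAlert`, `evaluator_from_env`, the per-device cooldown and the env-var defaults are hypothetical; only the three `ALERT_*` names come from this commit. The `rule` method marks where a sustained or L10-based rule would be swapped in.

```python
import logging
import os
import time
from typing import Callable, Dict, Mapping

logger = logging.getLogger("alerts")


class ThresholdAlert:
    """Fire when snapshot[metric] > threshold_db, then stay quiet for cooldown_s."""

    def __init__(self, metric: str, threshold_db: float, cooldown_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.metric = metric
        self.threshold_db = threshold_db
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._last_fired: Dict[str, float] = {}  # assumption: cooldown per device

    def rule(self, value: float) -> bool:
        # The single swap point: override for "sustained" or L10-based rules.
        return value > self.threshold_db

    def evaluate(self, unit_id: str, snapshot: Mapping[str, object]) -> bool:
        raw = snapshot.get(self.metric)
        try:
            value = float(raw)  # device values may arrive as strings
        except (TypeError, ValueError):
            return False  # metric missing or unparseable: never fire
        if not self.rule(value):
            return False
        now = self._clock()
        last = self._last_fired.get(unit_id)
        if last is not None and now - last < self.cooldown_s:
            return False  # still cooling down
        self._last_fired[unit_id] = now
        # POC dispatch: a server log line; email/SMS would slot in here later.
        logger.warning("ALERT %s: %s=%.1f dB > %.1f dB",
                       unit_id, self.metric, value, self.threshold_db)
        return True


def evaluator_from_env() -> ThresholdAlert:
    # Env var names are from the commit message; the defaults are hypothetical.
    return ThresholdAlert(
        metric=os.environ.get("ALERT_METRIC", "ln1"),
        threshold_db=float(os.environ.get("ALERT_THRESHOLD_DB", "85")),
        cooldown_s=float(os.environ.get("ALERT_COOLDOWN_SECONDS", "300")),
    )
```

Injecting `clock` keeps the cooldown testable without sleeping.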

Endpoints:
- WS   /api/nl43/{unit_id}/monitor          subscribe to the shared feed
- POST /api/nl43/{unit_id}/monitor/start    keep feed alive w/o a browser
- POST /api/nl43/{unit_id}/monitor/stop     drop the keep-alive
- GET  /api/nl43/_monitor/status            running/subscribers/keepalive
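The keep-alive semantics behind start/stop can be sketched as a small registry, assuming a feed should run whenever a browser is attached or keep-alive is on. `MonitorState`, `MonitorRegistry` and the exact status shape are hypothetical; the real `monitor_manager` holds live poll tasks rather than a derived flag.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict


@dataclass
class MonitorState:
    unit_id: str
    subscribers: int = 0     # attached WebSocket clients
    keepalive: bool = False  # set by POST .../monitor/start, cleared by .../stop

    @property
    def running(self) -> bool:
        # Run while any browser is attached OR keep-alive is on.
        return self.keepalive or self.subscribers > 0


class MonitorRegistry:
    """Lazily creates exactly one monitor per unit_id (runtime-only)."""

    def __init__(self) -> None:
        self._monitors: Dict[str, MonitorState] = {}
        self._lock = asyncio.Lock()

    async def get(self, unit_id: str) -> MonitorState:
        async with self._lock:  # concurrent first requests must not create two feeds
            if unit_id not in self._monitors:
                self._monitors[unit_id] = MonitorState(unit_id)
            return self._monitors[unit_id]

    def status(self) -> Dict[str, dict]:
        # Loosely modeled on GET /api/nl43/_monitor/status
        return {
            uid: {"running": m.running, "subscribers": m.subscribers,
                  "keepalive": m.keepalive}
            for uid, m in self._monitors.items()
        }
```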

WS endpoint runs a disconnect watcher alongside a queue.get() with a 1 s
timeout, so an idle feed still detects a client drop (within about a second)
and doesn't leak a subscription.
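A variant of the same idea using `asyncio.wait(..., return_when=FIRST_COMPLETED)`, which reacts to a disconnect immediately rather than on the next timeout tick. To run without a server, the WebSocket is abstracted as two hypothetical callables, `wait_disconnect` and `send`; this is an alternative sketch, not the committed endpoint.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


async def pump(queue: asyncio.Queue,
               wait_disconnect: Callable[[], Awaitable[None]],
               send: Callable[[Any], Awaitable[None]]) -> int:
    """Forward queue items to `send` until the client goes away; return count sent."""
    gone = asyncio.create_task(wait_disconnect())
    getter: Optional[asyncio.Task] = None
    sent = 0
    try:
        while True:
            getter = asyncio.create_task(queue.get())
            # Whichever finishes first wins: new data or the client leaving.
            done, _ = await asyncio.wait({getter, gone},
                                         return_when=asyncio.FIRST_COMPLETED)
            if getter in done:
                await send(getter.result())
                sent += 1
            if gone in done:
                return sent
    finally:
        gone.cancel()
        if getter is not None:
            getter.cancel()  # no-op if it already completed; avoids a stray waiter
```

The committed code's 1 s `wait_for` loop is simpler and bounds detection latency at roughly a second; the `asyncio.wait` form trades a little more task bookkeeping for immediate cleanup.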

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:27:05 +00:00
parent 8c17af4849
commit aa3e088b64
3 changed files with 321 additions and 0 deletions
@@ -246,6 +246,80 @@ async def system_resume():
```python
    return {"status": "ok", "mode": "active", "message": "Polling resumed"}


# ============================================================================
# LIVE MONITOR (fan-out) — one DOD feed per device, broadcast to many clients
# ============================================================================

@router.websocket("/{unit_id}/monitor")
async def monitor_stream(websocket: WebSocket, unit_id: str):
    """Subscribe a browser to the device's shared 1 Hz DOD feed.

    Any number of clients can attach without each opening its own device
    connection (one poll loop per device, fanned out). Same JSON shape as the
    DRD stream, but DOD-sourced so it includes ln1/ln2 (L1/L10).
    """
    await websocket.accept()
    from app.monitor import monitor_manager

    monitor = await monitor_manager.get(unit_id)
    queue = await monitor.subscribe()
    logger.info(f"Monitor subscriber attached for {unit_id} ({monitor.subscriber_count()} total)")

    async def _watch_disconnect():
        # Completes when the client disconnects, so an idle feed (no data) still
        # detects the drop and we don't leak a subscription that keeps the device
        # feed (and its connection) alive.
        try:
            while True:
                msg = await websocket.receive()
                if msg.get("type") == "websocket.disconnect":
                    return
        except Exception:
            return

    gone = asyncio.ensure_future(_watch_disconnect())
    try:
        while not gone.done():
            try:
                payload = await asyncio.wait_for(queue.get(), timeout=1.0)
            except asyncio.TimeoutError:
                continue  # re-check gone.done()
            await websocket.send_json(payload)
    except WebSocketDisconnect:
        logger.info(f"Monitor subscriber disconnected for {unit_id}")
    except Exception as e:
        logger.warning(f"Monitor stream error for {unit_id}: {e}")
    finally:
        gone.cancel()
        await monitor.unsubscribe(queue)


@router.post("/{unit_id}/monitor/start")
async def monitor_start(unit_id: str):
    """Keep the device's feed running even with no browser attached, so alerting
    evaluates continuously. Runtime-only (resets on restart)."""
    from app.monitor import monitor_manager

    monitor = await monitor_manager.get(unit_id)
    await monitor.set_keepalive(True)
    return {"status": "ok", "unit_id": unit_id, "running": monitor.running, "keepalive": True}


@router.post("/{unit_id}/monitor/stop")
async def monitor_stop(unit_id: str):
    """Drop the keep-alive; the feed stops once no browser subscribers remain."""
    from app.monitor import monitor_manager

    monitor = await monitor_manager.get(unit_id)
    await monitor.set_keepalive(False)
    return {"status": "ok", "unit_id": unit_id, "keepalive": False}


@router.get("/_monitor/status")
async def monitor_status():
    """Status of every device monitor (running, subscriber count, keep-alive)."""
    from app.monitor import monitor_manager

    return {"status": "ok", "monitors": monitor_manager.status()}


# ============================================================================
# GLOBAL POLLING STATUS ENDPOINT (must be before /{unit_id} routes)
# ============================================================================
```