invalidate() only dropped the rule cache, not the per-(unit,rule) state machine —
so editing a rule's metric/threshold left a stale 'active' phase that mis-evaluated
against the new config (spurious clear, or suppressed onset), and deleting an
in-alarm rule left an open AlertEvent that kept the client portal stuck "in alarm"
forever. update/delete now call _reset_rule_runtime: forget_rule() drops the state
machine and any open event for that rule is closed.
Verified: existing evaluator tests + cooldown scenario still pass; compiles.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cooldown_s was stored + shown in the UI but never read, so a repeatedly-breaching
signal (e.g. intermittent traffic noise) would flood the alert history with an
event per spike. The evaluator now suppresses a new onset within cooldown_s of the
last, holding the edge so it fires the moment the window lapses if still breaching.
Hysteresis still gates clears. getattr-guarded so partial rule fixtures don't crash.
Verified: existing 4 evaluator tests pass; cooldown scenario (onset → clear →
suppressed re-breach → onset after window) passes.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Three changes to cut wasted device/cellular load and surface outages:
- Adaptive interval: full-rate (~1.25s) while a browser is subscribed for a
smooth chart; relaxed cadence (MONITOR_IDLE_POLL_INTERVAL, default 10s) when
the feed is keepalive-only (alerting). ~8x fewer polls with no viewer ->
~8x less cellular traffic on a metered SIM. Note: idle interval also sets the
alert sampling resolution when nobody is watching.
- Exponential backoff when the device is unreachable (1->2->...->60s cap),
reset on the first good poll, so a dead/asleep device stops churning
reconnects (log spam + wasted SYN traffic). Capped at 5s while a browser is
watching so a recovery still surfaces quickly.
- Device-offline alert: the reachable->unreachable transition raises a
connectivity AlertEvent (sentinel rule_id=0, metric="connectivity") through
the existing evaluator/dispatch seam; recovery clears it. Deduped in memory
and via the DB (so a restart mid-outage doesn't duplicate the event).
MonitorManager.status() now reports reachable + current mode (watched/idle/
backoff) for observability.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Replaces the POC single-threshold check with a real per-rule engine over
the live monitor feed.
- AlertRule / AlertEvent tables (auto-created via create_all; no migration).
Rule = {metric, comparison, threshold_db, duration_s, clear_margin_db,
schedule, channels, recipients}.
- alerts.py: per-(unit,rule) state machine IDLE->ACTIVE->IDLE with duration
debounce (both edges) + clear_margin hysteresis; onset/clear are distinct
events; optional nighttime schedule; rule cache w/ invalidation. The
state-machine core (_evaluate_step) is pure (no DB/clock) for testing.
- Dispatch is a server log (POC); _dispatch() is the seam for a Terra-View
webhook (email/SMS) later.
- CRUD: POST/GET/PUT/DELETE /{unit}/alerts/rules, GET /{unit}/alerts/events,
POST /{unit}/alerts/events/{id}/ack.
- test_alert_evaluator.py: synthetic level series proves onset debounce,
spike rejection, hysteresis hold, and below-comparison (4/4 pass, no device).
Source-agnostic: the same rules transfer unchanged if a unit's feed is later
sourced from FTP intervals instead of the DOD monitor.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The piece the live-view + alerting work was building toward.
monitor.py — one DOD poll loop per device, broadcast to many subscribers:
- browser WebSockets (fixes the single-connection "second viewer sees
nothing" contention — browsers no longer each open a device stream)
- the alert evaluator (can keep a feed running with no browser via
/monitor/start, so alerting runs continuously)
- persistence (each snapshot written like the poller)
DOD-sourced, so the broadcast carries ln1/ln2 (which DRD cannot). All polls
go through the existing per-device lock + pool, so it serializes safely with
the background poller and on-demand commands.
alerts.py — pluggable POC evaluator: fires (logs) when ALERT_METRIC exceeds
ALERT_THRESHOLD_DB with an ALERT_COOLDOWN_SECONDS cooldown. The rule
(instantaneous vs sustained vs L10) is the single swap point; dispatch is a
server log for now (email/SMS later).
Endpoints:
- WS /api/nl43/{unit_id}/monitor subscribe to the shared feed
- POST /api/nl43/{unit_id}/monitor/start keep feed alive w/o a browser
- POST /api/nl43/{unit_id}/monitor/stop drop the keep-alive
- GET /api/nl43/_monitor/status running/subscribers/keepalive
WS endpoint races queue.get() against a disconnect watcher so an idle feed
still detects client drop and doesn't leak a subscription.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>