Makes live monitoring (and therefore alerting) genuinely 24/7 and
restart-surviving, instead of runtime-only keepalive.
- NL43Config.monitor_enabled (default True) + migrate_add_monitor_enabled.py.
- On startup, auto-start keepalive monitors for every monitor_enabled +
tcp_enabled unit — so feeds/alerts resume after a restart with no manual step.
- /monitor/start and /monitor/stop now PERSIST monitor_enabled (start=True,
stop=False) in addition to applying keepalive at runtime, so the toggle
sticks. Roster output includes monitor_enabled for the admin UI to read.
On by default: configure a unit -> it's monitored 24/7 unless toggled off.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Each monitor poll was sending DOD? + Measure? (two commands), and the NL43
enforces >=1s between commands, so updates were ~2.5s apart. The run state
changes rarely, so cache it and refresh via Measure? only every
MONITOR_STATE_REFRESH_S (default 30s); most polls now send just DOD? (one
rate-limited command) -> ~1.3s/update. Also trim MONITOR_POLL_INTERVAL to
0.25s since the device rate-limit is the real pacer.
request_dod() gains an optional measurement_state arg: when supplied it
reuses that state and skips the Measure? round-trip; None preserves the old
query-every-time behavior.
~1Hz is the device floor for DOD (the >=1s command spacing); DRD's 10Hz
push isn't reachable via polling, but ~1s is a normal cadence for SLM levels.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- Heartbeat: if nothing has been broadcast in MONITOR_HEARTBEAT_S (default
25s) — e.g. device offline and silent — send a non-cached keepalive frame
so a reverse proxy (NPM) doesn't drop the idle WS. New subscribers still
get the last real frame, not a heartbeat.
- Poller-skip: the 60s background poller now skips any unit with a running
monitor (MonitorManager.is_active). The monitor already polls it ~1Hz and
keeps the status cache fresh, so the background poll was redundant and just
added load/lock-contention on the device's single connection (and churn,
which matters for the cellular wedge). Trade-off: the FTP start-time sync
(only in the poller) doesn't run while a unit is actively monitored — fine,
since reports take the authoritative start time from the FTP .rnd data.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
For multiple clients connecting to a live feed (e.g. the client portal):
- cache the last broadcast frame and replay it to a new subscriber on
connect, so a client sees data immediately instead of waiting a full
poll cycle.
- broadcast a {"feed_status":"unreachable"} frame once on transition (after
3 consecutive poll failures) so clients can render an offline state
instead of a frozen chart; data frames now carry "feed_status":"ok".
The cached frame reflects current state, so a client connecting while
offline gets "unreachable" right away too.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replaces the POC single-threshold check with a real per-rule engine over
the live monitor feed.
- AlertRule / AlertEvent tables (auto-created via create_all; no migration).
Rule = {metric, comparison, threshold_db, duration_s, clear_margin_db,
schedule, channels, recipients}.
- alerts.py: per-(unit,rule) state machine IDLE->ACTIVE->IDLE with duration
debounce (both edges) + clear_margin hysteresis; onset/clear are distinct
events; optional nighttime schedule; rule cache w/ invalidation. The
state-machine core (_evaluate_step) is pure (no DB/clock) for testing.
- Dispatch is a server log (POC); _dispatch() is the seam for a Terra-View
webhook (email/SMS) later.
- CRUD: POST/GET/PUT/DELETE /{unit}/alerts/rules, GET /{unit}/alerts/events,
POST /{unit}/alerts/events/{id}/ack.
- test_alert_evaluator.py: synthetic level series proves onset debounce,
spike rejection, hysteresis hold, and below-comparison (4/4 pass, no device).
Source-agnostic: the same rules transfer unchanged if a unit's feed is later
sourced from FTP intervals instead of the DOD monitor.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The piece the live-view + alerting work was building toward.
monitor.py — one DOD poll loop per device, broadcast to many subscribers:
- browser WebSockets (fixes the single-connection "second viewer sees
nothing" contention — browsers no longer each open a device stream)
- the alert evaluator (can keep a feed running with no browser via
/monitor/start, so alerting runs continuously)
- persistence (each snapshot written like the poller)
DOD-sourced, so the broadcast carries ln1/ln2 (which DRD cannot). All polls
go through the existing per-device lock + pool, so it serializes safely with
the background poller and on-demand commands.
alerts.py — pluggable POC evaluator: fires (logs) when ALERT_METRIC exceeds
ALERT_THRESHOLD_DB with an ALERT_COOLDOWN_SECONDS cooldown. The rule
(instantaneous vs sustained vs L10) is the single swap point; dispatch is a
server log for now (email/SMS later).
Endpoints:
- WS /api/nl43/{unit_id}/monitor subscribe to the shared feed
- POST /api/nl43/{unit_id}/monitor/start keep feed alive w/o a browser
- POST /api/nl43/{unit_id}/monitor/stop drop the keep-alive
- GET /api/nl43/_monitor/status running/subscribers/keepalive
WS endpoint races queue.get() against a disconnect watcher so an idle feed
still detects client drop and doesn't leak a subscription.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A buffer desync on the shared persistent connection (commonly right after
a DRD/DOD test) can make a Measure? read return a stray value. The state
classifier treated anything not in {"Start","Measure"} as "not measuring",
so a garbled read logged a phantom STOPPED, the next clean read logged
STARTED, and that reset measurement_start_time — producing constant
STOPPED/STARTED device-log pairs and a drifting elapsed timer.
Now only recognized states drive transitions: {"Start","Measure"} =
measuring, {"Stop"} = stopped, anything else = no change. Garbled reads
are also not persisted as the cached state, so they can't poison the next
transition check. Builds on the earlier Start<->Measure normalization.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lets an instance stop occupying a device's single TCP connection slot so
another instance (e.g. prod) can take over.
Per-unit:
- POST /api/nl43/{unit_id}/deactivate — poll_enabled=False (persisted) +
drop the connection (waits up to 10s for in-flight ops via the device
lock, then discards). Unit stays dormant across restarts.
- POST /api/nl43/{unit_id}/activate — re-enable polling.
Global standby:
- POST /api/nl43/_system/standby — poller idles and releases ALL
connections; the loop keeps re-releasing so the instance holds no slots.
- POST /api/nl43/_system/resume — resume polling.
- GET /api/nl43/_system/status — active vs standby + active_connections.
- SLMM_POLLING_ENABLED=false starts an instance in standby (persistent
way to keep a dev box from latching onto a prod-owned device).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
POST /api/nl43/{unit_id}/disconnect cleanly closes (TCP FIN + wait_closed)
and drops the pooled connection for a single device, freeing the NL43's
one connection slot. Previously only /_connections/flush existed, which
tears down every device at once.
Idempotent; no-op if nothing is cached. Releases the idle pooled
connection only — an active DRD stream/command has the socket checked out
of the pool, so close the stream WebSocket to end a live stream.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Completes the SLMM side of the L1/L10 live-display contract. The NL-43's
DOD response carries percentile slots LN1-LN5 (channel 1, parts[5]/[6]);
parse the first two and expose them as ln1/ln2 end to end:
- NL43Snapshot dataclass: ln1/ln2 fields
- NL43Status model: ln1/ln2 columns (+ migrate_add_ln_percentiles.py)
- DOD parser: snap.ln1=parts[5], snap.ln2=parts[6]
- persist_snapshot writes them
- all /status data dicts, StatusPayload, and the DRD stream payload emit
ln1/ln2 (null on the DRD stream itself, which doesn't carry percentiles)
Labels: device LN1 defaults to L5, not L1 — Terra-View defaults the label
to L1/L10, so the device's Ln1/Ln2 slots must be set to 1%/10% for the
labels to be accurate (dynamic label emission is a follow-up).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two device-data bugs surfaced while scoping the live-feed work:
1. DOD parser misalignment. DOD's response has no leading counter and
includes LE + LN1-LN5, but the parser reused the DRD field map
(parts[0]=counter). That shifted everything: Lp was stored as the
counter, Leq as Lp, LE as Leq, and LN1 as Lpeak (visible because
"Lpeak" came out below Lmax, which is impossible). Parse DOD with its
own map: Lp=0, Leq=1, Lmax=3, Lmin=4, Lpeak=10 (channel 1 = main).
2. measurement_start_time reset on every live-stream open/close. The DOD
path tags state "Start"; the DRD stream path tags "Measure". The
transition detector treated only "Start" as measuring, so opening the
stream ("Start"->"Measure") read as a stop (cleared start time) and
closing it ("Measure"->"Start") read as a start (reset to now). Every
viewer reset the elapsed measurement time. Treat {"Start","Measure"}
both as measuring.
LN1/LN2 (L1/L10) parsing + model/serialization is the next step.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
stream_drd() discarded the pooled connection and forced a fresh connect.
The NL43 allows only one TCP connection at a time; over a cellular link
the device does not free its single slot fast enough for an immediate
reconnect, so the fresh connect times out — the live DRD stream fails
while start/stop commands (which reuse the warm pooled socket) keep
working. This surfaced once the persistent connection pool was enabled
(TCP_PERSISTENT_ENABLED=true).
Stream over the already-open pooled connection via acquire() instead of
discard()+_open_connection(), and release() it back to the pool on exit
(after sending SUB to stop the stream) so commands keep reusing the same
single socket. The per-device lock is held for the whole streaming
session, so the poller can't touch the socket concurrently.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Modern Starlette requires `request` as the first positional arg to
TemplateResponse. The old `TemplateResponse(name, context)` form caused
the context dict to be passed as the template name, which Jinja2 then
tried to use as a cache key -> TypeError: unhashable type: 'dict' (500
on GET / and /roster).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- db cache dump on diagnostics request.
- individual device logs, db and files.
-Device logs api endpoints and diagnostics UI.
Fix:
- slmm standalone now uses local TZ (was UTC only before)
- fixed measurement start time logic.
- Implemented a new `/roster` endpoint to retrieve and manage device configurations.
- Added HTML template for the roster page with a table to display device status and actions.
- Introduced functionality to add, edit, and delete devices via the roster interface.
- Enhanced `ConfigPayload` model to include polling options.
- Updated the main application to serve the new roster page and link to it from the index.
- Added validation for polling interval in the configuration payload.
- Created detailed documentation for the roster management features and API endpoints.
- Introduced a new communication guide detailing protocol basics, transport modes, and a quick startup checklist.
- Added a detailed list of commands with their functions and usage for NL-43/NL-53 devices.
- Created a verified quick reference for command formats to prevent common mistakes.
- Implemented an improvements document outlining critical fixes, security enhancements, reliability upgrades, and code quality improvements for the SLMM project.
- Enhanced the frontend with a new button to retrieve all device settings, along with corresponding JavaScript functionality.
- Added a test script for the new settings retrieval API endpoint to demonstrate its usage and validate functionality.
- Implement migration script to add ftp_username and ftp_password columns to nl43_config table.
- Create set_ftp_credentials.py script for updating FTP credentials in the database.
- Update requirements.txt to include aioftp for FTP functionality.
- Enhance index.html with FTP controls including enable, disable, check status, and list files features.
- Add JavaScript functions for handling FTP operations and displaying file lists.