55 Commits

Author SHA1 Message Date
serversdown 89b6892656 Merge pull request 'Update to v 0.4.0' (#6) from dev into main
Reviewed-on: #6
2026-06-22 18:07:36 -04:00
serversdown 43b8e53d2d chore: version bump 2026-06-22 20:54:43 +00:00
serversdown 6d1c426ee4 fix: recognize 'Start' state when confirming measurement start
The /start handler waited for measurement_state == "Measure", but the
device reports "Start" while measuring. The confirmation check therefore
never matched, so the post-start status loop always ran its full 3x DOD
retry cycle over cellular, pushing the call past ~10s. That blew past the
Terra-View proxy's request timeout and surfaced to users as a misleading
"Unknown error" even though the unit had already started recording.

Match the device's actual reported state (and stay consistent with
persist_snapshot's MEASURING_STATES handling) so /start confirms on the
first attempt and returns promptly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-21 20:22:39 +00:00
serversdown ad6071b790 fix(alerts): reset rule state + close open event on rule edit/delete
invalidate() only dropped the rule cache, not the per-(unit,rule) state machine —
so editing a rule's metric/threshold left a stale 'active' phase that mis-evaluated
against the new config (spurious clear, or suppressed onset), and deleting an
in-alarm rule left an open AlertEvent that kept the client portal stuck "in alarm"
forever. update/delete now call _reset_rule_runtime: forget_rule() drops the state
machine and any open event for that rule is closed.

Verified: existing evaluator tests + cooldown scenario still pass; compiles.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 23:40:52 +00:00
serversdown cfdeada9d6 fix(alerts): enforce cooldown_s between onsets
cooldown_s was stored + shown in the UI but never read, so a repeatedly-breaching
signal (e.g. intermittent traffic noise) would flood the alert history with an
event per spike. The evaluator now suppresses a new onset within cooldown_s of the
last, holding the edge so it fires the moment the window lapses if still breaching.
Hysteresis still gates clears. getattr-guarded so partial rule fixtures don't crash.

Verified: existing 4 evaluator tests pass; cooldown scenario (onset → clear →
suppressed re-breach → onset after window) passes.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 22:47:39 +00:00
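The cooldown gating described in cfdeada9d6 can be sketched roughly as below. This is a minimal illustration, not the project's evaluator: `CooldownGate`, `step`, and the returned event strings are hypothetical names, and only `cooldown_s` comes from the commit message. Hysteresis and duration debounce are left out to keep the cooldown logic visible.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CooldownGate:
    """Hypothetical helper: suppress onsets that follow the previous onset too closely."""

    cooldown_s: float
    active: bool = False
    last_onset_t: Optional[float] = None

    def step(self, now: float, breaching: bool) -> Optional[str]:
        """Return "onset", "clear", or None for one sample taken at `now` (seconds)."""
        if self.active:
            if not breaching:
                self.active = False
                return "clear"
            return None
        if not breaching:
            return None
        # Breaching while idle: fire only once the cooldown window has lapsed.
        in_cooldown = (
            self.last_onset_t is not None
            and now - self.last_onset_t < self.cooldown_s
        )
        if in_cooldown:
            return None  # the edge is held and re-checked on the next sample
        self.active = True
        self.last_onset_t = now
        return "onset"
```

Because the gate stays idle while suppressed, a signal that is still breaching when the window lapses produces its onset on the first sample after the window, which matches the "holding the edge" wording in the commit.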
serversdown b51fefca2b feat(alerts): enabled rules pin the monitor on (24/7 evaluation)
The evaluator only runs inside the monitor loop, so a rule on an idle device
never fired. Now creating/updating/deleting an alert rule calls
_sync_keepalive_to_rules: if the unit has any enabled rule, persist
NL43Config.monitor_enabled=True (so the boot auto-start re-enables it after a
restart) and turn on runtime keepalive. Never auto-OFF — a device may be kept
alive for other reasons; operators control that on /admin/slmm. Alert CRUD
endpoints are now async to await the monitor manager.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 19:36:16 +00:00
serversdown 5bc542e92f fix(monitor): quiet send-after-close race on WS disconnect
When a monitor subscriber disconnects mid-frame (the client portal closes its
stream on every tab switch via the Page Visibility guard), the loop could pull a
queued payload during the 1s wait and then send_json into an already-closing
socket -> "Unexpected ASGI message 'websocket.send' after ... websocket.close",
logged as a WARNING on every disconnect.

Re-check gone.done() after the queue wait and break before sending; treat the
residual send-after-close as expected (debug, not warning). No behavior change —
the connection was already closing as intended; this just stops the log spam.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-11 03:29:16 +00:00
serversdown 1f5f1fb1f6 feat(monitor): adaptive poll rate, unreachable backoff, device-offline alert
Three changes to cut wasted device/cellular load and surface outages:

- Adaptive interval: full-rate (~1.25s) while a browser is subscribed for a
  smooth chart; relaxed cadence (MONITOR_IDLE_POLL_INTERVAL, default 10s) when
  the feed is keepalive-only (alerting). ~8x fewer polls with no viewer ->
  ~8x less cellular traffic on a metered SIM. Note: idle interval also sets the
  alert sampling resolution when nobody is watching.
- Exponential backoff when the device is unreachable (1->2->...->60s cap),
  reset on the first good poll, so a dead/asleep device stops churning
  reconnects (log spam + wasted SYN traffic). Capped at 5s while a browser is
  watching so a recovery still surfaces quickly.
- Device-offline alert: the reachable->unreachable transition raises a
  connectivity AlertEvent (sentinel rule_id=0, metric="connectivity") through
  the existing evaluator/dispatch seam; recovery clears it. Deduped in memory
  and via the DB (so a restart mid-outage doesn't duplicate the event).

MonitorManager.status() now reports reachable + current mode (watched/idle/
backoff) for observability.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-10 06:47:20 +00:00
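The interval selection described in 1f5f1fb1f6 could look something like the sketch below. The constants mirror the numbers quoted in the commit message; `next_poll_delay` and its parameters are hypothetical, and the sketch assumes the backoff starts at the first failed poll, which the commit does not state.

```python
FULL_RATE_S = 1.25            # approximate cadence while a browser is subscribed
IDLE_POLL_S = 10.0            # MONITOR_IDLE_POLL_INTERVAL default
BACKOFF_CAP_S = 60.0          # cap when nobody is watching
WATCHED_BACKOFF_CAP_S = 5.0   # tighter cap while a browser is watching


def next_poll_delay(consecutive_failures: int, watched: bool) -> float:
    """Hypothetical helper: seconds to wait before the next poll."""
    if consecutive_failures == 0:
        # Reachable device: pick the cadence from subscriber demand.
        return FULL_RATE_S if watched else IDLE_POLL_S
    # Unreachable device: 1 -> 2 -> 4 ... doubling per failure, then capped.
    cap = WATCHED_BACKOFF_CAP_S if watched else BACKOFF_CAP_S
    exponent = min(consecutive_failures - 1, 16)  # keep 2**n bounded
    return min(float(2 ** exponent), cap)
```

Resetting `consecutive_failures` to zero on the first good poll restores the normal cadence, as the commit describes.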
serversdown b4cea2f287 feat: include measurement_start_time in cached /status response
So consumers (e.g. the command center) can read the elapsed-time clock from
the cached status instead of a fresh device /live read. Added to both the
GET and POST /status data dicts.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 22:57:45 +00:00
serversdown d1d694302c feat: downsampled DOD trail + history endpoint for live-chart backfill
So a viewer sees recent trend on open instead of a blank chart. Viewing
only — reports still use the device's FTP .rnd data.

- NL43Reading table (auto-creates; no migration): unit_id, timestamp,
  lp/leq/lmax/ln1/ln2.
- Monitor stores one downsampled reading per MONITOR_TRAIL_SAMPLE_S
  (default 60s) from its keepalive poll loop, pruning rows older than
  MONITOR_TRAIL_RETENTION_HOURS (default 24h). ~1440 rows/unit max.
- GET /api/nl43/{unit}/history?hours=N -> the trail for the last N hours
  (clamped 0.1-48h), oldest-first.

Because keepalive runs 24/7, the trail fills continuously, so the history
is there whenever someone opens the live view.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 19:58:30 +00:00
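A rough in-memory sketch of the trail logic from d1d694302c follows. It stands in for the `NL43Reading` table with a plain list; `TrailSampler`, `offer`, `history`, and `clamp_hours` are hypothetical names, while the 60 s sample period, 24 h retention, and 0.1-48 h clamp are the defaults quoted in the commit.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[float, Dict[str, float]]  # (timestamp in seconds, reading)


def clamp_hours(hours: float, lo: float = 0.1, hi: float = 48.0) -> float:
    """Clamp the requested history window to the range the endpoint accepts."""
    return max(lo, min(hi, hours))


class TrailSampler:
    """Hypothetical stand-in for the downsampled trail, kept oldest-first."""

    def __init__(self, sample_s: float = 60.0, retention_h: float = 24.0) -> None:
        self.sample_s = sample_s
        self.retention_s = retention_h * 3600.0
        self.rows: List[Row] = []
        self._last_t: Optional[float] = None

    def offer(self, t: float, reading: Dict[str, float]) -> bool:
        """Store at most one reading per sample period; return True if stored."""
        if self._last_t is not None and t - self._last_t < self.sample_s:
            return False
        self._last_t = t
        self.rows.append((t, reading))
        # Prune rows that have aged out of the retention window.
        cutoff = t - self.retention_s
        self.rows = [row for row in self.rows if row[0] >= cutoff]
        return True

    def history(self, now: float, hours: float) -> List[Row]:
        """Return rows from the last `hours` hours, oldest first."""
        cutoff = now - clamp_hours(hours) * 3600.0
        return [row for row in self.rows if row[0] >= cutoff]
```

With a 60 s period and 24 h retention the list holds at most about 1440 rows per unit, which is the bound the commit quotes.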
serversdown 43e72ae3c3 feat: persistent monitor_enabled flag + auto-start keepalive on boot
Makes live monitoring (and therefore alerting) genuinely 24/7 and
restart-surviving, instead of runtime-only keepalive.

- NL43Config.monitor_enabled (default True) + migrate_add_monitor_enabled.py.
- On startup, auto-start keepalive monitors for every monitor_enabled +
  tcp_enabled unit — so feeds/alerts resume after a restart with no manual step.
- /monitor/start and /monitor/stop now PERSIST monitor_enabled (start=True,
  stop=False) in addition to applying keepalive at runtime, so the toggle
  sticks. Roster output includes monitor_enabled for the admin UI to read.

On by default: configure a unit -> it's monitored 24/7 unless toggled off.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 19:27:25 +00:00
serversdown 9d34779171 perf: monitor caches run state, ~halving live-feed latency
Each monitor poll was sending DOD? + Measure? (two commands), and the NL43
enforces >=1s between commands, so updates were ~2.5s apart. The run state
changes rarely, so cache it and refresh via Measure? only every
MONITOR_STATE_REFRESH_S (default 30s); most polls now send just DOD? (one
rate-limited command) -> ~1.3s/update. Also trim MONITOR_POLL_INTERVAL to
0.25s since the device rate-limit is the real pacer.

request_dod() gains an optional measurement_state arg: when supplied it
reuses that state and skips the Measure? round-trip; None preserves the old
query-every-time behavior.

~1Hz is the device floor for DOD (the >=1s command spacing); DRD's 10Hz
push isn't reachable via polling, but ~1s is a normal cadence for SLM levels.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 18:52:13 +00:00
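The caching idea in 9d34779171 can be illustrated with a small sketch. `RunStateCache` and `commands_for_poll` are hypothetical names, and the command order is an assumption; only the `DOD?` and `Measure?` commands and the 30 s `MONITOR_STATE_REFRESH_S` default come from the commit message.

```python
from typing import List, Optional


class RunStateCache:
    """Hypothetical cache for the rarely-changing measurement state."""

    def __init__(self, refresh_s: float = 30.0) -> None:
        self.refresh_s = refresh_s
        self.state: Optional[str] = None
        self._fetched_at: Optional[float] = None

    def needs_refresh(self, now: float) -> bool:
        return self._fetched_at is None or now - self._fetched_at >= self.refresh_s

    def update(self, now: float, state: str) -> None:
        self.state = state
        self._fetched_at = now


def commands_for_poll(cache: RunStateCache, now: float) -> List[str]:
    """Most polls send only DOD?; Measure? is added when the cache is stale."""
    if cache.needs_refresh(now):
        return ["DOD?", "Measure?"]
    return ["DOD?"]
```

Since the device enforces at least 1 s between commands, dropping the second command from most polls is what brings the update spacing down from roughly 2.5 s to roughly 1.3 s.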
serversdown 87c06f1519 Merge pull request 'merge drd-fix into dev' (#5) from feat/drd-fix into dev
Reviewed-on: #5
2026-06-09 14:21:16 -04:00
serversdown ba622c67d8 feat: monitor heartbeat + background poller skips active-monitored units
- Heartbeat: if nothing has been broadcast in MONITOR_HEARTBEAT_S (default
  25s) — e.g. device offline and silent — send a non-cached keepalive frame
  so a reverse proxy (NPM) doesn't drop the idle WS. New subscribers still
  get the last real frame, not a heartbeat.
- Poller-skip: the 60s background poller now skips any unit with a running
  monitor (MonitorManager.is_active). The monitor already polls it ~1Hz and
  keeps the status cache fresh, so the background poll was redundant and just
  added load/lock-contention on the device's single connection (and churn,
  which matters for the cellular wedge). Trade-off: the FTP start-time sync
  (only in the poller) doesn't run while a unit is actively monitored — fine,
  since reports take the authoritative start time from the FTP .rnd data.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:33:29 +00:00
serversdown 6b1ec75396 feat: harden fan-out for live clients — instant first frame + offline status
For multiple clients connecting to a live feed (e.g. the client portal):
- cache the last broadcast frame and replay it to a new subscriber on
  connect, so a client sees data immediately instead of waiting a full
  poll cycle.
- broadcast a {"feed_status":"unreachable"} frame once on transition (after
  3 consecutive poll failures) so clients can render an offline state
  instead of a frozen chart; data frames now carry "feed_status":"ok".
  The cached frame reflects current state, so a client connecting while
  offline gets "unreachable" right away too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 17:13:21 +00:00
serversdown 9c43e68534 feat: alert engine stage 1 — rules, events, state machine, CRUD
Replaces the POC single-threshold check with a real per-rule engine over
the live monitor feed.

- AlertRule / AlertEvent tables (auto-created via create_all; no migration).
  Rule = {metric, comparison, threshold_db, duration_s, clear_margin_db,
  schedule, channels, recipients}.
- alerts.py: per-(unit,rule) state machine IDLE->ACTIVE->IDLE with duration
  debounce (both edges) + clear_margin hysteresis; onset/clear are distinct
  events; optional nighttime schedule; rule cache w/ invalidation. The
  state-machine core (_evaluate_step) is pure (no DB/clock) for testing.
- Dispatch is a server log (POC); _dispatch() is the seam for a Terra-View
  webhook (email/SMS) later.
- CRUD: POST/GET/PUT/DELETE /{unit}/alerts/rules, GET /{unit}/alerts/events,
  POST /{unit}/alerts/events/{id}/ack.
- test_alert_evaluator.py: synthetic level series proves onset debounce,
  spike rejection, hysteresis hold, and below-comparison (4/4 pass, no device).

Source-agnostic: the same rules transfer unchanged if a unit's feed is later
sourced from FTP intervals instead of the DOD monitor.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-09 01:04:03 +00:00
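A pure state-machine step in the spirit of the `_evaluate_step` mentioned in 9c43e68534 might look like the sketch below. It is an illustration under assumptions, not the project's code: the `Rule` and `State` dataclasses and the `evaluate_step` signature are hypothetical, and only an "above threshold" comparison is handled. The field names `threshold_db`, `duration_s`, and `clear_margin_db` are taken from the commit.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Rule:
    threshold_db: float
    duration_s: float
    clear_margin_db: float


@dataclass(frozen=True)
class State:
    active: bool = False
    pending_since: Optional[float] = None  # when the flip condition first held


def evaluate_step(
    rule: Rule, state: State, now: float, level: float
) -> Tuple[State, Optional[str]]:
    """Pure step: no DB and no clock, the caller supplies `now` and `level`."""
    if state.active:
        # Hysteresis: clear only once the level drops below threshold - margin.
        condition = level < rule.threshold_db - rule.clear_margin_db
    else:
        condition = level > rule.threshold_db
    if not condition:
        return State(state.active, None), None  # debounce timer resets
    since = state.pending_since if state.pending_since is not None else now
    if now - since >= rule.duration_s:
        event = "clear" if state.active else "onset"
        return State(not state.active, None), event
    return State(state.active, since), None


def run(rule: Rule, samples: List[Tuple[float, float]]) -> List[Tuple[float, str]]:
    """Feed (time, level) samples through the machine and collect events."""
    state, events = State(), []
    for now, level in samples:
        state, event = evaluate_step(rule, state, now, level)
        if event:
            events.append((now, event))
    return events
```

Keeping the step free of I/O is what lets a synthetic level series exercise onset debounce, spike rejection, and hysteresis hold without a device, as the commit's test description suggests.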
serversdown aa3e088b64 feat: per-device live monitor (fan-out) + alert evaluator (POC)
The piece the live-view + alerting work was building toward.

monitor.py — one DOD poll loop per device, broadcast to many subscribers:
- browser WebSockets (fixes the single-connection "second viewer sees
  nothing" contention — browsers no longer each open a device stream)
- the alert evaluator (can keep a feed running with no browser via
  /monitor/start, so alerting runs continuously)
- persistence (each snapshot written like the poller)
DOD-sourced, so the broadcast carries ln1/ln2 (which DRD cannot). All polls
go through the existing per-device lock + pool, so it serializes safely with
the background poller and on-demand commands.

alerts.py — pluggable POC evaluator: fires (logs) when ALERT_METRIC exceeds
ALERT_THRESHOLD_DB with an ALERT_COOLDOWN_SECONDS cooldown. The rule
(instantaneous vs sustained vs L10) is the single swap point; dispatch is a
server log for now (email/SMS later).

Endpoints:
- WS   /api/nl43/{unit_id}/monitor          subscribe to the shared feed
- POST /api/nl43/{unit_id}/monitor/start    keep feed alive w/o a browser
- POST /api/nl43/{unit_id}/monitor/stop     drop the keep-alive
- GET  /api/nl43/_monitor/status            running/subscribers/keepalive

WS endpoint races queue.get() against a disconnect watcher so an idle feed
still detects client drop and doesn't leak a subscription.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 23:27:05 +00:00
serversdown 8c17af4849 fix: ignore garbled measurement-state reads (phantom STOPPED/STARTED)
A buffer desync on the shared persistent connection (commonly right after
a DRD/DOD test) can make a Measure? read return a stray value. The state
classifier treated anything not in {"Start","Measure"} as "not measuring",
so a garbled read logged a phantom STOPPED, the next clean read logged
STARTED, and that reset measurement_start_time — producing constant
STOPPED/STARTED device-log pairs and a drifting elapsed timer.

Now only recognized states drive transitions: {"Start","Measure"} =
measuring, {"Stop"} = stopped, anything else = no change. Garbled reads
are also not persisted as the cached state, so they can't poison the next
transition check. Builds on the earlier Start<->Measure normalization.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 22:50:18 +00:00
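The three-way classification described in 8c17af4849 can be sketched as follows. `MEASURING_STATES` is a name the log mentions elsewhere; `STOPPED_STATES`, `classify_state`, and `next_transition` are hypothetical, and stripping whitespace from the raw value is an assumption.

```python
from typing import Optional, Tuple

MEASURING_STATES = {"Start", "Measure"}
STOPPED_STATES = {"Stop"}  # hypothetical name for the recognized stopped set


def classify_state(raw: str) -> Optional[bool]:
    """True = measuring, False = stopped, None = unrecognized (garbled) read."""
    value = raw.strip()
    if value in MEASURING_STATES:
        return True
    if value in STOPPED_STATES:
        return False
    return None


def next_transition(
    was_measuring: Optional[bool], raw: str
) -> Tuple[Optional[bool], Optional[str]]:
    """Return (new cached state, event), where event is STARTED, STOPPED, or None."""
    now = classify_state(raw)
    if now is None:
        # Garbled read: keep the cached state and emit nothing.
        return was_measuring, None
    if was_measuring is None or now == was_measuring:
        return now, None
    return now, ("STARTED" if now else "STOPPED")
```

Returning the previous cached value for an unrecognized read is what keeps a single garbled response from producing a STOPPED/STARTED pair on the next clean read.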
serversdown b954eb8c89 feat: per-unit deactivate and global SLMM standby
Lets an instance stop occupying a device's single TCP connection slot so
another instance (e.g. prod) can take over.

Per-unit:
- POST /api/nl43/{unit_id}/deactivate — poll_enabled=False (persisted) +
  drop the connection (waits up to 10s for in-flight ops via the device
  lock, then discards). Unit stays dormant across restarts.
- POST /api/nl43/{unit_id}/activate — re-enable polling.

Global standby:
- POST /api/nl43/_system/standby — poller idles and releases ALL
  connections; the loop keeps re-releasing so the instance holds no slots.
- POST /api/nl43/_system/resume — resume polling.
- GET  /api/nl43/_system/status — active vs standby + active_connections.
- SLMM_POLLING_ENABLED=false starts an instance in standby (persistent
  way to keep a dev box from latching onto a prod-owned device).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 22:45:52 +00:00
serversdown 0793e7df01 feat: add per-device disconnect endpoint
POST /api/nl43/{unit_id}/disconnect cleanly closes (TCP FIN + wait_closed)
and drops the pooled connection for a single device, freeing the NL43's
one connection slot. Previously only /_connections/flush existed, which
tears down every device at once.

Idempotent; no-op if nothing is cached. Releases the idle pooled
connection only — an active DRD stream/command has the socket checked out
of the pool, so close the stream WebSocket to end a live stream.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 22:40:56 +00:00
serversdown 51dd6b682d feat: surface LN1/LN2 (L1/L10) percentiles through SLMM
Completes the SLMM side of the L1/L10 live-display contract. The NL-43's
DOD response carries percentile slots LN1-LN5 (channel 1, parts[5]/[6]);
parse the first two and expose them as ln1/ln2 end to end:

- NL43Snapshot dataclass: ln1/ln2 fields
- NL43Status model: ln1/ln2 columns (+ migrate_add_ln_percentiles.py)
- DOD parser: snap.ln1=parts[5], snap.ln2=parts[6]
- persist_snapshot writes them
- all /status data dicts, StatusPayload, and the DRD stream payload emit
  ln1/ln2 (null on the DRD stream itself, which doesn't carry percentiles)

Labels: device LN1 defaults to L5, not L1 — Terra-View defaults the label
to L1/L10, so the device's Ln1/Ln2 slots must be set to 1%/10% for the
labels to be accurate (dynamic label emission is a follow-up).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 22:01:31 +00:00
serversdown a7983d2958 fix: correct DOD field parsing and stop measurement-time resets
Two device-data bugs surfaced while scoping the live-feed work:

1. DOD parser misalignment. DOD's response has no leading counter and
   includes LE + LN1-LN5, but the parser reused the DRD field map
   (parts[0]=counter). That shifted everything: Lp was stored as the
   counter, Leq as Lp, LE as Leq, and LN1 as Lpeak (visible because
   "Lpeak" came out below Lmax, which is impossible). Parse DOD with its
   own map: Lp=0, Leq=1, Lmax=3, Lmin=4, Lpeak=10 (channel 1 = main).

2. measurement_start_time reset on every live-stream open/close. The DOD
   path tags state "Start"; the DRD stream path tags "Measure". The
   transition detector treated only "Start" as measuring, so opening the
   stream ("Start"->"Measure") read as a stop (cleared start time) and
   closing it ("Measure"->"Start") read as a start (reset to now). Every
   viewer reset the elapsed measurement time. Treat {"Start","Measure"}
   both as measuring.

LN1/LN2 (L1/L10) parsing + model/serialization is the next step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 21:53:00 +00:00
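The DOD field map from a7983d2958, together with the `ln1`/`ln2` slots added in 51dd6b682d, can be illustrated with a small parser sketch. The indices are the ones quoted in the two commit messages; the comma-separated layout, the `parse_dod` function, and the sample line in the usage note are assumptions made for illustration and are not real device output.

```python
from typing import Dict, Optional

# Index of each value in the DOD response (channel 1), per the commit messages.
DOD_FIELDS = {
    "lp": 0,
    "leq": 1,
    "lmax": 3,
    "lmin": 4,
    "ln1": 5,
    "ln2": 6,
    "lpeak": 10,
}


def parse_dod(line: str) -> Dict[str, Optional[float]]:
    """Hypothetical parser: map DOD fields by position, None when missing or bad."""
    parts = [p.strip() for p in line.split(",")]
    out: Dict[str, Optional[float]] = {}
    for name, idx in DOD_FIELDS.items():
        try:
            out[name] = float(parts[idx])
        except (IndexError, ValueError):
            out[name] = None
    return out
```

For a made-up line such as `55.1,60.2,80.0,72.3,41.0,70.5,63.4,58.0,52.1,47.9,95.6`, this yields `lmax` 72.3 and `lpeak` 95.6, so the peak sits above the maximum, the sanity check that exposed the original misalignment.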
serversdown d6dd2e736b Merge pull request 'fix: improve connection pool idle and max age checks to allow disabling' (#3) from dev-persistent into main
Reviewed-on: #3
2026-06-08 16:56:33 -04:00
serversdown af86cf713e fix: reuse pooled TCP connection for DRD streaming
stream_drd() discarded the pooled connection and forced a fresh connect.
The NL43 allows only one TCP connection at a time; over a cellular link
the device does not free its single slot fast enough for an immediate
reconnect, so the fresh connect times out — the live DRD stream fails
while start/stop commands (which reuse the warm pooled socket) keep
working. This surfaced once the persistent connection pool was enabled
(TCP_PERSISTENT_ENABLED=true).

Stream over the already-open pooled connection via acquire() instead of
discard()+_open_connection(), and release() it back to the pool on exit
(after sending SUB to stop the stream) so commands keep reusing the same
single socket. The per-device lock is held for the whole streaming
session, so the poller can't touch the socket concurrently.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 19:00:35 +00:00
serversdown e3f9ca7f5b fix: use request-first TemplateResponse signature
Modern Starlette requires `request` as the first positional arg to
TemplateResponse. The old `TemplateResponse(name, context)` form caused
the context dict to be passed as the template name, which Jinja2 then
tried to use as a cache key -> TypeError: unhashable type: 'dict' (500
on GET / and /roster).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-08 17:59:39 +00:00
serversdown 450509d210 stop tracking dev runtime data 2026-03-12 22:46:37 +00:00
serversdown fefa9eace8 chore: gitignore clean up 2026-03-12 21:34:14 +00:00
serversdown 98a8d357e5 chore: data-dev folder added to gitignore 2026-03-12 21:33:43 +00:00
claude 0a7422eceb Merge branch 'dev-persistent' of ssh://10.0.0.2:2222/serversdown/slmm into dev-persistent 2026-03-12 20:26:56 +00:00
claude 996b993cb9 chore: gitignore dev data 2026-03-12 20:26:53 +00:00
claude 01337696b3 feat: add connection pool status logging every 15 minutes 2026-02-19 15:09:50 +00:00
claude a302fd15d4 fix: change debug logs to info level for connection pool events 2026-02-19 06:04:34 +00:00
claude af5ecc1a92 fix: improve connection pool idle and max age checks to allow disabling 2026-02-19 01:25:01 +00:00
serversdown ad1a40e0aa Merge pull request 'v0.3.0, persistent polling update. Persistent TCP connection pool with all features Connection pool diagnostics (API + UI) All 6 new environment variables Changes to health check, diagnostics, and DRD streaming Technical architecture details and cellular' (#2) from dev-persistent into main
Reviewed-on: #2
2026-02-16 21:57:37 -05:00
claude b62e84f8b3 v0.3.0, persistent polling update. 2026-02-17 02:56:11 +00:00
claude a5f8d1b2c7 Persistent polling interval increased. Healthcheck now uses poll instead of separate handshakes. 2026-02-17 02:41:09 +00:00
claude a1a80bbb4d add: new persistent connection approach, env variables for TCP keepalive and persistence, added connection pool class. 2026-02-16 04:25:51 +00:00
claude 005e0091fe fix: delay added to ensure TCP commands don't talk over each other 2026-02-16 02:42:41 +00:00
claude e6ac80df6c chore: add pcap files to gitignore 2026-02-10 21:12:19 +00:00
claude 7070b948a8 add: stress test script for diagnosing TCP connection issues.
chore: clean up .gitignore
2026-02-10 07:07:34 +00:00
claude 3b6e9ad3f0 fix: time added to FTP enable step to prevent commands getting messed up 2026-02-06 17:37:10 +00:00
claude eb0cbcc077 fix: 24hr restart schedule enhanced.
Step 0: Pause polling
Step 1: Stop measurement → wait 10s
Step 2: Disable FTP → wait 10s
Step 3: Enable FTP → wait 10s
Step 4: Download data
Step 5: Wait 30s for device to settle
Step 6: Start new measurement
Step 7: Re-enable polling
2026-01-31 05:15:00 +00:00
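The restart sequence listed in eb0cbcc077 could be expressed as a data-driven loop along these lines. The step names, the `run_restart` function, and the injected `do` and `sleep` callables are all hypothetical; only the order and the wait times come from the commit message.

```python
import asyncio
from typing import Awaitable, Callable, List, Tuple

# (action name, seconds to wait after the action), following Steps 0-7.
RESTART_STEPS: List[Tuple[str, float]] = [
    ("pause_polling", 0),
    ("stop_measurement", 10),
    ("disable_ftp", 10),
    ("enable_ftp", 10),
    ("download_data", 30),      # Step 5: let the device settle after the download
    ("start_measurement", 0),
    ("resume_polling", 0),
]


async def run_restart(
    do: Callable[[str], Awaitable[None]],
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> None:
    """Run each step in order, pausing between steps so commands do not overlap."""
    for name, wait_s in RESTART_STEPS:
        await do(name)
        if wait_s:
            await sleep(wait_s)
```

Injecting `sleep` keeps the sequence testable without waiting the full 60 seconds of settle time.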
claude cc0a5bdf84 chore cleanup 2026-01-29 22:44:20 +00:00
claude bf5f222511 Add:
- db cache dump on diagnostics request.
- individual device logs, db and files.
- Device logs API endpoints and diagnostics UI.

Fix:
- slmm standalone now uses local TZ (was UTC only before)
- fixed measurement start time logic.
2026-01-29 18:50:47 +00:00
claude eb39a9d1d0 add: device communication lock. To send a TCP command, SLMM must now acquire a per-device lock, which prevents flooding the unit.
fixed: background poller intervals increased.
2026-01-29 07:54:49 +00:00
claude 67d63b4173 Merge branch 'main' of ssh://10.0.0.2:2222/serversdown/slmm 2026-01-23 08:29:27 +00:00
claude 25cf9528d0 docs: update to 0.2.1 2026-01-23 08:26:23 +00:00
serversdown 738ad7878e doc update 2026-01-22 15:30:06 -05:00
claude 152377d608 feat: Terra-View scheduler implementation added. start_cycle and stop_cycle functions added. 2026-01-22 20:25:47 +00:00
claude 4868381053 Enhance FTP logging with detailed phases for connection, authentication, and data transfer 2026-01-21 08:05:38 +00:00
claude b4bbfd2b01 chore: fixed api.md to confirm FTP/TCP interactions are working. 2026-01-17 08:13:19 +00:00
claude 82651f71b5 Add roster management interface and related API endpoints
- Implemented a new `/roster` endpoint to retrieve and manage device configurations.
- Added HTML template for the roster page with a table to display device status and actions.
- Introduced functionality to add, edit, and delete devices via the roster interface.
- Enhanced `ConfigPayload` model to include polling options.
- Updated the main application to serve the new roster page and link to it from the index.
- Added validation for polling interval in the configuration payload.
- Created detailed documentation for the roster management features and API endpoints.
2026-01-17 08:00:05 +00:00
claude 182920809d chore: docs and scripts organized. clutter cleared. 2026-01-16 19:06:38 +00:00
claude 2a3589ca5c Add endpoint to delete device configuration and associated status data 2026-01-16 07:39:26 +00:00
claude d43ef7427f v0.2.0: async status polling added. 2026-01-16 06:24:13 +00:00
38 changed files with 8096 additions and 455 deletions
+5
@@ -1,5 +1,8 @@
/manuals/
/data/
/data-dev/
/SLM-stress-test/stress_test_logs/
/SLM-stress-test/tcpdump-runs/
# Python cache
__pycache__/
@@ -12,3 +15,5 @@ __pycache__/
*.egg-info/
dist/
build/
*.pcap
+251
@@ -0,0 +1,251 @@
# Changelog
All notable changes to SLMM (Sound Level Meter Manager) will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
## [0.4.0] - 2026-06-22
### Added
#### Live Monitor (fan-out feed)
- **Per-device fan-out monitor** - one shared, cached live feed per device. Multiple clients (dashboards, portal, charts) subscribe to the same stream instead of each fighting for the NL-43's single TCP connection: one poller reads the device, all subscribers get the same frames.
- **WebSocket monitor** - `WS /api/nl43/{unit_id}/monitor` delivers an instant first frame from cache, then live updates.
- **Monitor control** - `POST /api/nl43/{unit_id}/monitor/{start|stop}`, `GET /api/nl43/_monitor/status`. A persistent `monitor_enabled` flag auto-starts the keepalive on boot.
- **Adaptive polling** - poll rate adapts to demand; unreachable devices back off; a device-offline alert fires when a monitored unit drops.
- **De-duplication** - the background poller skips units already covered by an active monitor (no double-polling); a heartbeat keeps the feed warm.
- **Lower latency** - the monitor caches run state, roughly halving live-feed latency; fan-out emits an instant first frame + offline status to new clients.
#### Alert Engine
- **Threshold rules** - per-device alert rules (metric + threshold + cooldown) with full CRUD: `POST/GET/PUT/DELETE /api/nl43/{unit_id}/alerts/rules[/{rule_id}]`.
- **Events + state machine** - onset/clear tracking via `GET /api/nl43/{unit_id}/alerts/events`; acknowledge with `POST .../events/{event_id}/ack`. A `cooldown_s` is enforced between onsets.
- **24/7 evaluation** - enabled rules pin the monitor on, so rules evaluate continuously even with no UI client connected.
- **Resilience** - editing or deleting a rule resets its state and closes any open event; device-offline events are raised when a monitored unit goes unreachable.
#### Data & History
- **Live-chart backfill** - a downsampled DOD trail is persisted to a new `nl43_readings` table, exposed via `GET /api/nl43/{unit_id}/history` so charts can backfill recent history on load.
- **LN1/LN2 percentiles** - L1/L10 (configurable percentiles) surfaced through SLMM in the status and live-feed payloads.
- **measurement_start_time** included in the cached `/status` response.
#### Device control
- **Per-device disconnect** - `POST /api/nl43/{unit_id}/disconnect` drops a device's pooled connection.
- **Deactivate / standby** - `POST /api/nl43/{unit_id}/deactivate` and global `POST /api/nl43/_system/standby` to quiesce polling/monitoring.
### Changed
- **DRD streaming reuses the pooled connection** rather than opening a separate socket, avoiding contention with the persistent pool on a single-connection device.
- **Connection pool** - idle-TTL / max-age checks can now be disabled; pool status is logged periodically.
### Fixed
- **Measurement-start confirmation** - `/start` now recognizes the device's `Start` state. It previously waited for `Measure`, which never matched, so the start cycle ran the full retry loop and Terra-View's proxy timed out with a misleading "Unknown error" even though the device had started.
- **Garbled reads** - corrupted measurement-state reads that produced phantom STOPPED/STARTED transitions are now ignored.
- **DOD parsing** - corrected field parsing and stopped spurious measurement-time resets.
- **Monitor WebSocket** - quieted a send-after-close race on client disconnect.
### Database
- **New tables** (auto-created on startup via `Base.metadata.create_all`): `alert_rules`, `alert_events`, `nl43_readings`.
- **Migrations for existing tables** (run once per database): `migrate_add_ln_percentiles.py` (LN1/LN2 on `nl43_status`), `migrate_add_monitor_enabled.py` (`monitor_enabled` on `nl43_config`).
### Notes
- Pairs with the matching Terra-View `dev` build, which reads SLMM's `/monitor` fan-out feed for live SLM dashboards (L1/L10 lines, live-chart backfill). Ship the two together.
---
## [0.3.0] - 2026-02-17
### Added
#### Persistent TCP Connection Pool
- **Connection reuse** - TCP connections are cached per device and reused across commands, eliminating repeated TCP handshakes over cellular modems
- **OS-level TCP keepalive** - Configurable keepalive probes keep cellular NAT tables alive and detect dead connections early (default: probe after 15s idle, every 10s, 3 failures = dead)
- **Transparent retry** - If a cached connection goes stale, the system automatically retries once with a fresh connection, so a stale socket alone is not surfaced to the caller
- **Stale connection detection** - Multi-layer detection via idle TTL, max age, transport state, and reader EOF checks
- **Background cleanup** - Periodic task (every 30s) evicts expired connections from the pool
- **Master switch** - Set `TCP_PERSISTENT_ENABLED=false` to revert to per-request connection behavior
#### Connection Pool Diagnostics
- `GET /api/nl43/_connections/status` - View pool configuration, active connections, age/idle times, and keepalive settings
- `POST /api/nl43/_connections/flush` - Force-close all cached connections (useful for debugging)
- **Connections tab on roster page** - Live UI showing pool config, active connections with age/idle/alive status, auto-refreshes every 5s, and flush button
#### Environment Variables
- `TCP_PERSISTENT_ENABLED` (default: `true`) - Master switch for persistent connections
- `TCP_IDLE_TTL` (default: `300`) - Close idle connections after N seconds
- `TCP_MAX_AGE` (default: `1800`) - Force reconnect after N seconds
- `TCP_KEEPALIVE_IDLE` (default: `15`) - Seconds idle before keepalive probes start
- `TCP_KEEPALIVE_INTERVAL` (default: `10`) - Seconds between keepalive probes
- `TCP_KEEPALIVE_COUNT` (default: `3`) - Failed probes before declaring connection dead
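
As a rough sketch of how the three keepalive variables above might map onto socket options, assuming a plain Python `socket` (the helper names `keepalive_settings`, `apply_keepalive`, and `approx_detect_s` are hypothetical and not SLMM's API):

```python
import os
import socket
from typing import Dict, Mapping


def keepalive_settings(env: Mapping[str, str] = os.environ) -> Dict[str, int]:
    """Read the keepalive variables, falling back to the documented defaults."""
    return {
        "idle": int(env.get("TCP_KEEPALIVE_IDLE", "15")),
        "interval": int(env.get("TCP_KEEPALIVE_INTERVAL", "10")),
        "count": int(env.get("TCP_KEEPALIVE_COUNT", "3")),
    }


def apply_keepalive(sock: socket.socket, settings: Mapping[str, int]) -> None:
    """Enable OS-level keepalive; per-socket tuning is applied where available."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names exist on Linux; guard them for other platforms.
    for attr, key in (
        ("TCP_KEEPIDLE", "idle"),
        ("TCP_KEEPINTVL", "interval"),
        ("TCP_KEEPCNT", "count"),
    ):
        opt = getattr(socket, attr, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, settings[key])


def approx_detect_s(settings: Mapping[str, int]) -> int:
    """Rough time to declare an idle connection dead: idle + interval * count."""
    return settings["idle"] + settings["interval"] * settings["count"]
```

With the defaults, a silent peer would be detected after roughly 45 seconds of idle time; the exact figure depends on the operating system's keepalive behavior.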
### Changed
- **Health check endpoint** (`/health/devices`) - Now uses connection pool instead of opening throwaway TCP connections; checks for existing live connections first (zero-cost), only opens new connection through pool if needed
- **Diagnostics endpoint** - Removed the separate port 443 modem check, which cost an extra handshake; the TCP reachability test now uses the connection pool
- **DRD streaming** - Streaming connections now get TCP keepalive options set; cached connections are evicted before opening dedicated streaming socket
- **Default timeouts tuned for cellular** - Idle TTL raised to 300s (5 min), max age raised to 1800s (30 min) to survive typical polling intervals over cellular links
### Technical Details
#### Architecture
- `ConnectionPool` class in `services.py` manages a single cached connection per device key (NL-43 only supports one TCP connection at a time)
- Uses existing per-device asyncio locks and rate limiting — no changes to concurrency model
- Pool is a module-level singleton initialized from environment variables at import time
- Lifecycle managed via FastAPI lifespan: cleanup task starts on startup, all connections closed on shutdown
- `_send_command_unlocked()` refactored to use acquire/release/discard pattern with single-retry fallback
- Command parsing extracted to `_execute_command()` method for reuse between primary and retry paths
#### Cellular Modem Optimizations
- Keepalive probes at 15s prevent cellular NAT tables from expiring (typically 30-60s timeout)
- 300s idle TTL ensures connections survive between polling cycles (default 60s interval)
- 1800s max age allows a single socket to serve ~30 minutes of polling before forced reconnect
- Health checks and diagnostics produce zero additional TCP handshakes when a pooled connection exists
- Stale `$` prompt bytes drained from idle connections before command reuse
### Breaking Changes
None. This release is fully backward-compatible with v0.2.x. Set `TCP_PERSISTENT_ENABLED=false` for identical behavior to previous versions.
---
## [0.2.1] - 2026-01-23
### Added
- **Roster management**: UI and API endpoints for managing device rosters.
- **Delete config endpoint**: Remove device configuration alongside cached status data.
- **Scheduler hooks**: `start_cycle` and `stop_cycle` helpers for Terra-View scheduling integration.
### Changed
- **FTP logging**: Connection, authentication, and transfer phases now log explicitly.
- **Documentation**: Reorganized docs/scripts and updated API notes for FTP/TCP verification.
## [0.2.0] - 2026-01-15
### Added
#### Background Polling System
- **Continuous automatic device polling** - Background service that continuously polls configured devices
- **Per-device configurable intervals** - Each device can have custom polling interval (10-3600 seconds, default 60)
- **Automatic offline detection** - Devices automatically marked unreachable after 3 consecutive failures
- **Reachability tracking** - Database fields track device health with failure counters and error messages
- **Dynamic sleep scheduling** - Polling service adjusts sleep intervals based on device configurations
- **Graceful lifecycle management** - Background poller starts on application startup and stops cleanly on shutdown
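
The offline-detection rule can be sketched as a small state update. This is a simplified illustration under stated assumptions: `ReachState` and `record_poll` are hypothetical names that mirror the `is_reachable`, `consecutive_failures`, and `last_error` fields, and leaving `last_error` in place after a successful poll is a choice made here, not confirmed behavior.

```python
from dataclasses import dataclass
from typing import Optional

MAX_CONSECUTIVE_FAILURES = 3   # threshold from the changelog entry above
MAX_ERROR_CHARS = 500          # last_error is truncated to 500 chars


@dataclass
class ReachState:
    is_reachable: bool = True
    consecutive_failures: int = 0
    last_error: Optional[str] = None


def record_poll(state: ReachState, error: Optional[str] = None) -> ReachState:
    """Update reachability after one poll; error=None means the poll succeeded."""
    if error is None:
        state.consecutive_failures = 0
        state.is_reachable = True
        return state
    state.consecutive_failures += 1
    state.last_error = error[:MAX_ERROR_CHARS]
    if state.consecutive_failures >= MAX_CONSECUTIVE_FAILURES:
        state.is_reachable = False
    return state
```

A single successful poll resets the counter, so a device only goes offline after three failures in a row.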
#### New API Endpoints
- `GET /api/nl43/{unit_id}/polling/config` - Get device polling configuration
- `PUT /api/nl43/{unit_id}/polling/config` - Update polling interval and enable/disable per-device polling
- `GET /api/nl43/_polling/status` - Get global polling status for all devices with reachability info
#### Database Schema Changes
- **NL43Config table**:
  - `poll_interval_seconds` (Integer, default 60) - Polling interval in seconds
  - `poll_enabled` (Boolean, default true) - Enable/disable background polling per device
- **NL43Status table**:
  - `is_reachable` (Boolean, default true) - Current device reachability status
  - `consecutive_failures` (Integer, default 0) - Count of consecutive poll failures
  - `last_poll_attempt` (DateTime) - Last time background poller attempted to poll
  - `last_success` (DateTime) - Last successful poll timestamp
  - `last_error` (Text) - Last error message (truncated to 500 chars)
#### New Files
- `app/background_poller.py` - Background polling service implementation
- `migrate_add_polling_fields.py` - Database migration script for v0.2.0 schema changes
- `test_polling.sh` - Comprehensive test script for polling functionality
- `CHANGELOG.md` - This changelog file
### Changed
- **Enhanced status endpoint** - `GET /api/nl43/{unit_id}/status` now includes polling-related fields (is_reachable, consecutive_failures, last_poll_attempt, last_success, last_error)
- **Application startup** - Added lifespan context manager in `app/main.py` to manage background poller lifecycle
- **Performance improvement** - Terra-View requests now return cached data instantly (<100ms) instead of waiting for device queries (1-2 seconds)
### Technical Details
#### Architecture
- Background poller runs as async task using `asyncio.create_task()`
- Uses existing `NL43Client` and `persist_snapshot()` functions - no code duplication
- Respects existing 1-second rate limiting per device
- Efficient resource usage - skips work when no devices configured
- WebSocket streaming remains unaffected - separate real-time data path
#### Default Behavior
- Existing devices automatically get 60-second polling interval
- Existing status records default to `is_reachable=true`
- Migration is additive-only - no data loss
- Polling can be disabled per-device via `poll_enabled=false`
#### Recommended Intervals
- Critical monitoring: 30 seconds
- Normal monitoring: 60 seconds (default)
- Battery conservation: 300 seconds (5 minutes)
- Development/testing: 10 seconds (minimum allowed)
### Migration Notes
To upgrade from v0.1.x to v0.2.0:
1. **Stop the service** (if running):
```bash
docker compose down slmm
# OR
# Stop your uvicorn process
```
2. **Update code**:
```bash
git pull
# OR copy new files
```
3. **Run migration**:
```bash
cd slmm
python3 migrate_add_polling_fields.py
```
4. **Restart service**:
```bash
docker compose up -d --build slmm
# OR
uvicorn app.main:app --host 0.0.0.0 --port 8100
```
5. **Verify polling is active**:
```bash
curl http://localhost:8100/api/nl43/_polling/status | jq '.'
```
You should see `"poller_running": true` and all configured devices listed.
### Breaking Changes
None. This release is fully backward-compatible with v0.1.x. All existing endpoints and functionality remain unchanged.
---
## [0.1.0] - 2025-12-XX
### Added
- Initial release
- REST API for NL43/NL53 sound level meter control
- TCP command protocol implementation
- FTP file download support
- WebSocket streaming for real-time data (DRD)
- Device configuration management
- Measurement control (start, stop, pause, resume, reset, store)
- Device information endpoints (battery, clock, results)
- Measurement settings management (frequency/time weighting)
- Sleep mode control
- Rate limiting (1-second minimum between commands)
- SQLite database for device configs and status cache
- Health check endpoints
- Comprehensive API documentation
- NL43 protocol documentation
### Database Schema (v0.1.0)
- **NL43Config table** - Device connection configuration
- **NL43Status table** - Measurement snapshot cache
---
## Version History Summary
- **v0.3.0** (2026-02-17) - Persistent TCP connections with keepalive for cellular modem reliability
- **v0.2.1** (2026-01-23) - Roster management, scheduler hooks, FTP logging, doc cleanup
- **v0.2.0** (2026-01-15) - Background Polling System
- **v0.1.0** (2025-12-XX) - Initial Release
+214 -10
@@ -1,15 +1,23 @@
# SLMM - Sound Level Meter Manager
**Version 0.4.0**
Backend API service for controlling and monitoring Rion NL-43/NL-53 Sound Level Meters via TCP and FTP protocols.
## Overview
SLMM is a standalone backend module that provides REST API routing and command translation for NL43/NL53 sound level meters. This service acts as a bridge between the hardware devices and frontend applications, handling all device communication, data persistence, and protocol management.
**Note:** This is a backend-only service. Actual user interfacing is done via customized front ends or the CLI.
## Features
- **Live Monitor (fan-out)**: One shared cached live feed per device — many clients subscribe to the same stream instead of fighting over the meter's single TCP connection
- **Alert Engine**: Per-device threshold rules with onset/clear events, cooldowns, acks, and 24/7 evaluation
- **History & Percentiles**: Downsampled DOD trail + history endpoint for live-chart backfill; LN1/LN2 (L1/L10) percentiles surfaced through the feed
- **Persistent TCP Connections**: Cached per-device connections with OS-level keepalive, tuned for cellular modem reliability
- **Background Polling**: Continuous automatic polling of devices with configurable intervals
- **Offline Detection**: Automatic device reachability tracking with failure counters
- **Device Management**: Configure and manage multiple NL43/NL53 devices
- **Real-time Monitoring**: Stream live measurement data via WebSocket
- **Measurement Control**: Start, stop, pause, resume, and reset measurements
@@ -18,22 +26,72 @@ SLMM is a standalone backend module that provides REST API routing and command t
- **Device Configuration**: Manage frequency/time weighting, clock sync, and more
- **Rate Limiting**: Automatic 1-second delay enforcement between device commands
- **Persistent Storage**: SQLite database for device configs and measurement cache
- **Connection Diagnostics**: Live UI and API endpoints for monitoring TCP connection pool status
## Architecture
```
┌─────────────────┐         ┌──────────────────────────────┐         ┌─────────────────┐
│  Terra-View UI  │◄───────►│           SLMM API           │◄───────►│    NL43/NL53    │
│   (Frontend)    │  HTTP   │  • REST Endpoints            │   TCP   │  Sound Meters   │
└─────────────────┘         │  • WebSocket Streaming       │  (kept  │  (via cellular  │
                            │  • Background Poller         │  alive) │     modem)      │
                            │  • Connection Pool (v0.3)    │         └─────────────────┘
                            └──────────────────────────────┘
                                    ┌──────────────┐
                                    │  SQLite DB   │
                                    │  • Config    │
                                    │  • Status    │
                                    └──────────────┘
```
### Live Monitor — Fan-Out Feed (v0.4.0)
The NL-43 allows only one TCP control connection at a time, so multiple clients
polling the same device directly would contend for it. The monitor solves this
with a single shared, cached feed per device:
- **One reader, many subscribers**: a single poller reads the device; every
WebSocket subscriber (`WS /api/nl43/{unit_id}/monitor`) receives the same
frames — an instant first frame from cache, then live updates.
- **Persistent + auto-start**: a `monitor_enabled` flag keeps the feed running
and auto-starts it on boot. Enabled alert rules pin the monitor on for 24/7
evaluation even with no UI connected.
- **Adaptive & deduplicated**: poll rate adapts to demand, unreachable devices
back off, and the background poller skips units already covered by a monitor.
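
The fan-out idea (one reader, many subscribers, instant first frame from cache) can be sketched with plain asyncio queues. This is a minimal illustration, not the monitor implementation: `FanOutFeed` and its methods are hypothetical names, and dropping older frames for slow subscribers is an assumption made to keep the sketch small.

```python
import asyncio


class FanOutFeed:
    """Hypothetical shared feed: the single device reader publishes, clients subscribe."""

    def __init__(self):
        self._latest = None          # most recent frame, used as the cached first frame
        self._subscribers = set()

    def subscribe(self):
        """Register a subscriber; it immediately receives the cached frame, if any."""
        queue = asyncio.Queue(maxsize=1)
        if self._latest is not None:
            queue.put_nowait(self._latest)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._subscribers.discard(queue)

    def publish(self, frame):
        """Called by the one poller that owns the device's TCP connection."""
        self._latest = frame
        for queue in self._subscribers:
            if queue.full():
                queue.get_nowait()   # drop the stale frame so slow clients see the newest
            queue.put_nowait(frame)


async def demo():
    feed = FanOutFeed()
    early = feed.subscribe()
    feed.publish({"lp": 50.1})
    late = feed.subscribe()          # joins after the first frame was published
    first_for_late = await late.get()
    feed.publish({"lp": 51.0})
    feed.publish({"lp": 52.0})
    return first_for_late, await early.get(), await late.get()
```

Each WebSocket handler would own one queue and forward whatever it receives, so the device still sees exactly one reader regardless of how many clients are connected.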
### Alert Engine (v0.4.0)
Per-device threshold alerting evaluated against the live feed:
- **Rules**: metric + threshold + `cooldown_s`, full CRUD per device
- **Events**: onset/clear state machine, acknowledgement, and a device-offline
alert when a monitored unit drops
- **Robust**: editing/deleting a rule resets its state and closes open events
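
A minimal sketch of the onset/clear state machine, assuming an "above" comparison only and measuring the cooldown from the previous onset (the schema describes `cooldown_s` as the minimum seconds between onsets). `Rule`, `RuleState`, and `evaluate` are hypothetical names; `duration_s`, schedules, and notifications are omitted.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    threshold_db: float
    clear_margin_db: float = 0.0   # hysteresis below the threshold before clearing
    cooldown_s: float = 0.0        # minimum seconds between onsets


@dataclass
class RuleState:
    active: bool = False
    last_onset_at: Optional[float] = None


def evaluate(rule: Rule, state: RuleState, value: float, now: float) -> Optional[str]:
    """Feed one sample; return 'onset', 'clear', or None if nothing changed."""
    if not state.active:
        if value <= rule.threshold_db:
            return None
        in_cooldown = (
            state.last_onset_at is not None
            and (now - state.last_onset_at) < rule.cooldown_s
        )
        if in_cooldown:
            return None            # breach suppressed: too soon after the last onset
        state.active = True
        state.last_onset_at = now
        return "onset"
    if value < rule.threshold_db - rule.clear_margin_db:
        state.active = False
        return "clear"
    return None
```

Resetting a rule on edit or delete then amounts to discarding its `RuleState` and closing any event that is still open for it.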
### Persistent TCP Connection Pool (v0.3.0)
SLMM maintains persistent TCP connections to devices with OS-level keepalive, designed for reliable operation over cellular modems:
- **Connection Reuse**: One cached TCP socket per device, reused across all commands (no repeated handshakes)
- **TCP Keepalive**: Probes keep cellular NAT tables alive and detect dead connections early
- **Transparent Retry**: Stale cached connections automatically retry with a fresh socket
- **Configurable**: Idle TTL (300s), max age (1800s), and keepalive timing via environment variables
- **Diagnostics**: Live UI on the roster page and API endpoints for monitoring pool status
### Background Polling (v0.2.0)
Background polling service continuously queries devices and updates the status cache:
- **Automatic Updates**: Devices are polled at configurable intervals (10-3600 seconds)
- **Offline Detection**: Devices marked unreachable after 3 consecutive failures
- **Per-Device Configuration**: Each device can have a custom polling interval
- **Resource Efficient**: Dynamic sleep intervals and smart scheduling
Status requests return cached data instantly (<100ms) instead of waiting for device queries (1-2 seconds).
## Quick Start
### Prerequisites
@@ -77,9 +135,18 @@ Once running, visit:
### Environment Variables
**Server:**
- `PORT`: Server port (default: 8100)
- `CORS_ORIGINS`: Comma-separated list of allowed origins (default: "*")
**TCP Connection Pool:**
- `TCP_PERSISTENT_ENABLED`: Enable persistent connections (default: "true")
- `TCP_IDLE_TTL`: Close idle connections after N seconds (default: 300)
- `TCP_MAX_AGE`: Force reconnect after N seconds (default: 1800)
- `TCP_KEEPALIVE_IDLE`: Seconds idle before keepalive probes (default: 15)
- `TCP_KEEPALIVE_INTERVAL`: Seconds between keepalive probes (default: 10)
- `TCP_KEEPALIVE_COUNT`: Failed probes before declaring dead (default: 3)
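
For illustration, the pool variables above could be read into a typed settings object along these lines. `PoolSettings` and `load_pool_settings` are hypothetical names, and the set of strings accepted as "true" is an assumption; only the variable names and defaults come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class PoolSettings:
    enabled: bool = True
    idle_ttl: float = 300.0
    max_age: float = 1800.0
    keepalive_idle: int = 15
    keepalive_interval: int = 10
    keepalive_count: int = 3


def load_pool_settings(env: Optional[Mapping[str, str]] = None) -> PoolSettings:
    """Build settings from environment variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    defaults = PoolSettings()
    enabled_raw = env.get("TCP_PERSISTENT_ENABLED", "true")
    return PoolSettings(
        # Assumption: these spellings count as enabled; anything else disables pooling.
        enabled=enabled_raw.strip().lower() in ("1", "true", "yes", "on"),
        idle_ttl=float(env.get("TCP_IDLE_TTL", defaults.idle_ttl)),
        max_age=float(env.get("TCP_MAX_AGE", defaults.max_age)),
        keepalive_idle=int(env.get("TCP_KEEPALIVE_IDLE", defaults.keepalive_idle)),
        keepalive_interval=int(env.get("TCP_KEEPALIVE_INTERVAL", defaults.keepalive_interval)),
        keepalive_count=int(env.get("TCP_KEEPALIVE_COUNT", defaults.keepalive_count)),
    )
```

Passing a plain dict instead of `os.environ` makes the parsing easy to exercise in tests.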
### Database
The SQLite database is automatically created at [data/slmm.db](data/slmm.db) on first run.
@@ -103,10 +170,49 @@ Logs are written to:
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/nl43/{unit_id}/status` | Get cached measurement snapshot (updated by background poller) |
| GET | `/api/nl43/{unit_id}/live` | Request fresh DOD data from device (bypasses cache) |
| GET | `/api/nl43/{unit_id}/history` | Downsampled DOD trail for live-chart backfill |
| WS | `/api/nl43/{unit_id}/stream` | WebSocket stream for real-time DRD data |
### Live Monitor (fan-out feed)
| Method | Endpoint | Description |
|--------|----------|-------------|
| WS | `/api/nl43/{unit_id}/monitor` | Subscribe to the shared cached live feed (instant first frame) |
| POST | `/api/nl43/{unit_id}/monitor/start` | Start the device's monitor feed |
| POST | `/api/nl43/{unit_id}/monitor/stop` | Stop the device's monitor feed |
| GET | `/api/nl43/_monitor/status` | Global monitor status across devices |
| POST | `/api/nl43/{unit_id}/disconnect` | Drop the device's pooled TCP connection |
| POST | `/api/nl43/{unit_id}/deactivate` | Quiesce polling/monitoring for one device |
| POST | `/api/nl43/_system/standby` | Global standby — quiesce all polling/monitoring |
### Alerts
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/nl43/{unit_id}/alerts/rules` | List alert rules for a device |
| POST | `/api/nl43/{unit_id}/alerts/rules` | Create an alert rule (metric, threshold, cooldown) |
| PUT | `/api/nl43/{unit_id}/alerts/rules/{rule_id}` | Update a rule (resets its state, closes open events) |
| DELETE | `/api/nl43/{unit_id}/alerts/rules/{rule_id}` | Delete a rule |
| GET | `/api/nl43/{unit_id}/alerts/events` | List alert events (onset/clear) |
| POST | `/api/nl43/{unit_id}/alerts/events/{event_id}/ack` | Acknowledge an event |
### Background Polling
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/nl43/{unit_id}/polling/config` | Get device polling configuration |
| PUT | `/api/nl43/{unit_id}/polling/config` | Update polling interval and enable/disable polling |
| GET | `/api/nl43/_polling/status` | Get global polling status for all devices |
### Connection Pool
| Method | Endpoint | Description |
|--------|----------|-------------|
| GET | `/api/nl43/_connections/status` | Get pool config, active connections, age/idle times |
| POST | `/api/nl43/_connections/flush` | Force-close all cached TCP connections |
### Measurement Control
| Method | Endpoint | Description |
@@ -167,6 +273,7 @@ slmm/
│   ├── routers.py  # API route definitions
│   ├── models.py  # SQLAlchemy database models
│   ├── services.py  # NL43Client and business logic
│   ├── background_poller.py  # Background polling service ⭐ NEW
│   └── database.py  # Database configuration
├── data/
│   ├── slmm.db  # SQLite database (auto-created)
@@ -175,9 +282,12 @@ slmm/
├── templates/
│   └── index.html  # Simple web interface (optional)
├── manuals/  # Device documentation
├── migrate_add_polling_fields.py  # Database migration for v0.2.0 ⭐ NEW
├── test_polling.sh  # Polling feature test script ⭐ NEW
├── API.md  # Detailed API documentation
├── COMMUNICATION_GUIDE.md  # NL43 protocol documentation
├── NL43_COMMANDS.md  # Command reference
├── CHANGELOG.md  # Version history ⭐ NEW
├── requirements.txt  # Python dependencies
└── README.md  # This file
```
@@ -194,12 +304,16 @@ Stores device connection configuration:
- `ftp_username`: FTP authentication username
- `ftp_password`: FTP authentication password
- `web_enabled`: Enable/disable web interface access
- `poll_interval_seconds`: Polling interval in seconds (10-3600, default: 60) ⭐ NEW
- `poll_enabled`: Enable/disable background polling for this device ⭐ NEW
### NL43Status Table
Caches latest measurement snapshot:
- `unit_id` (PK): Unique device identifier
- `last_seen`: Timestamp of last update
- `measurement_state`: Current state (the device reports `Start` while measuring and `Stop` when stopped)
- `measurement_start_time`: When measurement started (UTC)
- `counter`: Measurement interval counter (1-600)
- `lp`: Instantaneous sound pressure level
- `leq`: Equivalent continuous sound level
- `lmax`: Maximum sound level
@@ -210,11 +324,43 @@ Caches latest measurement snapshot:
- `sd_remaining_mb`: Free SD card space (MB)
- `sd_free_ratio`: SD card free space ratio
- `raw_payload`: Raw device response data
- `is_reachable`: Device reachability status (Boolean)
- `consecutive_failures`: Count of consecutive poll failures
- `last_poll_attempt`: Last time background poller attempted to poll
- `last_success`: Last successful poll timestamp
- `last_error`: Last error message (truncated to 500 chars)
- `ln1` / `ln2`: LN1/LN2 (L1/L10) percentile levels ⭐ v0.4.0
### NL43Readings Table ⭐ v0.4.0
Downsampled DOD trail backing the live-chart history endpoint (one row/minute,
pruned to a retention window — viewing only, not the report source):
- `id` (PK), `unit_id`, `timestamp`
- `lp` / `leq` / `lmax` / `ln1` / `ln2`: cached level samples
### AlertRule Table ⭐ v0.4.0
Per-device threshold alert rules:
- `id` (PK), `unit_id`, `name`, `enabled`
- `metric`, `comparison` (above/below), `threshold_db`, `clear_margin_db` (hysteresis)
- `duration_s` (sustained), `cooldown_s` (min seconds between onsets)
- `channels` / `recipients`, optional `schedule_start`/`schedule_end`/`schedule_days`
### AlertEvent Table ⭐ v0.4.0
Alert onset/clear events for history, inbox, and acknowledgement:
- `id` (PK), `unit_id`, `rule_id`, `rule_name`, `metric`, `threshold_db`
- `onset_at` / `onset_value`, `peak_value`, `clear_at`, `status` (active/cleared)
- `acknowledged_at` / `acknowledged_by`, `notes`
> New tables (`alert_rules`, `alert_events`, `nl43_readings`) auto-create on
> startup. Existing-table columns ship with migrations:
> `migrate_add_ln_percentiles.py`, `migrate_add_monitor_enabled.py`.
## Protocol Details
### TCP Communication
- Uses ASCII command protocol over TCP
- Persistent connections with OS-level keepalive (tuned for cellular modems)
- Connections cached per device and reused across commands
- Transparent retry on stale connections
- Enforces ≥1 second delay between commands to same device
- Two-line response format:
  - Line 1: Result code (R+0000 for success)
@@ -253,11 +399,43 @@ curl -X PUT http://localhost:8100/api/nl43/meter-001/config \
curl -X POST http://localhost:8100/api/nl43/meter-001/start
```
### Get Cached Status (Fast - from background poller)
```bash
curl http://localhost:8100/api/nl43/meter-001/status
```
### Get Live Status (Bypasses cache)
```bash
curl http://localhost:8100/api/nl43/meter-001/live
```
### Configure Background Polling ⭐ NEW
```bash
# Set polling interval to 30 seconds
curl -X PUT http://localhost:8100/api/nl43/meter-001/polling/config \
-H "Content-Type: application/json" \
-d '{
"poll_interval_seconds": 30,
"poll_enabled": true
}'
# Get polling configuration
curl http://localhost:8100/api/nl43/meter-001/polling/config
# Check global polling status
curl http://localhost:8100/api/nl43/_polling/status
```
### Check Connection Pool Status
```bash
curl http://localhost:8100/api/nl43/_connections/status | jq '.'
```
### Flush All Cached Connections
```bash
curl -X POST http://localhost:8100/api/nl43/_connections/flush
```
### Verify Device Settings
```bash
curl http://localhost:8100/api/nl43/meter-001/settings
@@ -326,11 +504,19 @@ See [API.md](API.md) for detailed integration examples.
## Troubleshooting
### Connection Issues
- Check connection pool status: `curl http://localhost:8100/api/nl43/_connections/status`
- Flush stale connections: `curl -X POST http://localhost:8100/api/nl43/_connections/flush`
- Verify device IP address and port in configuration
- Ensure device is on the same network
- Check firewall rules allow TCP/FTP connections
- Verify RX55 network adapter is properly configured on device
### Cellular Modem Issues
- If the modem wedges from too many handshakes, ensure `TCP_PERSISTENT_ENABLED=true` (default)
- Increase `TCP_IDLE_TTL` if connections expire between poll cycles
- Keepalive probes (by default starting after 15s idle, then every 10s) keep NAT mappings alive; adjust `TCP_KEEPALIVE_IDLE` if needed
- Set `TCP_PERSISTENT_ENABLED=false` to disable pooling for debugging
### Rate Limiting
- API automatically enforces 1-second delay between commands
- If experiencing delays, this is normal device behavior
@@ -356,13 +542,31 @@ pytest
### Database Migrations
```bash
# Migrate to v0.2.0 (add background polling fields)
python3 migrate_add_polling_fields.py
# Legacy: Migrate to add FTP credentials
python migrate_add_ftp_credentials.py
# Set FTP credentials for a device
python set_ftp_credentials.py <unit_id> <username> <password>
```
### Testing Background Polling
```bash
# Run comprehensive polling tests
./test_polling.sh [unit_id]
# Test settings endpoint
python3 test_settings_endpoint.py <unit_id>
# Test sleep mode auto-disable
python3 test_sleep_mode_auto_disable.py <unit_id>
```
### Legacy Scripts
Old migration scripts and manual polling tools have been moved to `archive/` for reference. See [archive/README.md](archive/README.md) for details.
## Contributing
This is a standalone module kept separate from the SFM/Terra-View codebase. When contributing:
@@ -0,0 +1,403 @@
# NL-43 + RX55 TCP “Wedge” Investigation (2255 Refusal) — Full Log & Next Steps
**Last updated:** 2026-02-18
**Owner:** Brian / serversdown
**Context:** Terra-View / SLMM / field-deployed Rion NL-43 behind Sierra Wireless RX55
---
## 0) What this document is
This is a **comprehensive, chronological** record of the debugging we did to isolate a failure where the **NL-43's TCP control port (2255) eventually stops accepting connections** (“wedges”), while other services (notably FTP/21) remain reachable.
This is written to be fed back into future troubleshooting, so it intentionally includes the **full reasoning chain, experiments, commands, packet evidence, and conclusions**.
---
## 1) Architecture (as tested)
### Network path
- **Server (SLMM host):** `10.0.0.40`
- **RX55 WAN IP:** `63.45.161.30`
- **RX55 LAN subnet:** `192.168.1.0/24`
- **RX55 LAN gateway:** `192.168.1.1`
- **NL-43 LAN IP:** `192.168.1.10` (confirmed via ARP OUI + ping; see LAN validation)
### RX55 details
- **Sierra Wireless RX55**
- **OS:** 5.2
- **Firmware:** `01.14.24.00`
- **Carrier:** Verizon LTE (Band 66)
### Port forwarding rules (RX55)
- **WAN:2255 → NL-43:2255** (NL-43 TCP control)
- **WAN:21 → NL-43:21** (NL-43 FTP control)
You also experimented with additional forwards:
- **WAN:2253 → NL-43:2255** (test)
- **WAN:2253 → NL-43:2253** (test)
- **WAN:4450 → NL-43:4450** (test)
**Important:** Rule “Input zone / interface” was set to **WAN-NAT**, and Source IP left as **Any IPv4**. This is correct for inbound port-forward behavior on Sierra OS 5.x.
---
## 2) Original problem statement (the “wedge”)
After running for hours, the NL-43 becomes unreachable over TCP control.
### Symptom signature (WAN-side)
- Client attempts to connect to `63.45.161.30:2255`
- Instead of timing out, the client gets **connection refused** quickly.
- Packet-level: SYN from client → **RST,ACK** back (meaning active refusal vs silent drop)
### Critical operational behavior
- **Power cycling the NL-43 fixes it.**
- **Power cycling the RX55 does NOT fix it.**
- FTP sometimes remains available even while TCP control (2255) is dead.
This combination is what forced us to determine whether:
- The RX55 is rejecting connections, OR
- The NL-43 is no longer listening on 2255, OR
- Something about the RX55 path triggers the NL-43's control listener to die.
---
## 3) Event timeline evidence (SLMM logs)
A concrete wedge window was observed on **2026-02-18**:
- 10:55:46 AM — Poll success (Start)
- 11:00:28 AM — Measurement STOPPED (scheduled stop/download cycle succeeded)
- 11:55:50 AM — Poll success (Stop)
- 12:55:55 PM — Poll success (Stop)
- **1:55:58 PM — Poll failed (attempt 1/3): Errno 111 (connection refused)**
- 2:56:02 PM — Poll failed (attempt 2/3): Errno 111 (connection refused)
Key interpretation:
- The wedge occurred sometime between **12:55 PM and 1:55 PM**.
- The failure type is **refused**, not timeout.
---
## 4) Early hypotheses (before proof)
We considered two main buckets:
### A) NL-43-side failure (most suspicious)
- NL-43 TCP control service crashes / exits / unbinds from 2255
- socket leak / accept backlog exhaustion
- “single control session allowed” and it gets stuck thinking a session is active
- mode/service manager bug (service restart fails after other activities)
- firmware bug in TCP daemon
### B) RX55-side failure (possible trigger / less likely once FTP works)
- NAT/forwarding table corruption
- firewall behavior
- helper/ALG interference
- MSS/MTU weirdness causing edge-case behavior
- session churn behavior causing downstream issues
---
## 5) Key experiments and what they proved
### 5.1) LAN-only stability test (No RX55 path)
**Test:** NL-43 tested directly on LAN (no modem path involved).
- Ran **24+ hours**
- Scheduler start/stop cycles worked
- Stress test: **500 commands @ 1/sec** → no failure
- Response time trend decreased (not degrading)
**Result:** The NL-43 appears stable in a “pure LAN” environment.
**Interpretation:** The trigger is likely related to the RX55/WAN environment, connection patterns, or service switching patterns—not just simple uptime.
---
### 5.2) Port-forward behavior: timeout vs refused (RX55 behavior characterization)
You observed:
- **If a WAN port is NOT forwarded (no rule):** connecting to that port **times out** (silent drop)
- **If a WAN port IS forwarded to NL-43 but nothing listens:** it **actively refuses** (RST)
Concrete example:
- Port **4450** with no rule → timeout
- Port **4450 → NL-43:4450** rule created → connection refused
**Interpretation:** This confirms the RX55 is actually forwarding packets to the NL-43 when a rule exists. “Refused” is consistent with the NL-43 (or RX55 relay behavior) responding quickly because the packet reached the target.
Important nuance:
- A “refused” on forwarded ports does **not** automatically prove the NL-43 is the one generating RST, because NAT hides the inside host and the RX55 could reject on behalf of an unreachable target. We needed a LAN-side proof test to close the loop.
---
### 5.3) UDP test confusion (and resolution)
You ran:
```bash
nc -vzu 63.45.161.30 2255
nc -vz 63.45.161.30 2255
```
Observed:
- UDP: “succeeded”
- TCP: “connection refused”
Resolution:
- UDP has **no handshake**. netcat prints “succeeded” if it doesn't immediately receive an ICMP unreachable. It does **not** mean a UDP service exists.
- TCP refused is meaningful: a RST implies “no listener” or “actively rejected.”
**Net effect:** UDP test did not change the diagnosis.
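
The refused-versus-timeout distinction that drives this investigation can also be checked from a script. The sketch below is a generic TCP probe using only the Python standard library; `probe_tcp` is a hypothetical helper, not part of SLMM, and the host and port in the usage note are the ones from this log.

```python
import socket


def probe_tcp(host, port, timeout=3.0):
    """Classify a TCP connect attempt as 'open', 'refused', 'timeout', or another error."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"          # three-way handshake completed: something is listening
    except ConnectionRefusedError:
        return "refused"           # RST came back: host reachable, no listener on the port
    except socket.timeout:
        return "timeout"           # no reply at all: dropped or not forwarded
    except OSError as exc:
        return "error: %s" % exc   # e.g. no route to host
```

Calling `probe_tcp("63.45.161.30", 2255)` and `probe_tcp("63.45.161.30", 21)` would give the same signal as the `nc -vz` checks: during a wedge, 2255 is expected to report `refused` while 21 still reports `open`.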
---
### 5.4) Packet capture proof (WAN-side)
You captured a Wireshark/tcpdump summary with these key patterns:
#### Port 2255 (TCP control)
Example:
- `10.0.0.40 → 63.45.161.30:2255` SYN
- `63.45.161.30 → 10.0.0.40` **RST, ACK** within ~50ms
This happened repeatedly.
#### Port 2253 (test port)
Multiple SYN attempts to 2253 showed **retransmissions and no response**, i.e., **silent drop** (consistent with no rule or not forwarded at that moment).
#### Port 21 (FTP)
Clean 3-way handshake:
- SYN → SYN/ACK → ACK
Then:
- FTP server banner: `220 Connection Ready`
Then:
- `530 Not logged in` (because SLMM was sending non-FTP “requests” as an experiment)
Session closes cleanly.
**Key takeaway from capture:**
- TCP transport to NL-43 via RX55 is definitely working (port 21 proves it).
- Port 2255 is being actively refused.
This strongly suggested “2255 listener is gone,” but still didn't fully prove whether the refusal was generated by the NL-43 itself or by the RX55 on its behalf.
---
## 6) The decisive experiment: LAN-side test while wedged (final proof)
Because the RX55 does not offer SSH, the plan was to test from **inside the LAN behind the RX55**.
### 6.1) Physical LAN tap setup
Constraint:
- NL-43 has only one Ethernet port.
Solution:
- Insert an unmanaged switch:
  - RX55 LAN → switch
  - NL-43 → switch
  - Windows 10 laptop → switch
This creates a shared L2 segment where the laptop can test NL-43 directly.
### 6.2) Windows LAN validation
On the Windows laptop:
- `ipconfig` showed:
  - IP: `192.168.1.100`
  - Gateway: `192.168.1.1` (RX55)
- Initial `arp -a` only showed RX55, not NL-43.
You then:
- pinged likely host addresses and discovered NL-43 responds on **192.168.1.10**
- `arp -a` then showed:
  - `192.168.1.10 → 00-10-50-14-0a-d8`
  - OUI `00-10-50` recognized as **Rion** (matches NL-43)
So LAN identities were confirmed:
- RX55: `192.168.1.1`
- NL-43: `192.168.1.10`
### 6.3) The LAN port tests (the smoking gun)
From Windows:
```powershell
Test-NetConnection -ComputerName 192.168.1.10 -Port 2255
Test-NetConnection -ComputerName 192.168.1.10 -Port 21
```
Results (while the unit was “wedged” from the WAN perspective):
- **2255:** `TcpTestSucceeded : False`
- **21:** `TcpTestSucceeded : True`
**Conclusion (PROVEN):**
- The NL-43 is reachable on the LAN
- FTP port 21 is alive
- **The NL-43 is NOT listening on TCP port 2255**
- Therefore the RX55 is not the root cause of the refusal. The WAN refusal is consistent with the NL-43 having no listener on 2255.
This is now settled.
---
## 7) What we learned (final conclusions)
### 7.1) RX55 innocence (for this failure mode)
The RX55 is not “randomly rejecting” or “breaking TCP” in the way originally feared.
It successfully forwards and supports TCP to the NL-43 on port 21, and the LAN-side test proves the 2255 failure exists *even without NAT/WAN involvement*.
### 7.2) NL-43 control listener failure
The NL-43's TCP control service (port 2255) stops listening while:
- the device remains alive
- the LAN stack remains alive (ping)
- FTP remains alive (port 21)
This looks like one of:
- control daemon crash/exit
- service unbind
- stuck service state (e.g., “busy” / “session active forever”)
- resource leak (sockets/file descriptors) specific to the control service
- firmware service manager bug (start/stop of services fails after certain sequences)
---
## 8) Additional constraint discovered: “Web App mode” conflicts
You noted an important operational constraint:
> Turning on the web app disables other interfaces like TCP and FTP.
Meaning the NL-43 appears to have mutually exclusive service/mode behavior (or at least serious conflicts). That matters because:
- If any workflow toggles modes (explicitly or implicitly), it could destabilize the service lifecycle.
- It reduces the possibility of using “web UI toggle” as an easy remote recovery mechanism **if** it disables the services needed.
We have not yet run a controlled long test to determine whether:
- mode switching contributes directly to the 2255 listener dying, OR
- it happens even in a pure TCP-only mode with no switching.
---
## 9) Immediate operational decision (field tomorrow)
Because the device is needed in the field immediately, you chose:
- **Old-school manual deployment**
- **Manual SD card downloads**
- Avoid reliance on 2255/TCP control and remote workflows for now.
**Important operational note:**
The 2255 listener dying does not necessarily stop the NL-43 from measuring; it primarily breaks remote control/polling. Manual SD workflow sidesteps the entire remote control dependency.
---
## 10) What's next (future work — when the unit is back)
Because long tests can't be run before tomorrow, the plan is to resume in a few weeks with controlled experiments designed to isolate the trigger and develop an operational mitigation.
### 10.1) Controlled experiment matrix (recommended)
Run each test for 24–72 hours, or until wedge occurs, and record:
- number of TCP connects
- whether connections are persistent
- whether FTP is used
- whether any mode toggling is performed
- time-to-wedge
#### Test A — TCP-only (ideal baseline)
- TCP control only (2255)
- **True persistent connection** (open once, keep forever)
- No FTP
- No web mode toggling
Outcome interpretation:
- If stable: connection churn and/or FTP/mode switching is the trigger.
- If wedges anyway: pure 2255 daemon leak/bug.
#### Test B — TCP with connection churn
- Same as A but intentionally reconnect on a schedule (current SLMM behavior)
- No FTP
Outcome:
- If this wedges but A doesn't: churn is the trigger.
#### Test C — FTP activity + TCP
- Introduce scheduled FTP sessions (downloads) while using TCP control
- Observe whether wedge correlates with FTP use or with post-download periods.
Outcome:
- If wedge correlates with FTP, suspect internal service lifecycle conflict.
#### Test D — Web mode interaction (only if safe/possible)
- Evaluate what toggling web mode does to TCP/FTP services.
- Determine if any remote-safe “soft reset” exists.
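The outcome rules for Tests A to C can be written down as a small decision function, which helps keep the interpretation consistent once results come in. This is only a sketch: `RunResult`, `interpret`, and the field names are hypothetical, chosen to mirror the "record" list in 10.1, and Test D is left out because its outcome is exploratory.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class RunResult:
    """One row of the experiment matrix (field names are illustrative)."""
    tcp_connects: int
    persistent: bool
    ftp_used: bool
    mode_toggled: bool
    hours_to_wedge: Optional[float]  # None means no wedge during the run

    @property
    def wedged(self) -> bool:
        return self.hours_to_wedge is not None


def interpret(runs: Dict[str, RunResult]) -> str:
    """Apply the A/B/C outcome rules in order; keys are the test letters."""
    a, b, c = runs.get("A"), runs.get("B"), runs.get("C")
    if a is None:
        return "inconclusive: baseline Test A missing"
    if a.wedged:
        return "pure 2255 daemon leak/bug"
    if b is not None and b.wedged:
        return "connection churn is the trigger"
    if c is not None and c.wedged:
        return "suspect FTP / service lifecycle conflict"
    return "inconclusive: no wedge reproduced"
```

The baseline is checked first on purpose: if Test A wedges on its own, results from B and C cannot isolate churn or FTP as the trigger.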
---
## 11) Mitigation options (ranked)
### Option 1 — Make SLMM truly persistent (highest probability of success)
If the NL-43 wedges due to session churn or leaked socket states, the best mitigation is:
- Open one TCP socket per device
- Keep it open indefinitely
- Use OS keepalive
- Do **not** rotate connections on timers
- Reconnect only when the socket actually dies
This reduces:
- connect/close cycles
- NAT edge-case exposure
- resource churn inside NL-43
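A minimal sketch of "one socket, OS keepalive" using Python's standard `socket` module follows. It is not how SLMM's connection pool is implemented; `open_persistent` and the keepalive timings are assumptions for illustration, and the `TCP_KEEP*` tuning options are platform-dependent, so each one is applied only if the platform exposes it.

```python
import socket


def open_persistent(host: str, port: int, timeout: float = 5.0,
                    idle_s: int = 60, interval_s: int = 15,
                    probes: int = 4) -> socket.socket:
    """Open one TCP connection and enable OS-level keepalive on it."""
    sock = socket.create_connection((host, port), timeout=timeout)
    # Ask the OS to probe the peer when the connection is idle.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Optional tuning: idle time before probing, probe interval, probe count.
    for name, value in (("TCP_KEEPIDLE", idle_s),
                        ("TCP_KEEPINTVL", interval_s),
                        ("TCP_KEEPCNT", probes)):
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)
    return sock
```

The caller would hold the returned socket for the life of the device session and call `open_persistent` again only after a send or receive actually fails.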
### Option 2 — Service “soft reset” (if possible without disabling required services)
If there exists any way to restart the 2255 service without power cycling:
- LAN TCP toggle (if it doesn't require web mode)
- any “restart comms” command (unknown)
- any maintenance menu sequence
then SLMM could:
- detect wedge
- trigger soft reset
- recover automatically
Current constraint: web app mode appears to disable other services, so this may not be viable.
### Option 3 — Hardware watchdog power cycle (industrial but reliable)
If this is a firmware bug with no clean workaround:
- Add a remotely controlled relay/power switch
- On wedge detection, power-cycle NL-43 automatically
- Optionally schedule a nightly power cycle to prevent leak accumulation
This is “field reality” and often the only long-term move with embedded devices.
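The wedge signature established in section 6.3 (2255 closed while 21 still answers) gives the watchdog a concrete trigger. The sketch below debounces that signature before acting; `WedgeWatchdog`, the `power_cycle` hook, and the threshold of three consecutive checks are all hypothetical, and the relay control itself is left to whatever hardware is chosen.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WedgeWatchdog:
    power_cycle: Callable[[], None]  # hypothetical hook that toggles the relay
    threshold: int = 3               # consecutive wedge observations required
    strikes: int = 0

    def observe(self, control_open: bool, ftp_open: bool) -> bool:
        """Feed one pair of probe results; return True if a power cycle fired."""
        # Wedge signature: control port dead while FTP proves the unit is alive.
        wedged = (not control_open) and ftp_open
        self.strikes = self.strikes + 1 if wedged else 0
        if self.strikes >= self.threshold:
            self.strikes = 0
            self.power_cycle()
            return True
        return False
```

Requiring FTP to be alive keeps the watchdog from power-cycling during an ordinary cellular outage, when both ports are unreachable.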
### Option 4 — Vendor escalation (Rion)
You now have excellent evidence:
- LAN-side proof: 2255 dead while 21 alive
- WAN packet evidence
- clear isolation of RX55 innocence
This is strong enough to send to Rion support as a firmware defect report.
---
## 12) Repro “wedge bundle” checklist (for future captures)
When the wedge happens again, capture these before power cycling:
1) From server:
   - `nc -vz 63.45.161.30 2255` (expect refused)
   - `nc -vz 63.45.161.30 21` (expect success if FTP alive)
2) From LAN side (via switch/laptop):
   - `Test-NetConnection 192.168.1.10 -Port 2255`
   - `Test-NetConnection 192.168.1.10 -Port 21`
3) Optional: packet capture around the refused attempt.
4) Record:
   - last successful poll timestamp
   - last FTP session timestamp
   - any scheduled start/stop/download cycles near wedge time
   - SLMM connection reuse/rotation settings in effect
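To keep bundles comparable between incidents, the checklist items can be written into one timestamped JSON record. This is a sketch only: `build_wedge_bundle` and its key names are hypothetical, and the probe results are passed in as plain strings rather than collected by the function.

```python
import json
from datetime import datetime, timezone
from typing import Dict, Optional


def build_wedge_bundle(wan: Dict[int, str], lan: Dict[int, str],
                       last_poll: Optional[str], last_ftp: Optional[str],
                       notes: str = "") -> str:
    """Assemble one wedge-bundle record as a JSON string."""
    bundle = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        # Port -> observed result, e.g. {2255: "refused", 21: "open"}.
        "wan_probes": {str(port): result for port, result in wan.items()},
        "lan_probes": {str(port): result for port, result in lan.items()},
        "last_successful_poll": last_poll,
        "last_ftp_session": last_ftp,
        "notes": notes,  # scheduled cycles, reuse/rotation settings, etc.
    }
    return json.dumps(bundle, indent=2, sort_keys=True)
```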
---
## 13) Final, current-state summary (as of 2026-02-18)
- The issue is **NOT** the RX55 rejecting inbound connections.
- The NL-43 is **alive**, reachable on LAN, and FTP works.
- The NL-43's **TCP control listener on 2255 stops listening** while the device remains otherwise healthy.
- The wedge can occur hours after successful operations.
- The unit is needed in the field immediately, so investigation pauses.
- Next phase: controlled tests to isolate trigger + implement mitigation (persistent socket or watchdog reset).
---
## 14) Notes / misc observations
- The Wireshark trace showed repeated FTP sessions were opened and closed cleanly, but SLMM's “FTP requests” were not valid FTP (causing `530 Not logged in`). That was part of experimentation, not a normal workflow.
- UDP “success” via netcat is not meaningful because UDP has no handshake; it simply indicates no ICMP unreachable was returned.
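The UDP point is easy to demonstrate: sending a datagram reports success even when nothing is listening, because the sender only learns that the OS accepted the bytes. A small sketch, with `udp_send_reports_success` as a hypothetical helper name:

```python
import socket


def udp_send_reports_success(host: str, port: int) -> int:
    """Send one datagram and return the byte count the OS accepted.

    No handshake takes place, so the return value says nothing about
    whether any process received the data.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(b"ping", (host, port))
```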
---
**End of document.**
@@ -0,0 +1,322 @@
"""
Threshold alert engine.

Each unit can have any number of AlertRules. A rule is evaluated against the
unit's live monitor snapshots via a small per-(unit, rule) state machine:

    IDLE   --(metric exceeds threshold for duration_s)-->        ACTIVE (fire ONSET)
    ACTIVE --(metric recovers past hysteresis for duration_s)--> IDLE   (fire CLEAR)

duration_s debounces both edges; clear_margin_db adds hysteresis so a level
hovering at the threshold doesn't flap. Onset and clear are distinct events.

The state-machine logic (`_evaluate_step`) is intentionally pure — no DB, no
real clock — so it can be unit-tested with a synthetic level series and a fake
clock. The AlertEvaluator wraps it with rule loading, scheduling, persistence,
and dispatch. Dispatch is a server log for now (POC); the seam to POST events to
a Terra-View webhook (email/SMS) is _dispatch().
"""
import asyncio
import logging
import os
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)

# Local timezone offset for schedule windows (same env var services.py uses).
_TZ_OFFSET_HOURS = float(os.getenv("TIMEZONE_OFFSET", "-5"))

# How long to cache a unit's rules before re-querying the DB (rules change rarely).
_RULE_CACHE_TTL_S = 15.0


@dataclass
class RuleState:
    """In-memory runtime state for one (unit, rule)."""
    phase: str = "idle"                 # "idle" | "active"
    edge_since: Optional[float] = None  # when the current edge condition began (clock time)
    peak: float = 0.0
    event_id: Optional[int] = None      # the open AlertEvent row (for the clear update)
    last_onset: Optional[float] = None  # time of the last onset (for cooldown)


def _exceeds(value: float, rule) -> bool:
    if rule.comparison == "below":
        return value < rule.threshold_db
    return value > rule.threshold_db


def _recovered(value: float, rule) -> bool:
    margin = rule.clear_margin_db or 0.0
    if rule.comparison == "below":
        return value > rule.threshold_db + margin
    return value < rule.threshold_db - margin


def _evaluate_step(state: RuleState, value: float, now: float, rule) -> Optional[str]:
    """Advance the state machine by one reading.

    Pure: mutates `state`, returns 'onset' | 'clear' | None. `now` is injected so
    tests can drive a fake clock.
    """
    duration = rule.duration_s or 0

    if state.phase == "idle":
        if _exceeds(value, rule):
            if state.edge_since is None:
                state.edge_since = now
            if now - state.edge_since >= duration:
                # Cooldown: suppress a new onset within cooldown_s of the last one
                # (stops a repeatedly-breaching signal from flooding the history).
                # Hold edge_since so it fires the moment cooldown lapses if still
                # breaching — don't reset it here.
                cooldown = getattr(rule, "cooldown_s", 0) or 0
                if state.last_onset is not None and (now - state.last_onset) < cooldown:
                    return None
                state.phase = "active"
                state.edge_since = None
                state.peak = value
                state.last_onset = now
                return "onset"
        else:
            state.edge_since = None
        return None

    # active
    if rule.comparison == "below":
        state.peak = min(state.peak, value)
    else:
        state.peak = max(state.peak, value)
    if _recovered(value, rule):
        if state.edge_since is None:
            state.edge_since = now
        if now - state.edge_since >= duration:
            state.phase = "idle"
            state.edge_since = None
            return "clear"
    else:
        state.edge_since = None
    return None


def _in_window(now_minutes: int, start: str, end: str) -> bool:
    """Is now_minutes (minutes since local midnight) within [start, end)?

    Handles wraparound windows like 22:00–07:00."""
    def _m(s: str) -> int:
        h, m = s.split(":")
        return int(h) * 60 + int(m)

    s, e = _m(start), _m(end)
    if s == e:
        return True
    if s < e:
        return s <= now_minutes < e
    return now_minutes >= s or now_minutes < e  # wraparound


class AlertEvaluator:
    def __init__(self):
        self._states: Dict[Tuple[str, int], RuleState] = {}
        self._rule_cache: Dict[str, Tuple[float, list]] = {}  # unit_id -> (fetched_at, rules)
        self._offline_events: Dict[str, int] = {}  # unit_id -> open connectivity AlertEvent id
        logger.info("[ALERT] rule-based evaluator ready")

    async def evaluate(self, unit_id: str, snap) -> None:
        """Evaluate every enabled rule for this unit against one snapshot."""
        rules = self._get_rules(unit_id)
        if not rules:
            return
        now = asyncio.get_running_loop().time()
        for rule in rules:
            if not self._in_schedule(rule):
                continue
            raw = getattr(snap, rule.metric, None)
            try:
                value = float(raw)
            except (TypeError, ValueError):
                continue  # missing / non-numeric ("-.-")
            state = self._states.setdefault((unit_id, rule.id), RuleState())
            action = _evaluate_step(state, value, now, rule)
            if action == "onset":
                await self._on_onset(unit_id, rule, value, state)
            elif action == "clear":
                await self._on_clear(unit_id, rule, value, state)

    # -- rule loading (cached) ----------------------------------------------
    def _get_rules(self, unit_id: str) -> list:
        loop_now = asyncio.get_running_loop().time()
        cached = self._rule_cache.get(unit_id)
        if cached and loop_now - cached[0] < _RULE_CACHE_TTL_S:
            return cached[1]
        rules = self._load_rules(unit_id)
        self._rule_cache[unit_id] = (loop_now, rules)
        return rules

    def _load_rules(self, unit_id: str) -> list:
        from app.database import SessionLocal
        from app.models import AlertRule
        db = SessionLocal()
        try:
            return db.query(AlertRule).filter_by(unit_id=unit_id, enabled=True).all()
        except Exception as e:
            logger.warning(f"[ALERT] failed to load rules for {unit_id}: {e}")
            return []
        finally:
            db.close()

    def invalidate(self, unit_id: Optional[str] = None) -> None:
        """Drop cached rules so a change is picked up immediately."""
        if unit_id is None:
            self._rule_cache.clear()
        else:
            self._rule_cache.pop(unit_id, None)

    def forget_rule(self, unit_id: str, rule_id: int) -> None:
        """Drop a rule's per-(unit, rule) state machine after the rule is edited or
        deleted, so a stale 'active' phase / open event_id from the old config
        doesn't bleed into the new one (mis-firing a clear or suppressing an onset)."""
        self._states.pop((unit_id, rule_id), None)

    # -- scheduling ----------------------------------------------------------
    def _in_schedule(self, rule) -> bool:
        if not rule.schedule_start or not rule.schedule_end:
            day_ok = self._day_ok(rule)
            return day_ok
        local = datetime.utcnow() + timedelta(hours=_TZ_OFFSET_HOURS)
        if not self._day_ok(rule, local):
            return False
        return _in_window(local.hour * 60 + local.minute, rule.schedule_start, rule.schedule_end)

    @staticmethod
    def _day_ok(rule, local: Optional[datetime] = None) -> bool:
        if not rule.schedule_days:
            return True
        if local is None:
            local = datetime.utcnow() + timedelta(hours=_TZ_OFFSET_HOURS)
        allowed = {int(d) for d in str(rule.schedule_days).split(",") if d.strip() != ""}
        return local.weekday() in allowed  # Mon=0

    # -- event persistence + dispatch ---------------------------------------
    async def _on_onset(self, unit_id: str, rule, value: float, state: RuleState) -> None:
        from app.database import SessionLocal
        from app.models import AlertEvent
        db = SessionLocal()
        try:
            evt = AlertEvent(
                rule_id=rule.id, unit_id=unit_id, rule_name=rule.name,
                metric=rule.metric, threshold_db=rule.threshold_db,
                onset_value=value, peak_value=value, status="active",
            )
            db.add(evt)
            db.commit()
            db.refresh(evt)
            state.event_id = evt.id
        except Exception as e:
            logger.warning(f"[ALERT] failed to record onset for {unit_id}: {e}")
        finally:
            db.close()
        await self._dispatch(
            "ONSET", unit_id, rule,
            f"{rule.metric.upper()}={value:.1f} dB "
            f"{'<' if rule.comparison == 'below' else '>'} {rule.threshold_db:.1f} dB"
            f"{f' for {rule.duration_s}s' if rule.duration_s else ''}",
        )

    async def _on_clear(self, unit_id: str, rule, value: float, state: RuleState) -> None:
        peak = state.peak
        from app.database import SessionLocal
        from app.models import AlertEvent
        db = SessionLocal()
        try:
            if state.event_id is not None:
                evt = db.query(AlertEvent).filter_by(id=state.event_id).first()
                if evt:
                    evt.clear_at = datetime.utcnow()
                    evt.peak_value = peak
                    evt.status = "cleared"
                    db.commit()
        except Exception as e:
            logger.warning(f"[ALERT] failed to record clear for {unit_id}: {e}")
        finally:
            db.close()
        state.event_id = None
        await self._dispatch(
            "CLEAR", unit_id, rule,
            f"recovered to {value:.1f} dB (peak {peak:.1f} dB)",
        )

    # -- connectivity (device offline/online) -------------------------------
    #
    # Raised by the live monitor when it loses / regains contact with a device.
    # Persisted as an AlertEvent (sentinel rule_id=0, metric="connectivity") so it
    # lands in the same events/inbox/ack pipeline as threshold alerts. The in-memory
    # map dedupes; the DB query also dedupes across a process restart.
    async def device_offline(self, unit_id: str) -> None:
        if unit_id in self._offline_events:
            return  # already flagged offline
        from app.database import SessionLocal
        from app.models import AlertEvent
        db = SessionLocal()
        try:
            existing = db.query(AlertEvent).filter_by(
                unit_id=unit_id, metric="connectivity", status="active").first()
            if existing:  # already open in the DB (e.g. carried across a restart)
                self._offline_events[unit_id] = existing.id
                return
            evt = AlertEvent(
                rule_id=0, unit_id=unit_id, rule_name="Device unreachable",
                metric="connectivity", threshold_db=0.0, status="active",
            )
            db.add(evt)
            db.commit()
            db.refresh(evt)
            self._offline_events[unit_id] = evt.id
        except Exception as e:
            logger.warning(f"[ALERT] failed to record offline for {unit_id}: {e}")
        finally:
            db.close()
        await self._dispatch_raw("OFFLINE", unit_id, "Device unreachable",
                                 "live monitor lost contact with the device")

    async def device_online(self, unit_id: str) -> None:
        self._offline_events.pop(unit_id, None)
        from app.database import SessionLocal
        from app.models import AlertEvent
        db = SessionLocal()
        cleared = 0
        try:
            opened = db.query(AlertEvent).filter_by(
                unit_id=unit_id, metric="connectivity", status="active").all()
            for evt in opened:
                evt.clear_at = datetime.utcnow()
                evt.status = "cleared"
                cleared += 1
            if cleared:
                db.commit()
        except Exception as e:
            logger.warning(f"[ALERT] failed to record online for {unit_id}: {e}")
        finally:
            db.close()
        if cleared:  # only announce recovery if it was actually flagged offline
            await self._dispatch_raw("ONLINE", unit_id, "Device recovered",
                                     "live monitor regained contact with the device")

    # -- event persistence + dispatch ---------------------------------------
    async def _dispatch(self, kind: str, unit_id: str, rule, detail: str) -> None:
        await self._dispatch_raw(kind, unit_id, rule.name, detail)

    async def _dispatch_raw(self, kind: str, unit_id: str, name: str, detail: str) -> None:
        """POC dispatch: server log. Swap in a Terra-View webhook (email/SMS) here."""
        logger.warning(f"[ALERT:{kind}] {unit_id} '{name}': {detail}")


# Module-level singleton (the monitor calls alert_evaluator.evaluate per snapshot)
alert_evaluator = AlertEvaluator()
@@ -0,0 +1,411 @@
"""
Background polling service for NL43 devices.
This module provides continuous, automatic polling of configured NL43 devices
at configurable intervals. Status snapshots are persisted to the database
for fast API access without querying devices on every request.
"""
import asyncio
import logging
import os
from datetime import datetime, timedelta
from typing import Optional

from sqlalchemy.orm import Session

from app.database import SessionLocal
from app.models import NL43Config, NL43Status
from app.services import NL43Client, persist_snapshot, sync_measurement_start_time_from_ftp
from app.device_logger import log_device_event, cleanup_old_logs

logger = logging.getLogger(__name__)

# Global polling default. Set SLMM_POLLING_ENABLED=false to start an instance in
# standby (running but not polling and not holding device connections) — e.g. a
# dev box that must not latch onto a device that a prod instance owns.
POLLING_ENABLED_DEFAULT = os.getenv("SLMM_POLLING_ENABLED", "true").lower() == "true"


class BackgroundPoller:
    """
    Background task that continuously polls NL43 devices and updates status cache.

    Features:
    - Per-device configurable poll intervals (30 seconds to 6 hours)
    - Automatic offline detection (marks unreachable after 3 consecutive failures)
    - Dynamic sleep intervals based on device configurations
    - Graceful shutdown on application stop
    - Respects existing rate limiting (1-second minimum between commands)
    """

    def __init__(self):
        self._task: Optional[asyncio.Task] = None
        self._running = False
        self._logger = logger
        self._last_cleanup = None  # Track last log cleanup time
        self._last_pool_log = None  # Track last connection pool heartbeat log
        self._active = POLLING_ENABLED_DEFAULT  # Global polling on/off (standby toggle)

    async def start(self):
        """Start the background polling task."""
        if self._running:
            self._logger.warning("Background poller already running")
            return

        self._running = True
        self._task = asyncio.create_task(self._poll_loop())
        self._logger.info("Background poller task created")

    async def stop(self):
        """Gracefully stop the background polling task."""
        if not self._running:
            return

        self._logger.info("Stopping background poller...")
        self._running = False

        if self._task:
            try:
                await asyncio.wait_for(self._task, timeout=5.0)
            except asyncio.TimeoutError:
                self._logger.warning("Background poller task did not stop gracefully, cancelling...")
                self._task.cancel()
                try:
                    await self._task
                except asyncio.CancelledError:
                    pass

        self._logger.info("Background poller stopped")

    def is_active(self) -> bool:
        """Whether background polling is currently active (vs standby)."""
        return self._active

    async def set_active(self, active: bool):
        """Globally enable/disable polling at runtime.

        When deactivated, the loop stays alive but polls nothing and releases all
        device connections, so this SLMM instance stops occupying the devices'
        single connection slots (e.g. so a prod instance can take over). Runtime
        state only — on restart the instance returns to SLMM_POLLING_ENABLED.
        """
        self._active = active
        if active:
            self._logger.info("[SYSTEM] Background polling ACTIVATED")
        else:
            self._logger.info("[SYSTEM] Background polling DEACTIVATED (standby) — releasing connections")
            await self._release_all_connections()

    async def _release_all_connections(self):
        """Gracefully close every pooled device connection (no-op if none)."""
        from app.services import _connection_pool
        for device_key in list(_connection_pool.get_stats().get("connections", {})):
            await _connection_pool.discard(device_key)

    async def _poll_loop(self):
        """Main polling loop that runs continuously."""
        self._logger.info("Background polling loop started")

        while self._running:
            if self._active:
                try:
                    await self._poll_all_devices()
                except Exception as e:
                    self._logger.error(f"Error in poll loop: {e}", exc_info=True)
            else:
                # Standby: poll nothing, and keep holding no device connection slots
                # so another SLMM instance (e.g. prod) can talk to the devices.
                try:
                    await self._release_all_connections()
                except Exception as e:
                    self._logger.warning(f"Standby connection release failed: {e}")

            # Run log cleanup once per hour
            try:
                now = datetime.utcnow()
                if self._last_cleanup is None or (now - self._last_cleanup).total_seconds() > 3600:
                    cleanup_old_logs()
                    self._last_cleanup = now
            except Exception as e:
                self._logger.warning(f"Log cleanup failed: {e}")

            # Log connection pool status every 15 minutes
            try:
                now = datetime.utcnow()
                if self._last_pool_log is None or (now - self._last_pool_log).total_seconds() > 900:
                    from app.services import _connection_pool
                    stats = _connection_pool.get_stats()
                    conns = stats.get("connections", {})
                    if conns:
                        for key, c in conns.items():
                            self._logger.info(
                                f"[POOL] {key} — age={c['age_seconds']}s idle={c['idle_seconds']}s alive={c['alive']}"
                            )
                    else:
                        self._logger.info("[POOL] No active connections in pool")
                    self._last_pool_log = now
            except Exception as e:
                self._logger.warning(f"Pool status log failed: {e}")

            # Calculate dynamic sleep interval
            sleep_time = self._calculate_sleep_interval()
            self._logger.debug(f"Sleeping for {sleep_time} seconds until next poll cycle")

            # Sleep in small intervals to allow graceful shutdown
            for _ in range(int(sleep_time)):
                if not self._running:
                    break
                await asyncio.sleep(1)

        self._logger.info("Background polling loop exited")

    async def _poll_all_devices(self):
        """Poll all configured devices that are due for polling."""
        db: Session = SessionLocal()
        try:
            # Get all devices with TCP and polling enabled
            configs = db.query(NL43Config).filter_by(
                tcp_enabled=True,
                poll_enabled=True
            ).all()

            if not configs:
                self._logger.debug("No devices configured for polling")
                return

            self._logger.debug(f"Checking {len(configs)} devices for polling")
            now = datetime.utcnow()
            polled_count = 0

            from app.monitor import monitor_manager

            for cfg in configs:
                if not self._running:
                    break

                # Skip units with an active live monitor: it polls them at ~1Hz and
                # keeps the status cache fresh, so a redundant background poll would just
                # add load/lock-contention on the device's single connection.
                if monitor_manager.is_active(cfg.unit_id):
                    self._logger.debug(f"Skipping {cfg.unit_id} — live monitor active")
                    continue

                # Get current status
                status = db.query(NL43Status).filter_by(unit_id=cfg.unit_id).first()

                # Check if device should be polled
                if self._should_poll(cfg, status, now):
                    await self._poll_device(cfg, db)
                    polled_count += 1
                else:
                    self._logger.debug(f"Skipping {cfg.unit_id} - interval not elapsed")

            if polled_count > 0:
                self._logger.info(f"Polled {polled_count}/{len(configs)} devices")
        finally:
            db.close()

    def _should_poll(self, cfg: NL43Config, status: Optional[NL43Status], now: datetime) -> bool:
        """
        Determine if a device should be polled based on interval and last poll time.

        Args:
            cfg: Device configuration
            status: Current device status (may be None if never polled)
            now: Current UTC timestamp

        Returns:
            True if device should be polled, False otherwise
        """
        # If never polled before, poll now
        if not status or not status.last_poll_attempt:
            self._logger.debug(f"Device {cfg.unit_id} never polled, polling now")
            return True

        # Calculate elapsed time since last poll attempt
        interval = cfg.poll_interval_seconds or 60
        elapsed = (now - status.last_poll_attempt).total_seconds()
        should_poll = elapsed >= interval

        if should_poll:
            self._logger.debug(
                f"Device {cfg.unit_id} due for polling: {elapsed:.1f}s elapsed, interval={interval}s"
            )
        return should_poll

    async def _poll_device(self, cfg: NL43Config, db: Session):
        """
        Poll a single device and update its status in the database.

        Args:
            cfg: Device configuration
            db: Database session
        """
        unit_id = cfg.unit_id
        self._logger.info(f"Polling device {unit_id} at {cfg.host}:{cfg.tcp_port}")

        # Get or create status record
        status = db.query(NL43Status).filter_by(unit_id=unit_id).first()
        if not status:
            status = NL43Status(unit_id=unit_id)
            db.add(status)

        # Update last_poll_attempt immediately
        status.last_poll_attempt = datetime.utcnow()
        db.commit()

        # Create client and attempt to poll
        client = NL43Client(
            cfg.host,
            cfg.tcp_port,
            timeout=5.0,
            ftp_username=cfg.ftp_username,
            ftp_password=cfg.ftp_password,
            ftp_port=cfg.ftp_port or 21
        )

        try:
            # Send DOD? command to get device status
            snap = await client.request_dod()
            snap.unit_id = unit_id

            # Success - persist snapshot and reset failure counter
            persist_snapshot(snap, db)
            status.is_reachable = True
            status.consecutive_failures = 0
            status.last_success = datetime.utcnow()
            status.last_error = None
            db.commit()
            self._logger.info(f"✓ Successfully polled {unit_id}")

            # Log to device log
            log_device_event(
                unit_id, "INFO", "POLL",
                f"Poll success: state={snap.measurement_state}, Leq={snap.leq}, Lp={snap.lp}",
                db
            )

            # Check if device is measuring but has no start time recorded
            # This happens if measurement was started before SLMM began polling
            # or after a service restart
            status = db.query(NL43Status).filter_by(unit_id=unit_id).first()

            # Reset the sync flag when measurement stops (so next measurement can sync)
            if status and status.measurement_state != "Start":
                if status.start_time_sync_attempted:
                    status.start_time_sync_attempted = False
                    db.commit()
                    self._logger.debug(f"Reset FTP sync flag for {unit_id} (measurement stopped)")
                    log_device_event(unit_id, "DEBUG", "STATE", "Measurement stopped, reset FTP sync flag", db)

            # Attempt FTP sync if:
            # - Device is measuring
            # - No start time recorded
            # - FTP sync not already attempted for this measurement
            # - FTP is configured
            if (status and
                    status.measurement_state == "Start" and
                    status.measurement_start_time is None and
                    not status.start_time_sync_attempted and
                    cfg.ftp_enabled and
                    cfg.ftp_username and
                    cfg.ftp_password):
                self._logger.info(
                    f"Device {unit_id} is measuring but has no start time - "
                    f"attempting FTP sync"
                )
                log_device_event(unit_id, "INFO", "SYNC", "Attempting FTP sync for measurement start time", db)

                # Mark that we attempted sync (prevents repeated attempts on failure)
                status.start_time_sync_attempted = True
                db.commit()

                try:
                    synced = await sync_measurement_start_time_from_ftp(
                        unit_id=unit_id,
                        host=cfg.host,
                        tcp_port=cfg.tcp_port,
                        ftp_port=cfg.ftp_port or 21,
                        ftp_username=cfg.ftp_username,
                        ftp_password=cfg.ftp_password,
                        db=db
                    )
                    if synced:
                        self._logger.info(f"✓ FTP sync succeeded for {unit_id}")
                        log_device_event(unit_id, "INFO", "SYNC", "FTP sync succeeded - measurement start time updated", db)
                    else:
                        self._logger.warning(f"FTP sync returned False for {unit_id}")
                        log_device_event(unit_id, "WARNING", "SYNC", "FTP sync returned False", db)
                except Exception as sync_err:
                    self._logger.warning(
                        f"FTP sync failed for {unit_id}: {sync_err}"
                    )
                    log_device_event(unit_id, "ERROR", "SYNC", f"FTP sync failed: {sync_err}", db)

        except Exception as e:
            # Failure - increment counter and potentially mark offline
            status.consecutive_failures += 1
            error_msg = str(e)[:500]  # Truncate to prevent bloat
            status.last_error = error_msg

            # Mark unreachable after 3 consecutive failures
            if status.consecutive_failures >= 3:
                if status.is_reachable:  # Only log transition
                    self._logger.warning(
                        f"Device {unit_id} marked unreachable after {status.consecutive_failures} failures: {error_msg}"
                    )
                    log_device_event(unit_id, "ERROR", "POLL", f"Device marked UNREACHABLE after {status.consecutive_failures} failures: {error_msg}", db)
                status.is_reachable = False
            else:
                self._logger.warning(
                    f"Poll failed for {unit_id} (attempt {status.consecutive_failures}/3): {error_msg}"
                )
                log_device_event(unit_id, "WARNING", "POLL", f"Poll failed (attempt {status.consecutive_failures}/3): {error_msg}", db)

            db.commit()

    def _calculate_sleep_interval(self) -> int:
        """
        Calculate the next sleep interval based on all device poll intervals.

        Returns a dynamic sleep time that ensures responsive polling:
        - Minimum 30 seconds (prevents tight loops)
        - Maximum 300 seconds / 5 minutes (ensures reasonable responsiveness for long intervals)
        - Generally half the minimum device interval

        Returns:
            Sleep interval in seconds
        """
        db: Session = SessionLocal()
        try:
            configs = db.query(NL43Config).filter_by(
                tcp_enabled=True,
                poll_enabled=True
            ).all()

            if not configs:
                return 60  # Default sleep when no devices configured

            # Get all intervals
            intervals = [cfg.poll_interval_seconds or 60 for cfg in configs]
            min_interval = min(intervals)

            # Use half the minimum interval, but cap between 30-300 seconds
            # This allows longer sleep times when polling intervals are long (e.g., hourly)
            sleep_time = max(30, min(300, min_interval // 2))
            return sleep_time
        finally:
            db.close()


# Global singleton instance
poller = BackgroundPoller()
@@ -0,0 +1,277 @@
"""
Per-device logging system.
Provides dual output: database entries for structured queries and file logs for backup.
Each device gets its own log file in data/logs/{unit_id}.log with rotation.
"""
import logging
import os
from datetime import datetime, timedelta
from logging.handlers import RotatingFileHandler
from pathlib import Path
from typing import Optional

from sqlalchemy.orm import Session

from app.database import SessionLocal
from app.models import DeviceLog

# Configure base logger
logger = logging.getLogger(__name__)

# Log directory (persisted in Docker volume)
LOG_DIR = Path(os.path.dirname(os.path.dirname(__file__))) / "data" / "logs"
LOG_DIR.mkdir(parents=True, exist_ok=True)

# Per-device file loggers (cached)
_device_file_loggers: dict = {}

# Log retention (days)
LOG_RETENTION_DAYS = int(os.getenv("LOG_RETENTION_DAYS", "7"))


def _get_file_logger(unit_id: str) -> logging.Logger:
    """Get or create a file logger for a specific device."""
    if unit_id in _device_file_loggers:
        return _device_file_loggers[unit_id]

    # Create device-specific logger
    device_logger = logging.getLogger(f"device.{unit_id}")
    device_logger.setLevel(logging.DEBUG)

    # Avoid duplicate handlers
    if not device_logger.handlers:
        # Create rotating file handler (5 MB max, keep 3 backups)
        log_file = LOG_DIR / f"{unit_id}.log"
        handler = RotatingFileHandler(
            log_file,
            maxBytes=5 * 1024 * 1024,  # 5 MB
            backupCount=3,
            encoding="utf-8"
        )
        handler.setLevel(logging.DEBUG)

        # Format: timestamp [LEVEL] [CATEGORY] message
        formatter = logging.Formatter(
            "%(asctime)s [%(levelname)s] [%(category)s] %(message)s",
            datefmt="%Y-%m-%d %H:%M:%S"
        )
        handler.setFormatter(formatter)
        device_logger.addHandler(handler)

        # Don't propagate to root logger
        device_logger.propagate = False

    _device_file_loggers[unit_id] = device_logger
    return device_logger
def log_device_event(
    unit_id: str,
    level: str,
    category: str,
    message: str,
    db: Optional[Session] = None
):
    """
    Log an event for a specific device.

    Writes to both:
    1. Database (DeviceLog table) for structured queries
    2. File (data/logs/{unit_id}.log) for backup/debugging

    Args:
        unit_id: Device identifier
        level: Log level (DEBUG, INFO, WARNING, ERROR)
        category: Event category (TCP, FTP, POLL, COMMAND, STATE, SYNC)
        message: Log message
        db: Optional database session (creates one if not provided)
    """
    timestamp = datetime.utcnow()

    # Write to file log
    try:
        file_logger = _get_file_logger(unit_id)
        log_func = getattr(file_logger, level.lower(), file_logger.info)
        # Pass category as extra for formatter
        log_func(message, extra={"category": category})
    except Exception as e:
        logger.warning(f"Failed to write file log for {unit_id}: {e}")

    # Write to database
    close_db = False
    try:
        if db is None:
            db = SessionLocal()
            close_db = True

        log_entry = DeviceLog(
            unit_id=unit_id,
            timestamp=timestamp,
            level=level.upper(),
            category=category.upper(),
            message=message
        )
        db.add(log_entry)
        db.commit()
    except Exception as e:
        logger.warning(f"Failed to write DB log for {unit_id}: {e}")
        if db:
            db.rollback()
    finally:
        if close_db and db:
            db.close()
def cleanup_old_logs(retention_days: Optional[int] = None, db: Optional[Session] = None):
    """
    Delete log entries older than retention period.

    Args:
        retention_days: Days to retain (default: LOG_RETENTION_DAYS env var or 7)
        db: Optional database session
    """
    if retention_days is None:
        retention_days = LOG_RETENTION_DAYS

    cutoff = datetime.utcnow() - timedelta(days=retention_days)

    close_db = False
    try:
        if db is None:
            db = SessionLocal()
            close_db = True

        deleted = db.query(DeviceLog).filter(DeviceLog.timestamp < cutoff).delete()
        db.commit()

        if deleted > 0:
            logger.info(f"Cleaned up {deleted} log entries older than {retention_days} days")
    except Exception as e:
        logger.error(f"Failed to cleanup old logs: {e}")
        if db:
            db.rollback()
    finally:
        if close_db and db:
            db.close()
def get_device_logs(
    unit_id: str,
    limit: int = 100,
    offset: int = 0,
    level: Optional[str] = None,
    category: Optional[str] = None,
    since: Optional[datetime] = None,
    db: Optional[Session] = None
) -> list:
    """
    Query log entries for a specific device.

    Args:
        unit_id: Device identifier
        limit: Max entries to return (default: 100)
        offset: Number of entries to skip (default: 0)
        level: Filter by level (DEBUG, INFO, WARNING, ERROR)
        category: Filter by category (TCP, FTP, POLL, COMMAND, STATE, SYNC)
        since: Filter entries after this timestamp
        db: Optional database session

    Returns:
        List of log entries as dicts
    """
    close_db = False
    try:
        if db is None:
            db = SessionLocal()
            close_db = True

        query = db.query(DeviceLog).filter(DeviceLog.unit_id == unit_id)

        if level:
            query = query.filter(DeviceLog.level == level.upper())
        if category:
            query = query.filter(DeviceLog.category == category.upper())
        if since:
            query = query.filter(DeviceLog.timestamp >= since)

        # Order by newest first
        query = query.order_by(DeviceLog.timestamp.desc())

        # Apply pagination
        entries = query.offset(offset).limit(limit).all()

        return [
            {
                "id": e.id,
                "timestamp": e.timestamp.isoformat() + "Z",
                "level": e.level,
                "category": e.category,
                "message": e.message
            }
            for e in entries
        ]
    finally:
        if close_db and db:
            db.close()
def get_log_stats(unit_id: str, db: Optional[Session] = None) -> dict:
    """
    Get log statistics for a device.

    Returns:
        Dict with counts by level and category
    """
    close_db = False
    try:
        if db is None:
            db = SessionLocal()
            close_db = True

        total = db.query(DeviceLog).filter(DeviceLog.unit_id == unit_id).count()

        # Count by level
        level_counts = {}
        for level in ["DEBUG", "INFO", "WARNING", "ERROR"]:
            count = db.query(DeviceLog).filter(
                DeviceLog.unit_id == unit_id,
                DeviceLog.level == level
            ).count()
            if count > 0:
                level_counts[level] = count

        # Count by category
        category_counts = {}
        for category in ["TCP", "FTP", "POLL", "COMMAND", "STATE", "SYNC", "GENERAL"]:
            count = db.query(DeviceLog).filter(
                DeviceLog.unit_id == unit_id,
                DeviceLog.category == category
            ).count()
            if count > 0:
                category_counts[category] = count

        # Get oldest and newest
        oldest = db.query(DeviceLog).filter(
            DeviceLog.unit_id == unit_id
        ).order_by(DeviceLog.timestamp.asc()).first()
        newest = db.query(DeviceLog).filter(
            DeviceLog.unit_id == unit_id
        ).order_by(DeviceLog.timestamp.desc()).first()

        return {
            "total": total,
            "by_level": level_counts,
            "by_category": category_counts,
            "oldest": oldest.timestamp.isoformat() + "Z" if oldest else None,
            "newest": newest.timestamp.isoformat() + "Z" if newest else None
        }
    finally:
        if close_db and db:
            db.close()
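The file logger's format string contains `%(category)s`, so every record has to carry a `category` attribute, which `log_device_event` supplies through `extra`. The sketch below illustrates that mechanism only. It swaps the `RotatingFileHandler` for a `StreamHandler` over an in-memory buffer so it runs without touching disk, and the names `make_category_logger`, `emit`, and `device.demo-unit` are hypothetical.

```python
import io
import logging


def make_category_logger(name: str, stream: io.StringIO) -> logging.Logger:
    """Build a logger whose formatter expects a `category` field on each record."""
    log = logging.getLogger(name)
    log.setLevel(logging.DEBUG)
    log.propagate = False  # keep records out of the root logger
    if not log.handlers:  # avoid stacking duplicate handlers
        handler = logging.StreamHandler(stream)
        handler.setFormatter(
            logging.Formatter("[%(levelname)s] [%(category)s] %(message)s")
        )
        log.addHandler(handler)
    return log


def emit(log: logging.Logger, level: str, category: str, message: str) -> None:
    """Resolve the level name to a logger method, falling back to info()."""
    log_func = getattr(log, level.lower(), log.info)
    log_func(message, extra={"category": category})


buffer = io.StringIO()
demo_logger = make_category_logger("device.demo-unit", buffer)
emit(demo_logger, "WARNING", "POLL", "no response from device")
print(buffer.getvalue(), end="")
```

A record logged to this logger without `extra={"category": ...}` would fail during formatting, which is why every call site in the module goes through one function that always passes it.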
+76 -13
@@ -1,5 +1,6 @@
 import os
 import logging
+from contextlib import asynccontextmanager
 from fastapi import FastAPI, Request
 from fastapi.middleware.cors import CORSMiddleware
 from fastapi.responses import HTMLResponse
@@ -7,6 +8,7 @@ from fastapi.templating import Jinja2Templates
 from app.database import Base, engine
 from app import routers
+from app.background_poller import poller
 # Configure logging
 logging.basicConfig(
@@ -23,10 +25,54 @@ logger = logging.getLogger(__name__)
 Base.metadata.create_all(bind=engine)
 logger.info("Database tables initialized")
+@asynccontextmanager
+async def lifespan(app: FastAPI):
+    """Manage application lifecycle - startup and shutdown events."""
+    from app.services import _connection_pool
+    # Startup
+    logger.info("Starting TCP connection pool cleanup task...")
+    _connection_pool.start_cleanup()
+    logger.info("Starting background poller...")
+    await poller.start()
+    logger.info("Background poller started")
+    # Auto-start keepalive live monitors for units configured for 24/7 monitoring
+    # (monitor_enabled). This is what keeps alerting running unattended across
+    # restarts; without it a feed only runs while someone has the live view open.
+    try:
+        from app.monitor import monitor_manager
+        from app.database import SessionLocal
+        from app.models import NL43Config
+        db = SessionLocal()
+        try:
+            units = db.query(NL43Config).filter_by(monitor_enabled=True, tcp_enabled=True).all()
+            for cfg in units:
+                m = await monitor_manager.get(cfg.unit_id)
+                await m.set_keepalive(True)
+                logger.info(f"Auto-started keepalive monitor for {cfg.unit_id}")
+        finally:
+            db.close()
+    except Exception as e:
+        logger.error(f"Failed to auto-start monitors: {e}")
+    yield  # Application runs
+    # Shutdown
+    logger.info("Stopping background poller...")
+    await poller.stop()
+    logger.info("Background poller stopped")
+    logger.info("Closing TCP connection pool...")
+    await _connection_pool.close_all()
+    logger.info("TCP connection pool closed")
 app = FastAPI(
     title="SLMM NL43 Addon",
-    description="Standalone module for NL43 configuration and status APIs",
-    version="0.1.0",
+    description="Standalone module for NL43 configuration and status APIs with background polling",
+    version="0.4.0",
+    lifespan=lifespan,
 )
 # CORS configuration - use environment variable for allowed origins
@@ -49,7 +95,12 @@ app.include_router(routers.router)
 @app.get("/", response_class=HTMLResponse)
 def index(request: Request):
-    return templates.TemplateResponse("index.html", {"request": request})
+    return templates.TemplateResponse(request, "index.html")
+@app.get("/roster", response_class=HTMLResponse)
+def roster(request: Request):
+    return templates.TemplateResponse(request, "roster.html")
 @app.get("/health")
@@ -60,10 +111,14 @@ async def health():
 @app.get("/health/devices")
 async def health_devices():
-    """Enhanced health check that tests device connectivity."""
+    """Enhanced health check that tests device connectivity.
+    Uses the connection pool to avoid unnecessary TCP handshakes: if a
+    cached connection exists and is alive, the device is reachable.
+    """
     from sqlalchemy.orm import Session
     from app.database import SessionLocal
-    from app.services import NL43Client
+    from app.services import _connection_pool
     from app.models import NL43Config
     db: Session = SessionLocal()
@@ -73,7 +128,7 @@ async def health_devices():
         configs = db.query(NL43Config).filter_by(tcp_enabled=True).all()
         for cfg in configs:
-            client = NL43Client(cfg.host, cfg.tcp_port, timeout=2.0, ftp_username=cfg.ftp_username, ftp_password=cfg.ftp_password)
+            device_key = f"{cfg.host}:{cfg.tcp_port}"
             status = {
                 "unit_id": cfg.unit_id,
                 "host": cfg.host,
@@ -83,14 +138,22 @@ async def health_devices():
             }
             try:
-                # Try to connect (don't send command to avoid rate limiting issues)
-                import asyncio
-                reader, writer = await asyncio.wait_for(
-                    asyncio.open_connection(cfg.host, cfg.tcp_port), timeout=2.0
-                )
-                writer.close()
-                await writer.wait_closed()
+                # Check if pool already has a live connection (zero-cost check)
+                pool_stats = _connection_pool.get_stats()
+                conn_info = pool_stats["connections"].get(device_key)
+                if conn_info and conn_info["alive"]:
                     status["reachable"] = True
+                    status["source"] = "pool"
+                else:
+                    # No cached connection: do a lightweight acquire/release.
+                    # This opens a connection if needed but keeps it in the pool.
+                    import asyncio
+                    reader, writer, from_cache = await _connection_pool.acquire(
+                        device_key, cfg.host, cfg.tcp_port, timeout=2.0
+                    )
+                    await _connection_pool.release(device_key, reader, writer, cfg.host, cfg.tcp_port)
+                    status["reachable"] = True
+                    status["source"] = "cached" if from_cache else "new"
             except Exception as e:
                 status["error"] = str(type(e).__name__)
                 logger.warning(f"Device {cfg.unit_id} health check failed: {e}")
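The `lifespan` function in this file relies on the ordering that `asynccontextmanager` provides: code before `yield` runs at startup, code after it runs at shutdown. The sketch below shows that ordering with the standard library alone, with no FastAPI involved. The `events` list and the `main` driver are hypothetical stand-ins for the real startup and shutdown work, and the `try`/`finally` around `yield` is a variation not present in the diff, added so the shutdown step also runs if the body raises.

```python
import asyncio
from contextlib import asynccontextmanager

events: list = []


@asynccontextmanager
async def lifespan(app: object):
    # Startup phase: in the real app this starts the pool cleanup task,
    # the background poller, and the keepalive monitors.
    events.append("startup")
    try:
        yield  # the application serves requests while suspended here
    finally:
        # Shutdown phase: stop the poller, close pooled connections.
        events.append("shutdown")


async def main() -> None:
    async with lifespan(app=None):
        events.append("serving")


asyncio.run(main())
print(events)
```

Because teardown mirrors setup inside one function, resources opened during startup stay in scope for the shutdown code without module-level flags.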
+108 -1
@@ -1,4 +1,4 @@
-from sqlalchemy import Column, String, DateTime, Boolean, Integer, Text, func
+from sqlalchemy import Column, String, DateTime, Boolean, Integer, Float, Text, func
 from app.database import Base
@@ -19,6 +19,14 @@ class NL43Config(Base):
     ftp_password = Column(String, nullable=True)  # FTP login password
     web_enabled = Column(Boolean, default=False)
+    # Background polling configuration
+    poll_interval_seconds = Column(Integer, nullable=True, default=60)  # Polling interval (10-3600 seconds)
+    poll_enabled = Column(Boolean, default=True)  # Enable/disable background polling for this device
+    # Live monitor (fan-out DOD feed). Keepalive runs it 24/7 even with no viewer,
+    # which is what makes alerting continuous. On by default; toggleable from the UI.
+    monitor_enabled = Column(Boolean, default=True)
 class NL43Status(Base):
     """
@@ -37,8 +45,107 @@ class NL43Status(Base):
     lmax = Column(String, nullable=True)  # Maximum level
     lmin = Column(String, nullable=True)  # Minimum level
     lpeak = Column(String, nullable=True)  # Peak level
+    ln1 = Column(String, nullable=True)  # Percentile slot LN1 (configurable; device default L5, contract L1)
+    ln2 = Column(String, nullable=True)  # Percentile slot LN2 (configurable; device default L10)
     battery_level = Column(String, nullable=True)
     power_source = Column(String, nullable=True)
     sd_remaining_mb = Column(String, nullable=True)
     sd_free_ratio = Column(String, nullable=True)
     raw_payload = Column(Text, nullable=True)
+    # Background polling status
+    is_reachable = Column(Boolean, default=True)  # Device reachability status
+    consecutive_failures = Column(Integer, default=0)  # Count of consecutive poll failures
+    last_poll_attempt = Column(DateTime, nullable=True)  # Last time background poller attempted to poll
+    last_success = Column(DateTime, nullable=True)  # Last successful poll timestamp
+    last_error = Column(Text, nullable=True)  # Last error message (truncated to 500 chars)
+    # FTP start time sync tracking
+    start_time_sync_attempted = Column(Boolean, default=False)  # True if FTP sync was attempted for current measurement
+class DeviceLog(Base):
+    """
+    Per-device log entries for debugging and audit trail.
+    Stores events like commands, state changes, errors, and FTP operations.
+    """
+    __tablename__ = "device_logs"
+    id = Column(Integer, primary_key=True, autoincrement=True)
+    unit_id = Column(String, index=True, nullable=False)
+    timestamp = Column(DateTime, default=func.now(), index=True)
+    level = Column(String, default="INFO")  # DEBUG, INFO, WARNING, ERROR
+    category = Column(String, default="GENERAL")  # TCP, FTP, POLL, COMMAND, STATE, SYNC
+    message = Column(Text, nullable=False)
+class AlertRule(Base):
+    """A threshold-alert rule evaluated against a unit's live monitor feed.
+    Source-agnostic: today it runs over the DOD monitor; the same rule transfers
+    unchanged if a unit's feed is later sourced from FTP intervals.
+    """
+    __tablename__ = "alert_rules"
+    id = Column(Integer, primary_key=True, autoincrement=True)
+    unit_id = Column(String, index=True, nullable=False)
+    name = Column(String, nullable=False, default="Alert")
+    metric = Column(String, nullable=False, default="lp")  # lp/leq/lmax/lmin/lpeak/ln1/ln2
+    comparison = Column(String, nullable=False, default="above")  # above | below
+    threshold_db = Column(Float, nullable=False)
+    duration_s = Column(Integer, nullable=False, default=0)  # sustained seconds (0 = instant)
+    clear_margin_db = Column(Float, nullable=False, default=2.0)  # hysteresis band
+    cooldown_s = Column(Integer, nullable=False, default=300)  # min seconds between onsets
+    # Optional time-of-day scoping (local time). schedule_start/end as "HH:MM";
+    # null = always active. schedule_days = CSV of 0-6 (Mon=0); null = every day.
+    schedule_start = Column(String, nullable=True)
+    schedule_end = Column(String, nullable=True)
+    schedule_days = Column(String, nullable=True)
+    channels = Column(String, nullable=False, default="log")  # CSV: log,email,sms
+    recipients = Column(Text, nullable=True)  # CSV of emails/phones
+    enabled = Column(Boolean, default=True)
+    created_at = Column(DateTime, default=func.now())
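The schedule columns on `AlertRule` are stored as strings: `schedule_start` and `schedule_end` as "HH:MM", `schedule_days` as a CSV of weekday numbers with Monday as 0. The sketch below shows one way such a rule could be checked against a local timestamp. It is not the project's evaluator, which is not part of this diff. The function name `rule_in_schedule` is hypothetical, and two behaviours are assumptions: a window whose end is earlier than its start wraps past midnight, and the weekday test uses the day of `now` rather than the day the window opened.

```python
from datetime import datetime, time
from typing import Optional


def _parse_hhmm(value: str) -> time:
    """Parse an "HH:MM" string into a time object."""
    hours, minutes = value.split(":")
    return time(int(hours), int(minutes))


def rule_in_schedule(
    now: datetime,
    schedule_start: Optional[str],
    schedule_end: Optional[str],
    schedule_days: Optional[str],
) -> bool:
    """Return True if `now` (local time) falls inside the rule's schedule."""
    if schedule_days:
        days = {int(d) for d in schedule_days.split(",") if d.strip()}
        if now.weekday() not in days:  # datetime.weekday(): Monday == 0
            return False
    if not schedule_start or not schedule_end:
        return True  # null window means always active
    start, end = _parse_hhmm(schedule_start), _parse_hhmm(schedule_end)
    current = now.time()
    if start <= end:
        return start <= current < end
    # Assumption: a window such as 22:00-06:00 wraps past midnight.
    return current >= start or current < end
```

`datetime.weekday()` already numbers Monday as 0, so the stored CSV can be compared without any remapping.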
+class AlertEvent(Base):
+    """A fired alert (onset → clear), for history / inbox / acknowledgement."""
+    __tablename__ = "alert_events"
+    id = Column(Integer, primary_key=True, autoincrement=True)
+    rule_id = Column(Integer, index=True, nullable=False)
+    unit_id = Column(String, index=True, nullable=False)
+    rule_name = Column(String, nullable=True)
+    metric = Column(String, nullable=False)
+    threshold_db = Column(Float, nullable=False)
+    onset_at = Column(DateTime, default=func.now(), index=True)
+    onset_value = Column(Float, nullable=True)
+    peak_value = Column(Float, nullable=True)
+    clear_at = Column(DateTime, nullable=True)
+    status = Column(String, default="active")  # active | cleared
+    acknowledged_at = Column(DateTime, nullable=True)
+    acknowledged_by = Column(String, nullable=True)
+    notes = Column(Text, nullable=True)
+class NL43Reading(Base):
+    """Downsampled time-series of live-monitor readings, for the live-chart
+    backfill (so a viewer sees recent trend on open, not a blank chart).
+    Viewing only, NOT the report source. Reports use the device's authoritative
+    FTP .rnd intervals. This is a short, capped trail (one row/minute, pruned to
+    a retention window) fed by the monitor's keepalive poll loop.
+    """
+    __tablename__ = "nl43_readings"
+    id = Column(Integer, primary_key=True, autoincrement=True)
+    unit_id = Column(String, index=True, nullable=False)
+    timestamp = Column(DateTime, default=func.now(), index=True)
+    lp = Column(String, nullable=True)
+    leq = Column(String, nullable=True)
+    lmax = Column(String, nullable=True)
+    ln1 = Column(String, nullable=True)
+    ln2 = Column(String, nullable=True)
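`AlertRule.clear_margin_db` is described as a hysteresis band: an alert should not clear the moment the level dips back under the threshold. The sketch below illustrates that idea for a single rule. It is an assumption about the semantics and not the project's evaluator: it ignores `duration_s` and `cooldown_s`, the strictness of the comparisons is a guess, and the names `HysteresisState` and `step` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class HysteresisState:
    active: bool = False


def step(state: HysteresisState, value: float, threshold_db: float,
         clear_margin_db: float = 2.0, comparison: str = "above") -> str:
    """Advance one sample and return "onset", "clear", or "none"."""
    if comparison == "above":
        breached = value > threshold_db
        cleared = value < threshold_db - clear_margin_db
    else:  # "below"
        breached = value < threshold_db
        cleared = value > threshold_db + clear_margin_db
    if not state.active and breached:
        state.active = True
        return "onset"
    if state.active and cleared:
        state.active = False
        return "clear"
    return "none"  # inside the band, or no change


state = HysteresisState()
transitions = [step(state, v, threshold_db=70.0) for v in (68.0, 71.0, 69.5, 68.5, 67.9)]
print(transitions)
```

With a 70 dB threshold and the default 2 dB margin, the readings 69.5 and 68.5 stay inside the band, so the alert holds until the level drops below 68 dB.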
+322
@@ -0,0 +1,322 @@
"""
Per-device live monitor (fan-out hub).

ONE DOD poll loop per device, broadcast to many subscribers:
  - browser WebSocket clients (live view): they no longer each open their own
    device stream, so the NL43's single-connection limit stops causing the
    "second viewer sees nothing" contention.
  - the alert evaluator (threshold alerts), which can keep a device's feed running
    even with no browser attached.
  - persistence (each snapshot is written to NL43Status, like the poller does).

The device's one TCP connection is respected: every poll goes through the same
per-device lock + connection pool in services.py, so the monitor, the background
poller, and on-demand commands all serialize safely.
"""
import asyncio
import logging
import os
from datetime import datetime
from typing import Dict, Optional, Set
from app.database import SessionLocal
from app.models import NL43Config, NL43Status
from app.services import NL43Client, persist_snapshot
from app.alerts import alert_evaluator
logger = logging.getLogger(__name__)
# Extra idle between DOD polls WHEN A BROWSER IS WATCHING. The 1s device rate-limit
# already paces consecutive DOD? commands, so this just needs to be small — the
# rate-limit is the real floor (~1.25s/poll effective).
MONITOR_POLL_INTERVAL = float(os.getenv("MONITOR_POLL_INTERVAL", "0.25"))
# Idle cadence when NO browser is subscribed and the feed is only kept alive for
# alerting. Same data, ~8x fewer polls -> ~8x less cellular traffic on a metered
# SIM (~1 GB/device/month at full rate -> ~125 MB). NOTE: this also sets the alert
# sampling resolution when nobody is watching, so keep it <= the smallest alert
# duration_s you rely on (default 10s comfortably catches a "sustained 30/60s" rule).
MONITOR_IDLE_POLL_INTERVAL = float(os.getenv("MONITOR_IDLE_POLL_INTERVAL", "10"))
# Exponential backoff once the device is unreachable, so a powered-off / asleep /
# out-of-signal device stops churning reconnects every cycle (log spam + a trickle
# of wasted cellular data on failed SYNs). delay = min(BASE * 2**(fails-1), MAX),
# reset to full-rate on the first good poll. While a browser is actively watching we
# cap the backoff lower (WATCHED_MAX) so a recovery surfaces quickly for the viewer.
MONITOR_BACKOFF_BASE_S = float(os.getenv("MONITOR_BACKOFF_BASE_S", "1"))
MONITOR_BACKOFF_MAX_S = float(os.getenv("MONITOR_BACKOFF_MAX_S", "60"))
MONITOR_BACKOFF_WATCHED_MAX_S = float(os.getenv("MONITOR_BACKOFF_WATCHED_MAX_S", "5"))
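The backoff formula in the comment, `delay = min(BASE * 2**(fails-1), MAX)` with a lower cap while a browser is watching, produces the schedule below. This sketch restates the formula as a standalone function using the default values of the three environment variables. The name `backoff_delay` is hypothetical, and the exponent cap of 6 follows the `_next_delay` method later in this file.

```python
def backoff_delay(consec_fail: int, watched: bool,
                  base_s: float = 1.0, max_s: float = 60.0,
                  watched_max_s: float = 5.0) -> float:
    """Delay before the next poll after `consec_fail` (>= 1) failed polls."""
    shift = min(consec_fail - 1, 6)  # cap growth at 2**6 = 64x base
    delay = min(base_s * (2 ** shift), max_s)
    if watched:
        # A viewer is connected: retry sooner so recovery shows up quickly.
        delay = min(delay, watched_max_s)
    return delay


idle_schedule = [backoff_delay(n, watched=False) for n in range(1, 9)]
watched_schedule = [backoff_delay(n, watched=True) for n in range(1, 6)]
print(idle_schedule)
print(watched_schedule)
```

With the defaults, an unwatched device settles at one retry per minute after seven consecutive failures, while a watched one never waits more than five seconds.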
# How often to refresh the run state (Measure?). It changes rarely, so we cache it
# and skip that second rate-limited command on most polls — roughly halving the
# per-update latency (~2.5s -> ~1.3s).
MONITOR_STATE_REFRESH_S = float(os.getenv("MONITOR_STATE_REFRESH_S", "30"))
# Downsampled trail for the live-chart backfill: store one reading per
# TRAIL_SAMPLE_S and keep TRAIL_RETENTION_HOURS of it (pruned). Viewing only —
# reports use the device's FTP .rnd data, not this.
TRAIL_SAMPLE_S = float(os.getenv("MONITOR_TRAIL_SAMPLE_S", "60"))
TRAIL_RETENTION_HOURS = float(os.getenv("MONITOR_TRAIL_RETENTION_HOURS", "24"))
# If nothing has been broadcast in this many seconds (e.g. device offline and
# silent), send a keepalive frame so reverse proxies don't drop the idle WS.
MONITOR_HEARTBEAT_S = float(os.getenv("MONITOR_HEARTBEAT_S", "25"))
def _snapshot_payload(snap, unit_id: str, measurement_start_time) -> dict:
    """Build the broadcast payload: same shape as the DRD stream, but DOD-sourced
    so it carries ln1/ln2 (which DRD cannot)."""
    return {
        "unit_id": unit_id,
        "timestamp": datetime.utcnow().isoformat(),
        "measurement_state": snap.measurement_state,
        "measurement_start_time": measurement_start_time,
        "counter": snap.counter,
        "lp": snap.lp,
        "leq": snap.leq,
        "lmax": snap.lmax,
        "lmin": snap.lmin,
        "lpeak": snap.lpeak,
        "ln1": snap.ln1,
        "ln2": snap.ln2,
        "raw_payload": snap.raw_payload,
    }
class DeviceMonitor:
    """Owns a single DOD poll loop for one device and fans each snapshot out to
    all subscribers. Runs while it has at least one browser subscriber OR the
    server-side keep-alive (alerting) flag is set."""

    def __init__(self, unit_id: str):
        self.unit_id = unit_id
        self._subscribers: Set[asyncio.Queue] = set()
        self._keepalive = False
        self._task: Optional[asyncio.Task] = None
        self._lock = asyncio.Lock()
        self._last_payload: Optional[dict] = None  # replayed to new subscribers
        self._consec_fail = 0
        self._reachable = True  # last broadcast reachability (for transition frames)
        self._cached_state: Optional[str] = None  # run state, refreshed periodically
        self._last_state_refresh = 0.0
        self._last_trail_store = 0.0  # downsample throttle for the backfill trail

    @property
    def running(self) -> bool:
        return self._task is not None and not self._task.done()

    def subscriber_count(self) -> int:
        return len(self._subscribers)

    def _has_demand(self) -> bool:
        return bool(self._subscribers) or self._keepalive

    def _ensure_task(self) -> None:
        if self._task is None or self._task.done():
            self._task = asyncio.create_task(self._run())

    async def subscribe(self) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue(maxsize=5)
        async with self._lock:
            self._subscribers.add(q)
            # Replay the last frame so a client connecting mid-stream sees data
            # (or the current 'unreachable' state) immediately, not after a poll.
            if self._last_payload is not None:
                try:
                    q.put_nowait(self._last_payload)
                except asyncio.QueueFull:
                    pass
            self._ensure_task()
        return q

    async def unsubscribe(self, q: asyncio.Queue) -> None:
        async with self._lock:
            self._subscribers.discard(q)

    async def set_keepalive(self, on: bool) -> None:
        async with self._lock:
            self._keepalive = on
            if on:
                self._ensure_task()
    async def _run(self) -> None:
        logger.info(f"[MONITOR] {self.unit_id}: feed started")
        loop = asyncio.get_running_loop()
        last_send = loop.time()
        try:
            while self._has_demand():
                snap, mst = await self._poll_once()
                if snap is not None:
                    if not self._reachable:
                        # Recovered from an outage: clear the connectivity alert.
                        try:
                            await alert_evaluator.device_online(self.unit_id)
                        except Exception as e:
                            logger.warning(f"[MONITOR] {self.unit_id}: online alert failed: {e}")
                    self._consec_fail = 0
                    self._reachable = True
                    payload = _snapshot_payload(snap, self.unit_id, mst)
                    payload["feed_status"] = "ok"
                    self._broadcast(payload)
                    last_send = loop.time()
                    try:
                        await alert_evaluator.evaluate(self.unit_id, snap)
                    except Exception as e:
                        logger.warning(f"[MONITOR] {self.unit_id}: alert eval failed: {e}")
                else:
                    # Tell clients the device went offline: once, on transition, after a
                    # few failures so a momentary blip doesn't flap the UI. Same edge
                    # raises the device-offline alert.
                    self._consec_fail += 1
                    if self._reachable and self._consec_fail >= 3:
                        self._reachable = False
                        self._broadcast({
                            "unit_id": self.unit_id,
                            "timestamp": datetime.utcnow().isoformat(),
                            "feed_status": "unreachable",
                        })
                        last_send = loop.time()
                        try:
                            await alert_evaluator.device_offline(self.unit_id)
                        except Exception as e:
                            logger.warning(f"[MONITOR] {self.unit_id}: offline alert failed: {e}")

                # Heartbeat: during quiet/offline stretches, send a keepalive so an
                # idle WS isn't dropped by a reverse proxy. Not cached (new subscribers
                # should still get the last real frame, not a heartbeat).
                if loop.time() - last_send >= MONITOR_HEARTBEAT_S:
                    self._broadcast({
                        "unit_id": self.unit_id,
                        "timestamp": datetime.utcnow().isoformat(),
                        "feed_status": "ok" if self._reachable else "unreachable",
                        "heartbeat": True,
                    }, cache=False)
                    last_send = loop.time()

                await asyncio.sleep(self._next_delay())
        finally:
            logger.info(f"[MONITOR] {self.unit_id}: feed stopped")
    def _next_delay(self) -> float:
        """Inter-poll delay: exponential backoff while unreachable, full-rate while a
        browser is watching, relaxed cadence when the feed is keepalive-only."""
        if self._consec_fail > 0:
            shift = min(self._consec_fail - 1, 6)  # cap growth at 2**6 = 64x base
            delay = min(MONITOR_BACKOFF_BASE_S * (2 ** shift), MONITOR_BACKOFF_MAX_S)
            if self._subscribers:
                delay = min(delay, MONITOR_BACKOFF_WATCHED_MAX_S)
            return delay
        if self._subscribers:
            return MONITOR_POLL_INTERVAL  # a browser is watching: smooth chart
        return MONITOR_IDLE_POLL_INTERVAL  # keepalive-only (alerting): save data

    async def _poll_once(self):
        """One DOD poll: read, persist, return (snapshot, measurement_start_iso)."""
        db = SessionLocal()
        try:
            cfg = db.query(NL43Config).filter_by(unit_id=self.unit_id).first()
            if not cfg or not cfg.tcp_enabled:
                return None, None

            client = NL43Client(
                cfg.host, cfg.tcp_port,
                ftp_username=cfg.ftp_username, ftp_password=cfg.ftp_password,
                ftp_port=cfg.ftp_port or 21,
            )

            # Refresh the run state only every MONITOR_STATE_REFRESH_S; reuse the
            # cached state otherwise so most polls send just DOD? (one rate-limited
            # command) instead of DOD? + Measure?.
            now = asyncio.get_running_loop().time()
            refresh_state = (self._cached_state is None
                             or now - self._last_state_refresh >= MONITOR_STATE_REFRESH_S)
            snap = await client.request_dod(
                measurement_state=None if refresh_state else self._cached_state
            )
            if refresh_state:
                self._cached_state = snap.measurement_state
                self._last_state_refresh = now

            snap.unit_id = self.unit_id
            persist_snapshot(snap, db)
            db.commit()

            # Append to the downsampled backfill trail (~one row per TRAIL_SAMPLE_S).
            if now - self._last_trail_store >= TRAIL_SAMPLE_S:
                self._last_trail_store = now
                self._store_trail(snap, db)

            status = db.query(NL43Status).filter_by(unit_id=self.unit_id).first()
            mst = (status.measurement_start_time.isoformat()
                   if status and status.measurement_start_time else None)
            return snap, mst
        except Exception as e:
            logger.warning(f"[MONITOR] {self.unit_id}: poll failed: {e}")
            return None, None
        finally:
            db.close()

    def _store_trail(self, snap, db) -> None:
        """Append one downsampled reading to the backfill trail and prune old rows."""
        from datetime import datetime, timedelta
        from app.models import NL43Reading
        try:
            db.add(NL43Reading(
                unit_id=self.unit_id, timestamp=datetime.utcnow(),
                lp=snap.lp, leq=snap.leq, lmax=snap.lmax, ln1=snap.ln1, ln2=snap.ln2,
            ))
            cutoff = datetime.utcnow() - timedelta(hours=TRAIL_RETENTION_HOURS)
            db.query(NL43Reading).filter(
                NL43Reading.unit_id == self.unit_id,
                NL43Reading.timestamp < cutoff,
            ).delete()
            db.commit()
        except Exception as e:
            logger.warning(f"[MONITOR] {self.unit_id}: trail store failed: {e}")

    def _broadcast(self, payload: dict, cache: bool = True) -> None:
        if cache:
            self._last_payload = payload  # replayed to new subscribers
        for q in list(self._subscribers):
            try:
                q.put_nowait(payload)
            except asyncio.QueueFull:
                # Slow consumer: drop this frame rather than stall the whole feed.
                pass
class MonitorManager:
    """Registry of per-device monitors (one per unit_id)."""

    def __init__(self):
        self._monitors: Dict[str, DeviceMonitor] = {}
        self._lock = asyncio.Lock()

    async def get(self, unit_id: str) -> DeviceMonitor:
        async with self._lock:
            m = self._monitors.get(unit_id)
            if m is None:
                m = DeviceMonitor(unit_id)
                self._monitors[unit_id] = m
            return m

    def is_active(self, unit_id: str) -> bool:
        """True if this unit has a running monitor feed (so the background poller
        can skip it; the monitor already polls it more often)."""
        m = self._monitors.get(unit_id)
        return m is not None and m.running

    def status(self) -> dict:
        return {
            uid: {
                "running": m.running,
                "subscribers": m.subscriber_count(),
                "keepalive": m._keepalive,
                "reachable": m._reachable,
                # what cadence the loop is currently using, for observability
                "mode": ("backoff" if m._consec_fail > 0
                         else "watched" if m._subscribers
                         else "idle"),
            }
            for uid, m in self._monitors.items()
        }


# Module-level singleton
monitor_manager = MonitorManager()
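The fan-out pattern in `DeviceMonitor` comes down to three rules: each subscriber gets its own bounded queue, a new subscriber is handed the latest frame immediately, and a full queue drops the frame instead of blocking the producer. The sketch below isolates those rules without any device I/O. The `FanOut` class and the drop counter are hypothetical additions for illustration; the real `_broadcast` does not count drops.

```python
import asyncio
from typing import Optional, Set


class FanOut:
    """One producer, many bounded subscriber queues."""

    def __init__(self) -> None:
        self._subscribers: Set[asyncio.Queue] = set()
        self._last: Optional[dict] = None

    def subscribe(self, maxsize: int = 5) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue(maxsize=maxsize)
        self._subscribers.add(q)
        if self._last is not None:
            q.put_nowait(self._last)  # replay the latest frame to a late joiner
        return q

    def broadcast(self, payload: dict) -> int:
        """Deliver to every subscriber; return how many deliveries were dropped."""
        self._last = payload
        dropped = 0
        for q in list(self._subscribers):
            try:
                q.put_nowait(payload)
            except asyncio.QueueFull:
                dropped += 1  # slow consumer: skip it rather than stall the feed
        return dropped


async def demo() -> tuple:
    hub = FanOut()
    fast, slow = hub.subscribe(), hub.subscribe(maxsize=1)
    drops = [hub.broadcast({"seq": i}) for i in range(3)]
    received = [fast.get_nowait()["seq"] for _ in range(fast.qsize())]
    return drops, received, slow.get_nowait()["seq"]


result = asyncio.run(demo())
print(result)
```

The slow subscriber keeps only the first frame it has not yet read, while the fast one sees every frame, so one stalled browser tab cannot delay alert evaluation or other viewers.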
+1025 -62
File diff suppressed because it is too large
+954 -91
File diff suppressed because it is too large
+67
@@ -0,0 +1,67 @@
# SLMM Archive
This directory contains legacy scripts that are no longer needed for normal operation but are preserved for reference.
## Legacy Migrations (`legacy_migrations/`)
These migration scripts were used during SLMM development (v0.1.x) to incrementally add database fields. They are **no longer needed** because:
1. **Fresh databases** get the complete schema automatically from `app/models.py`
2. **Existing databases** should already have these fields from previous runs
3. **Current migration** is `migrate_add_polling_fields.py` (v0.2.0) in the parent directory
### Archived Migration Files
- `migrate_add_counter.py` - Added `counter` field to NL43Status
- `migrate_add_measurement_start_time.py` - Added `measurement_start_time` field
- `migrate_add_ftp_port.py` - Added `ftp_port` field to NL43Config
- `migrate_field_names.py` - Renamed fields for consistency (one-time fix)
- `migrate_revert_field_names.py` - Rollback for the rename migration
**Do not delete** - These provide historical context for database schema evolution.
---
## Legacy Tools
### `nl43_dod_poll.py`
Manual polling script that queries a single NL-43 device for DOD (Device On-Demand) data.
**Status**: Replaced by background polling system in v0.2.0
**Why archived**:
- Background poller (`app/background_poller.py`) now handles continuous polling automatically
- No need for manual polling scripts
- Kept for reference in case manual querying is needed for debugging
**How to use** (if needed):
```bash
cd /home/serversdown/tmi/slmm/archive
python3 nl43_dod_poll.py <host> <port> <unit_id>
```
---
## Active Scripts (Still in Parent Directory)
These scripts are **actively used** and documented in the main README:
### Migrations
- `migrate_add_polling_fields.py` - **v0.2.0 migration** - Adds background polling fields
- `migrate_add_ftp_credentials.py` - **Legacy FTP migration** - Adds FTP auth fields
### Testing
- `test_polling.sh` - Comprehensive test suite for background polling features
- `test_settings_endpoint.py` - Tests device settings API
- `test_sleep_mode_auto_disable.py` - Tests automatic sleep mode handling
### Utilities
- `set_ftp_credentials.py` - Command-line tool to set FTP credentials for a device
---
## Version History
- **v0.2.0** (2026-01-15) - Background polling system added, manual polling scripts archived
- **v0.1.0** (2025-12-XX) - Initial release with incremental migrations
+1 -1
@@ -483,7 +483,7 @@ POST /{unit_id}/ftp/enable
 ```
 Enables FTP server on the device.
-**Note:** FTP and TCP are mutually exclusive. Enabling FTP will temporarily disable TCP control.
+**Note:** ~~FTP and TCP are mutually exclusive. Enabling FTP will temporarily disable TCP control.~~ As of v0.2.0, FTP and TCP are working fine in tandem. Just don't spam them a bunch.
 ### Disable FTP
 ```
+246
@@ -0,0 +1,246 @@
# SLMM Roster Management
The SLMM standalone application now includes a roster management interface for viewing and configuring all Sound Level Meter devices.
## Features
### Web Interface
Access the roster at: **http://localhost:8100/roster**
The roster page provides:
- **Device List Table**: View all configured SLMs with their connection details
- **Real-time Status**: See device connectivity status (Online/Offline/Stale)
- **Add Device**: Create new device configurations with a user-friendly modal form
- **Edit Device**: Modify existing device configurations
- **Delete Device**: Remove device configurations (does not affect physical devices)
- **Test Connection**: Run diagnostics on individual devices
### Table Columns
| Column | Description |
|--------|-------------|
| Unit ID | Unique identifier for the device |
| Host / IP | Device IP address or hostname |
| TCP Port | TCP control port (default: 2255) |
| FTP Port | FTP file transfer port (default: 21) |
| TCP | Whether TCP control is enabled |
| FTP | Whether FTP file transfer is enabled |
| Polling | Whether background polling is enabled |
| Status | Device connectivity status (Online/Offline/Stale) |
| Actions | Test, Edit, Delete buttons |
### Status Indicators
- **Online** (green): Device responded within the last 5 minutes
- **Stale** (yellow): Device hasn't responded recently but was seen before
- **Offline** (red): Device is unreachable or has consecutive failures
- **Unknown** (gray): No status data available yet
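The four indicators above amount to a small decision rule. The following is a minimal Python sketch of that rule, not the roster's actual implementation. It assumes the `status` object returned by the roster API, derives Online/Stale from `last_success`, and treats `is_reachable == false` as Offline. The function name `classify_status` is hypothetical, and the number of consecutive failures that marks a device Offline is not stated in this document, so it is not modelled here.

```python
from datetime import datetime, timedelta
from typing import Optional

# "Online" means the device responded within the last 5 minutes.
ONLINE_WINDOW = timedelta(minutes=5)


def classify_status(status: Optional[dict], now: datetime) -> str:
    """Map a roster `status` object to Online / Stale / Offline / Unknown.

    Assumption: timestamps are naive ISO-8601 strings, as in the sample
    response in this document, and `now` is a naive datetime in the same zone.
    """
    if not status:
        return "Unknown"  # no status data available yet
    if status.get("is_reachable") is False:
        return "Offline"  # device marked unreachable by polling
    last_success = status.get("last_success")
    if not last_success:
        return "Unknown"  # never seen successfully
    age = now - datetime.fromisoformat(last_success)
    return "Online" if age < ONLINE_WINDOW else "Stale"


if __name__ == "__main__":
    sample = {"is_reachable": True, "last_success": "2026-01-16T20:00:00"}
    print(classify_status(sample, datetime(2026, 1, 16, 20, 3, 0)))
```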
## API Endpoints
### List All Devices
```bash
GET /api/nl43/roster
```
Returns all configured devices with their status information.
**Response:**
```json
{
  "status": "ok",
  "devices": [
    {
      "unit_id": "SLM-43-01",
      "host": "192.168.1.100",
      "tcp_port": 2255,
      "ftp_port": 21,
      "tcp_enabled": true,
      "ftp_enabled": true,
      "ftp_username": "USER",
      "ftp_password": "0000",
      "web_enabled": false,
      "poll_enabled": true,
      "poll_interval_seconds": 60,
      "status": {
        "last_seen": "2026-01-16T20:00:00",
        "measurement_state": "Start",
        "is_reachable": true,
        "consecutive_failures": 0,
        "last_success": "2026-01-16T20:00:00",
        "last_error": null
      }
    }
  ],
  "total": 1
}
```
### Create New Device
```bash
POST /api/nl43/roster
Content-Type: application/json
{
  "unit_id": "SLM-43-01",
  "host": "192.168.1.100",
  "tcp_port": 2255,
  "ftp_port": 21,
  "tcp_enabled": true,
  "ftp_enabled": false,
  "poll_enabled": true,
  "poll_interval_seconds": 60
}
```
**Required Fields:**
- `unit_id`: Unique device identifier
- `host`: IP address or hostname
**Optional Fields:**
- `tcp_port`: TCP control port (default: 2255)
- `ftp_port`: FTP port (default: 21)
- `tcp_enabled`: Enable TCP control (default: true)
- `ftp_enabled`: Enable FTP transfers (default: false)
- `ftp_username`: FTP username (only if ftp_enabled)
- `ftp_password`: FTP password (only if ftp_enabled)
- `poll_enabled`: Enable background polling (default: true)
- `poll_interval_seconds`: Polling interval 10-3600 seconds (default: 60)
**Response:**
```json
{
  "status": "ok",
  "message": "Device SLM-43-01 created successfully",
  "data": {
    "unit_id": "SLM-43-01",
    "host": "192.168.1.100",
    "tcp_port": 2255,
    "tcp_enabled": true,
    "ftp_enabled": false,
    "poll_enabled": true,
    "poll_interval_seconds": 60
  }
}
```
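For callers that script device creation rather than using the web form, the request above can be assembled with the Python standard library. This is a hedged sketch: `BASE_URL` assumes the standalone address used elsewhere in this document, and `build_create_request` is a hypothetical helper name. It only builds the request, and the commented lines show how it might be sent to a running SLMM instance.

```python
import json
import urllib.request

# Assumption: SLMM standalone listening on the address used in this document.
BASE_URL = "http://localhost:8100"


def build_create_request(unit_id: str, host: str, **optional) -> urllib.request.Request:
    """Build a POST /api/nl43/roster request.

    `unit_id` and `host` are the required fields; any optional field
    (tcp_port, ftp_enabled, poll_interval_seconds, ...) is passed as a keyword.
    """
    payload = {"unit_id": unit_id, "host": host, **optional}
    return urllib.request.Request(
        f"{BASE_URL}/api/nl43/roster",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_create_request("SLM-43-01", "192.168.1.100", poll_interval_seconds=60)
    print(req.get_method(), req.full_url)
    # Sending needs a running SLMM instance:
    # with urllib.request.urlopen(req) as resp:
    #     print(json.load(resp)["message"])
```

Omitted optional fields are left out of the body, so the server-side defaults listed above apply.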
### Update Device
```bash
PUT /api/nl43/{unit_id}/config
Content-Type: application/json
{
  "host": "192.168.1.101",
  "tcp_port": 2255,
  "poll_interval_seconds": 120
}
```
All fields are optional. Only include fields you want to update.
### Delete Device
```bash
DELETE /api/nl43/{unit_id}/config
```
Removes the device configuration and associated status data. Does not affect the physical device.
**Response:**
```json
{
  "status": "ok",
  "message": "Deleted device SLM-43-01"
}
```
## Usage Examples
### Via Web Interface
1. Navigate to http://localhost:8100/roster
2. Click "Add Device" to create a new configuration
3. Fill in the device details (unit ID, IP address, ports)
4. Configure TCP, FTP, and polling settings
5. Click "Save Device"
6. Use "Test" button to verify connectivity
7. Edit or delete devices as needed
### Via API (curl)
**Add a new device:**
```bash
curl -X POST http://localhost:8100/api/nl43/roster \
  -H "Content-Type: application/json" \
  -d '{
    "unit_id": "slm-site-a",
    "host": "192.168.1.100",
    "tcp_port": 2255,
    "tcp_enabled": true,
    "ftp_enabled": true,
    "ftp_username": "USER",
    "ftp_password": "0000",
    "poll_enabled": true,
    "poll_interval_seconds": 60
  }'
```
**Update device host:**
```bash
curl -X PUT http://localhost:8100/api/nl43/slm-site-a/config \
  -H "Content-Type: application/json" \
  -d '{"host": "192.168.1.101"}'
```
**Delete device:**
```bash
curl -X DELETE http://localhost:8100/api/nl43/slm-site-a/config
```
**List all devices:**
```bash
curl http://localhost:8100/api/nl43/roster | python3 -m json.tool
```
## Integration with Terra-View
When SLMM is used as a module within Terra-View:
1. Terra-View manages device configurations in its own database
2. Terra-View syncs configurations to SLMM via `PUT /api/nl43/{unit_id}/config`
3. Terra-View can query device status via `GET /api/nl43/{unit_id}/status`
4. SLMM's roster page can be used for standalone testing and diagnostics
## Background Polling
Devices with `poll_enabled: true` are automatically polled at their configured interval:
- Polls device status every `poll_interval_seconds` (10-3600 seconds)
- Updates `NL43Status` table with latest measurements
- Tracks device reachability and failure counts
- Provides real-time status updates in the roster
**Note**: Polling respects the NL43 protocol's 1-second rate limit between commands.
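One way to picture the 1-second spacing is a small pacer that delays a command until enough time has passed since the previous one. The sketch below illustrates that idea only; it is not SLMM's polling code. `CommandPacer` and `wait_turn` are hypothetical names, and the clock and sleep functions are injectable so the behaviour can be checked without real delays.

```python
import time


class CommandPacer:
    """Enforce a minimum gap between consecutive commands to one device."""

    def __init__(self, min_gap_s=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_gap_s = min_gap_s
        self._clock = clock
        self._sleep = sleep
        self._last_sent = None  # time of the previous command, if any

    def wait_turn(self) -> float:
        """Block until a command may be sent; return the seconds waited."""
        waited = 0.0
        if self._last_sent is not None:
            remaining = self.min_gap_s - (self._clock() - self._last_sent)
            if remaining > 0:
                self._sleep(remaining)
                waited = remaining
        # Record the send time after any delay has elapsed.
        self._last_sent = self._clock()
        return waited


if __name__ == "__main__":
    pacer = CommandPacer()
    for command in ("first", "second"):
        delay = pacer.wait_turn()
        print(f"{command}: waited {delay:.2f}s before sending")
```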
## Validation
The roster system validates:
- **Unit ID**: Must be unique across all devices
- **Host**: Valid IP address or hostname format
- **Ports**: Must be between 1-65535
- **Poll Interval**: Must be between 10-3600 seconds
- **Duplicate Check**: Returns 409 Conflict if unit_id already exists
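The numeric rules above are simple range checks. A minimal Python sketch of them follows, assuming integer field values; `validate_device_config` is a hypothetical name rather than the function SLMM uses, and host-format validation is left out because its exact rules are not described here.

```python
def validate_device_config(cfg: dict, existing_unit_ids=()) -> list:
    """Return a list of human-readable validation errors (empty if valid)."""
    errors = []

    unit_id = cfg.get("unit_id")
    if not unit_id:
        errors.append("unit_id is required")
    elif unit_id in existing_unit_ids:
        # The API reports this case as 409 Conflict.
        errors.append(f"unit_id {unit_id!r} already exists")

    if not cfg.get("host"):
        errors.append("host is required")

    # Ports are optional, but must be within 1-65535 when given.
    for field in ("tcp_port", "ftp_port"):
        port = cfg.get(field)
        if port is not None and not 1 <= port <= 65535:
            errors.append(f"{field} must be between 1 and 65535")

    # Poll interval is optional, but must be within 10-3600 seconds when given.
    interval = cfg.get("poll_interval_seconds")
    if interval is not None and not 10 <= interval <= 3600:
        errors.append("poll_interval_seconds must be between 10 and 3600")

    return errors


if __name__ == "__main__":
    print(validate_device_config({"unit_id": "slm-site-a", "host": "192.168.1.100", "tcp_port": 70000}))
```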
## Notes
- Deleting a device from the roster does NOT affect the physical device
- Device configurations are stored in the SLMM database (`data/slmm.db`)
- Status information is updated by the background polling system
- The roster page auto-refreshes status indicators
- Test button runs full diagnostics (connectivity, TCP, FTP if enabled)
+26
@@ -0,0 +1,26 @@
# SLMM Feature Documentation
This directory contains detailed documentation for specific SLMM features and enhancements.
## Feature Documents
### FEATURE_SUMMARY.md
Overview of all major features in SLMM.
### SETTINGS_ENDPOINT.md
Documentation of the device settings endpoint and verification system.
### TIMEZONE_CONFIGURATION.md
Timezone handling and configuration for SLMM timestamps.
### SLEEP_MODE_AUTO_DISABLE.md
Automatic sleep mode wake-up system for background polling.
### UI_UPDATE.md
UI/UX improvements and interface updates.
## Related Documentation
- [../README.md](../../README.md) - Main SLMM documentation
- [../CHANGELOG.md](../../CHANGELOG.md) - Version history
- [../API.md](../../API.md) - Complete API reference
+73
@@ -0,0 +1,73 @@
#!/usr/bin/env python3
"""
Database migration: Add device_logs table.
This table stores per-device log entries for debugging and audit trail.
Run this once to add the new table.
"""
import sqlite3
import os

# Path to the SLMM database
DB_PATH = os.path.join(os.path.dirname(__file__), "data", "slmm.db")


def migrate():
    print(f"Adding device_logs table to: {DB_PATH}")

    if not os.path.exists(DB_PATH):
        print("Database does not exist yet. Table will be created automatically on first run.")
        return

    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()

    try:
        # Check if table already exists
        cursor.execute("""
            SELECT name FROM sqlite_master
            WHERE type='table' AND name='device_logs'
        """)
        if cursor.fetchone():
            print("✓ device_logs table already exists, no migration needed")
            return

        # Create the table
        print("Creating device_logs table...")
        cursor.execute("""
            CREATE TABLE device_logs (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                unit_id VARCHAR NOT NULL,
                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
                level VARCHAR DEFAULT 'INFO',
                category VARCHAR DEFAULT 'GENERAL',
                message TEXT NOT NULL
            )
        """)

        # Create indexes for efficient querying
        print("Creating indexes...")
        cursor.execute("CREATE INDEX ix_device_logs_unit_id ON device_logs (unit_id)")
        cursor.execute("CREATE INDEX ix_device_logs_timestamp ON device_logs (timestamp)")

        conn.commit()
        print("✓ Created device_logs table with indexes")

        # Verify
        cursor.execute("""
            SELECT name FROM sqlite_master
            WHERE type='table' AND name='device_logs'
        """)
        if not cursor.fetchone():
            raise Exception("device_logs table was not created successfully")

        print("✓ Migration completed successfully")

    finally:
        conn.close()


if __name__ == "__main__":
    migrate()
+58
@@ -0,0 +1,58 @@
#!/usr/bin/env python3
"""
Migration script to add ln1 and ln2 percentile columns to the nl43_status table.
The NL-43 DOD response carries percentile slots LN1-LN5; the live SLM display
(Terra-View) shows two of them (default L1/L10). This adds storage for the two
surfaced slots. Run once per database to update existing schema.
"""
import sqlite3
import sys
from pathlib import Path

DB_PATH = Path(__file__).parent / "data" / "slmm.db"


def migrate():
    """Add ln1 and ln2 columns to the nl43_status table."""
    if not DB_PATH.exists():
        print(f"Database not found at {DB_PATH}")
        print("No migration needed - database will be created with new schema")
        return

    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()

    try:
        cursor.execute("PRAGMA table_info(nl43_status)")
        columns = [row[1] for row in cursor.fetchall()]

        if "ln1" in columns and "ln2" in columns:
            print("✓ ln1/ln2 columns already exist, no migration needed")
            return

        if "ln1" not in columns:
            print("Adding ln1 column...")
            cursor.execute("ALTER TABLE nl43_status ADD COLUMN ln1 TEXT")
            print("✓ Added ln1 column")

        if "ln2" not in columns:
            print("Adding ln2 column...")
            cursor.execute("ALTER TABLE nl43_status ADD COLUMN ln2 TEXT")
            print("✓ Added ln2 column")

        conn.commit()
        print("\n✓ Migration completed successfully!")

    except Exception as e:
        conn.rollback()
        print(f"✗ Migration failed: {e}", file=sys.stderr)
        sys.exit(1)
    finally:
        conn.close()


if __name__ == "__main__":
    migrate()
+48
@@ -0,0 +1,48 @@
#!/usr/bin/env python3
"""
Migration: add monitor_enabled column to nl43_config.
Controls whether the live fan-out DOD monitor is kept alive 24/7 for a unit
(which is what makes alerting continuous). Defaults to enabled. Run once per DB.
"""
import sqlite3
import sys
from pathlib import Path

DB_PATH = Path(__file__).parent / "data" / "slmm.db"


def migrate():
    if not DB_PATH.exists():
        print(f"Database not found at {DB_PATH}")
        print("No migration needed - database will be created with new schema")
        return

    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()

    try:
        cursor.execute("PRAGMA table_info(nl43_config)")
        columns = [row[1] for row in cursor.fetchall()]

        if "monitor_enabled" in columns:
            print("✓ monitor_enabled column already exists, no migration needed")
            return

        print("Adding monitor_enabled column (default enabled)...")
        # SQLite stores booleans as 0/1; default 1 = enabled.
        cursor.execute("ALTER TABLE nl43_config ADD COLUMN monitor_enabled BOOLEAN DEFAULT 1")
        conn.commit()
        print("✓ Added monitor_enabled column")
        print("\n✓ Migration completed successfully!")

    except Exception as e:
        conn.rollback()
        print(f"✗ Migration failed: {e}", file=sys.stderr)
        sys.exit(1)
    finally:
        conn.close()


if __name__ == "__main__":
    migrate()
+136
@@ -0,0 +1,136 @@
#!/usr/bin/env python3
"""
Migration script to add polling-related fields to nl43_config and nl43_status tables.
Adds to nl43_config:
- poll_interval_seconds (INTEGER, default 60)
- poll_enabled (BOOLEAN, default 1/True)
Adds to nl43_status:
- is_reachable (BOOLEAN, default 1/True)
- consecutive_failures (INTEGER, default 0)
- last_poll_attempt (DATETIME, nullable)
- last_success (DATETIME, nullable)
- last_error (TEXT, nullable)
Usage:
python migrate_add_polling_fields.py
"""
import sqlite3
import sys
from pathlib import Path


def migrate():
    db_path = Path("data/slmm.db")

    if not db_path.exists():
        print(f"❌ Database not found at {db_path}")
        print("   Run this script from the slmm directory")
        return False

    try:
        conn = sqlite3.connect(db_path)
        cursor = conn.cursor()

        # Check nl43_config columns
        cursor.execute("PRAGMA table_info(nl43_config)")
        config_columns = [row[1] for row in cursor.fetchall()]

        # Check nl43_status columns
        cursor.execute("PRAGMA table_info(nl43_status)")
        status_columns = [row[1] for row in cursor.fetchall()]

        changes_made = False

        # Add nl43_config columns
        if "poll_interval_seconds" not in config_columns:
            print("Adding poll_interval_seconds to nl43_config...")
            cursor.execute("""
                ALTER TABLE nl43_config
                ADD COLUMN poll_interval_seconds INTEGER DEFAULT 60
            """)
            changes_made = True
        else:
            print("✓ poll_interval_seconds already exists in nl43_config")

        if "poll_enabled" not in config_columns:
            print("Adding poll_enabled to nl43_config...")
            cursor.execute("""
                ALTER TABLE nl43_config
                ADD COLUMN poll_enabled BOOLEAN DEFAULT 1
            """)
            changes_made = True
        else:
            print("✓ poll_enabled already exists in nl43_config")

        # Add nl43_status columns
        if "is_reachable" not in status_columns:
            print("Adding is_reachable to nl43_status...")
            cursor.execute("""
                ALTER TABLE nl43_status
                ADD COLUMN is_reachable BOOLEAN DEFAULT 1
            """)
            changes_made = True
        else:
            print("✓ is_reachable already exists in nl43_status")

        if "consecutive_failures" not in status_columns:
            print("Adding consecutive_failures to nl43_status...")
            cursor.execute("""
                ALTER TABLE nl43_status
                ADD COLUMN consecutive_failures INTEGER DEFAULT 0
            """)
            changes_made = True
        else:
            print("✓ consecutive_failures already exists in nl43_status")

        if "last_poll_attempt" not in status_columns:
            print("Adding last_poll_attempt to nl43_status...")
            cursor.execute("""
                ALTER TABLE nl43_status
                ADD COLUMN last_poll_attempt DATETIME
            """)
            changes_made = True
        else:
            print("✓ last_poll_attempt already exists in nl43_status")

        if "last_success" not in status_columns:
            print("Adding last_success to nl43_status...")
            cursor.execute("""
                ALTER TABLE nl43_status
                ADD COLUMN last_success DATETIME
            """)
            changes_made = True
        else:
            print("✓ last_success already exists in nl43_status")

        if "last_error" not in status_columns:
            print("Adding last_error to nl43_status...")
            cursor.execute("""
                ALTER TABLE nl43_status
                ADD COLUMN last_error TEXT
            """)
            changes_made = True
        else:
            print("✓ last_error already exists in nl43_status")

        if changes_made:
            conn.commit()
            print("\n✓ Migration completed successfully")
            print("   Added polling-related fields to nl43_config and nl43_status")
        else:
            print("\n✓ All polling fields already exist - no changes needed")

        conn.close()
        return True

    except Exception as e:
        print(f"❌ Migration failed: {e}")
        return False


if __name__ == "__main__":
    success = migrate()
    sys.exit(0 if success else 1)
+60
@@ -0,0 +1,60 @@
#!/usr/bin/env python3
"""
Database migration: Add start_time_sync_attempted field to nl43_status table.
This field tracks whether FTP sync has been attempted for the current measurement,
preventing repeated sync attempts when FTP fails.
Run this once to add the new column.
"""
import sqlite3
import os

# Path to the SLMM database
DB_PATH = os.path.join(os.path.dirname(__file__), "data", "slmm.db")


def migrate():
    print(f"Adding start_time_sync_attempted field to: {DB_PATH}")

    if not os.path.exists(DB_PATH):
        print("Database does not exist yet. Column will be created automatically.")
        return

    conn = sqlite3.connect(DB_PATH)
    cursor = conn.cursor()

    try:
        # Check if column already exists
        cursor.execute("PRAGMA table_info(nl43_status)")
        columns = [col[1] for col in cursor.fetchall()]

        if 'start_time_sync_attempted' in columns:
            print("✓ start_time_sync_attempted column already exists, no migration needed")
            return

        # Add the column
        print("Adding start_time_sync_attempted column...")
        cursor.execute("""
            ALTER TABLE nl43_status
            ADD COLUMN start_time_sync_attempted BOOLEAN DEFAULT 0
        """)
        conn.commit()
        print("✓ Added start_time_sync_attempted column")

        # Verify
        cursor.execute("PRAGMA table_info(nl43_status)")
        columns = [col[1] for col in cursor.fetchall()]

        if 'start_time_sync_attempted' not in columns:
            raise Exception("start_time_sync_attempted column was not added successfully")

        print("✓ Migration completed successfully")

    finally:
        conn.close()


if __name__ == "__main__":
    migrate()
+250 -5
@@ -31,6 +31,11 @@
<body> <body>
<h1>SLMM NL43 Standalone</h1> <h1>SLMM NL43 Standalone</h1>
<p>Configure a unit (host/port), then use controls to Start/Stop and fetch live status.</p> <p>Configure a unit (host/port), then use controls to Start/Stop and fetch live status.</p>
<p style="margin-bottom: 16px;">
<a href="/roster" style="color: #0969da; text-decoration: none; font-weight: 600;">📊 View Device Roster</a>
<span style="margin: 0 8px; color: #d0d7de;">|</span>
<a href="/docs" style="color: #0969da; text-decoration: none;">API Documentation</a>
</p>
<fieldset> <fieldset>
<legend>🔍 Connection Diagnostics</legend> <legend>🔍 Connection Diagnostics</legend>
@@ -40,13 +45,34 @@
</fieldset> </fieldset>
<fieldset> <fieldset>
<legend>Unit Config</legend> <legend>Unit Selection & Config</legend>
<div style="display: flex; gap: 8px; align-items: flex-end; margin-bottom: 12px;">
<div style="flex: 1;">
<label>Select Device</label>
<select id="deviceSelector" onchange="loadSelectedDevice()" style="width: 100%; padding: 8px; margin-bottom: 0;">
<option value="">-- Select a device --</option>
</select>
</div>
<button onclick="refreshDeviceList()" style="padding: 8px 12px;">↻ Refresh</button>
</div>
<div style="padding: 12px; background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 4px; margin-bottom: 12px;">
<div style="display: flex; gap: 16px;">
<div style="flex: 1;">
<label>Unit ID</label> <label>Unit ID</label>
<input id="unitId" value="nl43-1" /> <input id="unitId" value="nl43-1" />
</div>
<div style="flex: 2;">
<label>Host</label> <label>Host</label>
<input id="host" value="127.0.0.1" /> <input id="host" value="127.0.0.1" />
<label>Port</label> </div>
<input id="port" type="number" value="80" /> <div style="flex: 1;">
<label>TCP Port</label>
<input id="port" type="number" value="2255" />
</div>
</div>
</div>
<div style="margin: 12px 0;"> <div style="margin: 12px 0;">
<label style="display: inline-flex; align-items: center; margin-right: 16px;"> <label style="display: inline-flex; align-items: center; margin-right: 16px;">
@@ -66,8 +92,10 @@
<input id="ftpPassword" type="password" value="0000" /> <input id="ftpPassword" type="password" value="0000" />
</div> </div>
<button onclick="saveConfig()" style="margin-top: 12px;">Save Config</button> <div style="margin-top: 12px;">
<button onclick="saveConfig()">Save Config</button>
<button onclick="loadConfig()">Load Config</button> <button onclick="loadConfig()">Load Config</button>
</div>
</fieldset> </fieldset>
<fieldset> <fieldset>
@@ -148,6 +176,7 @@
let ws = null; let ws = null;
let streamUpdateCount = 0; let streamUpdateCount = 0;
let availableDevices = [];
function log(msg) { function log(msg) {
logEl.textContent += msg + "\n"; logEl.textContent += msg + "\n";
@@ -160,9 +189,97 @@
ftpCredentials.style.display = ftpEnabled ? 'block' : 'none'; ftpCredentials.style.display = ftpEnabled ? 'block' : 'none';
} }
// Add event listener for FTP checkbox // Load device list from roster
async function refreshDeviceList() {
try {
const res = await fetch('/api/nl43/roster');
const data = await res.json();
if (!res.ok) {
log('Failed to load device list');
return;
}
availableDevices = data.devices || [];
const selector = document.getElementById('deviceSelector');
// Save current selection
const currentSelection = selector.value;
// Clear and rebuild options
selector.innerHTML = '<option value="">-- Select a device --</option>';
availableDevices.forEach(device => {
const option = document.createElement('option');
option.value = device.unit_id;
// Add status indicator
let statusIcon = '⚪';
if (device.status) {
if (device.status.is_reachable === false) {
statusIcon = '🔴';
} else if (device.status.last_success) {
const lastSeen = new Date(device.status.last_success);
const ageMinutes = Math.floor((Date.now() - lastSeen) / 60000);
statusIcon = ageMinutes < 5 ? '🟢' : '🟡';
}
}
option.textContent = `${statusIcon} ${device.unit_id} (${device.host})`;
selector.appendChild(option);
});
// Restore selection if it still exists
if (currentSelection && availableDevices.find(d => d.unit_id === currentSelection)) {
selector.value = currentSelection;
}
log(`Loaded ${availableDevices.length} device(s) from roster`);
} catch (err) {
log(`Error loading device list: ${err.message}`);
}
}
// Load selected device configuration
function loadSelectedDevice() {
const selector = document.getElementById('deviceSelector');
const unitId = selector.value;
if (!unitId) {
return;
}
const device = availableDevices.find(d => d.unit_id === unitId);
if (!device) {
log(`Device ${unitId} not found in list`);
return;
}
// Populate form fields
document.getElementById('unitId').value = device.unit_id;
document.getElementById('host').value = device.host;
document.getElementById('port').value = device.tcp_port || 2255;
document.getElementById('tcpEnabled').checked = device.tcp_enabled || false;
document.getElementById('ftpEnabled').checked = device.ftp_enabled || false;
if (device.ftp_username) {
document.getElementById('ftpUsername').value = device.ftp_username;
}
if (device.ftp_password) {
document.getElementById('ftpPassword').value = device.ftp_password;
}
toggleFtpCredentials();
log(`Loaded configuration for ${device.unit_id}`);
}
// Add event listeners
document.addEventListener('DOMContentLoaded', function() { document.addEventListener('DOMContentLoaded', function() {
document.getElementById('ftpEnabled').addEventListener('change', toggleFtpCredentials); document.getElementById('ftpEnabled').addEventListener('change', toggleFtpCredentials);
// Load device list on page load
refreshDeviceList();
}); });
async function runDiagnostics() { async function runDiagnostics() {
@@ -216,6 +333,134 @@
html += `<p style="margin-top: 12px; font-size: 0.9em; color: #666;">Last run: ${new Date(data.timestamp).toLocaleString()}</p>`; html += `<p style="margin-top: 12px; font-size: 0.9em; color: #666;">Last run: ${new Date(data.timestamp).toLocaleString()}</p>`;
// Add database dump section if available
if (data.database_dump) {
html += `<div style="margin-top: 16px; border-top: 1px solid #d0d7de; padding-top: 12px;">`;
html += `<h4 style="margin: 0 0 12px 0;">📦 Database Dump</h4>`;
// Config section
if (data.database_dump.config) {
const cfg = data.database_dump.config;
html += `<div style="background: #f0f4f8; padding: 12px; border-radius: 4px; margin-bottom: 12px;">`;
html += `<strong>Configuration (nl43_config)</strong>`;
html += `<table style="width: 100%; margin-top: 8px; font-size: 0.9em;">`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Host</td><td>${cfg.host}:${cfg.tcp_port}</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">TCP Enabled</td><td>${cfg.tcp_enabled ? '✓' : '✗'}</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">FTP Enabled</td><td>${cfg.ftp_enabled ? '✓' : '✗'}${cfg.ftp_enabled ? ` (port ${cfg.ftp_port}, user: ${cfg.ftp_username || 'none'})` : ''}</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Background Polling</td><td>${cfg.poll_enabled ? `✓ every ${cfg.poll_interval_seconds}s` : '✗ disabled'}</td></tr>`;
html += `</table></div>`;
}
// Status cache section
if (data.database_dump.status_cache) {
const cache = data.database_dump.status_cache;
html += `<div style="background: #f0f8f4; padding: 12px; border-radius: 4px; margin-bottom: 12px;">`;
html += `<strong>Status Cache (nl43_status)</strong>`;
html += `<table style="width: 100%; margin-top: 8px; font-size: 0.9em;">`;
// Measurement state and timing
html += `<tr><td style="padding: 2px 8px; color: #666;">Measurement State</td><td><strong>${cache.measurement_state || 'unknown'}</strong></td></tr>`;
if (cache.measurement_start_time) {
const startTime = new Date(cache.measurement_start_time);
const elapsed = Math.floor((Date.now() - startTime) / 1000);
const elapsedStr = elapsed > 3600 ? `${Math.floor(elapsed/3600)}h ${Math.floor((elapsed%3600)/60)}m` : elapsed > 60 ? `${Math.floor(elapsed/60)}m ${elapsed%60}s` : `${elapsed}s`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Measurement Started</td><td>${startTime.toLocaleString()} (${elapsedStr} ago)</td></tr>`;
}
html += `<tr><td style="padding: 2px 8px; color: #666;">Counter (d0)</td><td>${cache.counter || 'N/A'}</td></tr>`;
// Sound levels
html += `<tr><td colspan="2" style="padding: 8px 8px 2px 8px; font-weight: 600; border-top: 1px solid #d0d7de;">Sound Levels (dB)</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Lp (Instantaneous)</td><td>${cache.lp || 'N/A'}</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Leq (Equivalent)</td><td>${cache.leq || 'N/A'}</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Lmax / Lmin</td><td>${cache.lmax || 'N/A'} / ${cache.lmin || 'N/A'}</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Lpeak</td><td>${cache.lpeak || 'N/A'}</td></tr>`;
// Device status
html += `<tr><td colspan="2" style="padding: 8px 8px 2px 8px; font-weight: 600; border-top: 1px solid #d0d7de;">Device Status</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Battery</td><td>${cache.battery_level || 'N/A'}${cache.power_source ? ` (${cache.power_source})` : ''}</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">SD Card</td><td>${cache.sd_remaining_mb ? `${cache.sd_remaining_mb} MB` : 'N/A'}${cache.sd_free_ratio ? ` (${cache.sd_free_ratio} free)` : ''}</td></tr>`;
// Polling status
html += `<tr><td colspan="2" style="padding: 8px 8px 2px 8px; font-weight: 600; border-top: 1px solid #d0d7de;">Polling Status</td></tr>`;
html += `<tr><td style="padding: 2px 8px; color: #666;">Reachable</td><td>${cache.is_reachable ? '🟢 Yes' : '🔴 No'}</td></tr>`;
if (cache.last_seen) {
html += `<tr><td style="padding: 2px 8px; color: #666;">Last Seen</td><td>${new Date(cache.last_seen).toLocaleString()}</td></tr>`;
}
if (cache.last_success) {
html += `<tr><td style="padding: 2px 8px; color: #666;">Last Success</td><td>${new Date(cache.last_success).toLocaleString()}</td></tr>`;
}
if (cache.last_poll_attempt) {
html += `<tr><td style="padding: 2px 8px; color: #666;">Last Poll Attempt</td><td>${new Date(cache.last_poll_attempt).toLocaleString()}</td></tr>`;
}
html += `<tr><td style="padding: 2px 8px; color: #666;">Consecutive Failures</td><td>${cache.consecutive_failures || 0}</td></tr>`;
if (cache.last_error) {
html += `<tr><td style="padding: 2px 8px; color: #666;">Last Error</td><td style="color: #d00; font-size: 0.85em;">${cache.last_error}</td></tr>`;
}
html += `</table></div>`;
// Raw payload (collapsible)
if (cache.raw_payload) {
html += `<details style="margin-top: 8px;"><summary style="cursor: pointer; color: #666; font-size: 0.9em;">📄 Raw Payload</summary>`;
html += `<pre style="background: #f6f8fa; padding: 8px; border-radius: 4px; font-size: 0.8em; overflow-x: auto; margin-top: 8px;">${cache.raw_payload}</pre></details>`;
}
} else {
html += `<p style="color: #888; font-style: italic;">No cached status available for this unit.</p>`;
}
html += `</div>`;
}
// Fetch and display device logs
try {
const logsRes = await fetch(`/api/nl43/${unitId}/logs?limit=50`);
if (logsRes.ok) {
const logsData = await logsRes.json();
if (logsData.logs && logsData.logs.length > 0) {
html += `<div style="margin-top: 16px; border-top: 1px solid #d0d7de; padding-top: 12px;">`;
html += `<h4 style="margin: 0 0 12px 0;">📋 Device Logs (${logsData.stats.total} total)</h4>`;
// Stats summary
if (logsData.stats.by_level) {
html += `<div style="margin-bottom: 8px; font-size: 0.85em; color: #666;">`;
const levels = logsData.stats.by_level;
const parts = [];
if (levels.ERROR) parts.push(`<span style="color: #d00;">${levels.ERROR} errors</span>`);
if (levels.WARNING) parts.push(`<span style="color: #fa0;">${levels.WARNING} warnings</span>`);
if (levels.INFO) parts.push(`${levels.INFO} info`);
html += parts.join(' · ');
html += `</div>`;
}
// Log entries (collapsible)
html += `<details open><summary style="cursor: pointer; font-size: 0.9em; margin-bottom: 8px;">Recent entries (${logsData.logs.length})</summary>`;
html += `<div style="max-height: 300px; overflow-y: auto; background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 4px; padding: 8px; font-size: 0.8em; font-family: monospace;">`;
logsData.logs.forEach(entry => {
const levelColor = {
'ERROR': '#d00',
'WARNING': '#b86e00',
'INFO': '#0969da',
'DEBUG': '#888'
}[entry.level] || '#666';
const time = new Date(entry.timestamp).toLocaleString();
html += `<div style="margin-bottom: 4px; border-bottom: 1px solid #eee; padding-bottom: 4px;">`;
html += `<span style="color: #888;">${time}</span> `;
html += `<span style="color: ${levelColor}; font-weight: 600;">[${entry.level}]</span> `;
html += `<span style="color: #666;">[${entry.category}]</span> `;
html += `${entry.message}`;
html += `</div>`;
});
html += `</div></details>`;
html += `</div>`;
}
}
} catch (logErr) {
console.log('Could not fetch device logs:', logErr);
}
resultsEl.innerHTML = html; resultsEl.innerHTML = html;
log(`Diagnostics complete: ${data.overall_status}`); log(`Diagnostics complete: ${data.overall_status}`);
+901
@@ -0,0 +1,901 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8" />
<meta name="viewport" content="width=device-width, initial-scale=1.0" />
<title>SLMM - Device Roster &amp; Connections</title>
<style>
* { box-sizing: border-box; }
body {
font-family: system-ui, -apple-system, sans-serif;
margin: 0;
padding: 24px;
background: #f6f8fa;
}
.container { max-width: 1400px; margin: 0 auto; }
.header {
display: flex;
justify-content: space-between;
align-items: center;
margin-bottom: 24px;
padding: 16px;
background: white;
border-radius: 6px;
box-shadow: 0 1px 3px rgba(0,0,0,0.1);
}
h1 { margin: 0; font-size: 24px; }
.nav { display: flex; gap: 12px; }
.btn {
padding: 8px 16px;
border: 1px solid #d0d7de;
background: white;
border-radius: 6px;
cursor: pointer;
text-decoration: none;
color: #24292f;
font-size: 14px;
transition: background 0.2s;
}
.btn:hover { background: #f6f8fa; }
.btn-primary {
background: #2da44e;
color: white;
border-color: #2da44e;
}
.btn-primary:hover { background: #2c974b; }
.btn-danger {
background: #cf222e;
color: white;
border-color: #cf222e;
}
.btn-danger:hover { background: #a40e26; }
.btn-small {
padding: 4px 8px;
font-size: 12px;
margin-right: 4px;
}
.table-container {
background: white;
border-radius: 6px;
box-shadow: 0 1px 3px rgba(0,0,0,0.1);
overflow-x: auto;
}
table {
width: 100%;
border-collapse: collapse;
}
th {
background: #f6f8fa;
padding: 12px;
text-align: left;
font-weight: 600;
border-bottom: 2px solid #d0d7de;
font-size: 13px;
white-space: nowrap;
}
td {
padding: 12px;
border-bottom: 1px solid #d0d7de;
font-size: 13px;
}
tr:hover { background: #f6f8fa; }
.status-badge {
display: inline-block;
padding: 2px 8px;
border-radius: 12px;
font-size: 11px;
font-weight: 600;
text-transform: uppercase;
}
.status-ok {
background: #dafbe1;
color: #1a7f37;
}
.status-unknown {
background: #eaeef2;
color: #57606a;
}
.status-error {
background: #ffebe9;
color: #cf222e;
}
.checkbox-cell {
text-align: center;
width: 80px;
}
.checkbox-cell input[type="checkbox"] {
cursor: pointer;
width: 16px;
height: 16px;
}
.actions-cell {
white-space: nowrap;
width: 200px;
}
.empty-state {
text-align: center;
padding: 48px;
color: #57606a;
}
.empty-state-icon {
font-size: 48px;
margin-bottom: 16px;
}
.modal {
display: none;
position: fixed;
top: 0;
left: 0;
width: 100%;
height: 100%;
background: rgba(0,0,0,0.5);
z-index: 1000;
align-items: center;
justify-content: center;
}
.modal.active { display: flex; }
.modal-content {
background: white;
padding: 24px;
border-radius: 6px;
max-width: 600px;
width: 90%;
max-height: 80vh;
overflow-y: auto;
}
.modal-header {
display: flex;
justify-content: space-between;
align-items: center;
margin-bottom: 16px;
}
.modal-header h2 {
margin: 0;
font-size: 20px;
}
.close-btn {
background: none;
border: none;
font-size: 24px;
cursor: pointer;
color: #57606a;
padding: 0;
width: 32px;
height: 32px;
}
.close-btn:hover { color: #24292f; }
.form-group {
margin-bottom: 16px;
}
.form-group label {
display: block;
margin-bottom: 6px;
font-weight: 600;
font-size: 14px;
}
.form-group input[type="text"],
.form-group input[type="number"],
.form-group input[type="password"] {
width: 100%;
padding: 8px 12px;
border: 1px solid #d0d7de;
border-radius: 6px;
font-size: 14px;
}
.form-group input[type="checkbox"] {
width: auto;
margin-right: 8px;
}
.checkbox-label {
display: flex;
align-items: center;
font-weight: normal;
cursor: pointer;
}
.form-actions {
display: flex;
justify-content: flex-end;
gap: 8px;
margin-top: 24px;
}
.toast {
position: fixed;
top: 24px;
right: 24px;
padding: 12px 16px;
background: #24292f;
color: white;
border-radius: 6px;
box-shadow: 0 4px 12px rgba(0,0,0,0.15);
z-index: 2000;
display: none;
min-width: 300px;
}
.toast.active {
display: block;
animation: slideIn 0.3s ease-out;
}
@keyframes slideIn {
from {
transform: translateX(400px);
opacity: 0;
}
to {
transform: translateX(0);
opacity: 1;
}
}
.toast-success { background: #2da44e; }
.toast-error { background: #cf222e; }
/* Tabs */
.tabs {
display: flex;
gap: 0;
margin-bottom: 0;
border-bottom: 2px solid #d0d7de;
}
.tab-btn {
padding: 10px 20px;
border: none;
background: none;
cursor: pointer;
font-size: 14px;
font-weight: 600;
color: #57606a;
border-bottom: 2px solid transparent;
margin-bottom: -2px;
transition: color 0.2s, border-color 0.2s;
}
.tab-btn:hover { color: #24292f; }
.tab-btn.active {
color: #24292f;
border-bottom-color: #fd8c73;
}
.tab-panel { display: none; }
.tab-panel.active { display: block; }
/* Connection pool panel */
.pool-config {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(180px, 1fr));
gap: 12px;
margin-bottom: 20px;
}
.pool-config-card {
background: #f6f8fa;
border: 1px solid #d0d7de;
border-radius: 6px;
padding: 12px;
}
.pool-config-card .label {
font-size: 11px;
color: #57606a;
text-transform: uppercase;
font-weight: 600;
margin-bottom: 4px;
}
.pool-config-card .value {
font-size: 18px;
font-weight: 600;
color: #24292f;
}
.conn-card {
background: white;
border: 1px solid #d0d7de;
border-radius: 6px;
padding: 16px;
margin-bottom: 12px;
}
.conn-card-header {
display: flex;
justify-content: space-between;
align-items: center;
margin-bottom: 12px;
}
.conn-card-header strong { font-size: 15px; }
.conn-card-grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(140px, 1fr));
gap: 8px;
}
.conn-stat .label {
font-size: 11px;
color: #57606a;
text-transform: uppercase;
font-weight: 600;
}
.conn-stat .value {
font-size: 14px;
font-weight: 600;
color: #24292f;
}
.conn-empty {
text-align: center;
padding: 32px;
color: #57606a;
}
.pool-actions {
display: flex;
gap: 8px;
margin-bottom: 16px;
}
</style>
</head>
<body>
<div class="container">
<div class="header">
<h1>SLMM - Roster &amp; Connections</h1>
<div class="nav">
<a href="/" class="btn">&larr; Back to Control Panel</a>
<button class="btn btn-primary" onclick="openAddModal()">+ Add Device</button>
</div>
</div>
<div class="tabs">
<button class="tab-btn active" onclick="switchTab('roster')">Device Roster</button>
<button class="tab-btn" onclick="switchTab('connections')">Connections</button>
</div>
<!-- Roster Tab -->
<div id="tab-roster" class="tab-panel active">
<div class="table-container" style="border-top-left-radius: 0; border-top-right-radius: 0;">
<table id="rosterTable">
<thead>
<tr>
<th>Unit ID</th>
<th>Host / IP</th>
<th>TCP Port</th>
<th>FTP Port</th>
<th class="checkbox-cell">TCP</th>
<th class="checkbox-cell">FTP</th>
<th class="checkbox-cell">Polling</th>
<th>Status</th>
<th class="actions-cell">Actions</th>
</tr>
</thead>
<tbody id="rosterBody">
<tr>
<td colspan="9" style="text-align: center; padding: 24px;">
Loading...
</td>
</tr>
</tbody>
</table>
</div>
</div>
<!-- Connections Tab -->
<div id="tab-connections" class="tab-panel">
<div class="table-container" style="padding: 20px; border-top-left-radius: 0; border-top-right-radius: 0;">
<div class="pool-actions">
<button class="btn" onclick="loadConnections()">Refresh</button>
<button class="btn btn-danger" onclick="flushConnections()">Flush All Connections</button>
</div>
<h3 style="margin: 0 0 12px 0; font-size: 16px;">Pool Configuration</h3>
<div id="poolConfig" class="pool-config">
<div class="pool-config-card">
<div class="label">Status</div>
<div class="value" id="poolEnabled">--</div>
</div>
</div>
<h3 style="margin: 20px 0 12px 0; font-size: 16px;">Active Connections</h3>
<div id="connectionsList">
<div class="conn-empty">Loading...</div>
</div>
</div>
</div>
</div>
<!-- Add/Edit Modal -->
<div id="deviceModal" class="modal">
<div class="modal-content">
<div class="modal-header">
<h2 id="modalTitle">Add Device</h2>
<button class="close-btn" onclick="closeModal()">&times;</button>
</div>
<form id="deviceForm" onsubmit="saveDevice(event)">
<div class="form-group">
<label for="unitId">Unit ID *</label>
<input type="text" id="unitId" required placeholder="e.g., nl43-1, slm-site-a" />
</div>
<div class="form-group">
<label for="host">Host / IP Address *</label>
<input type="text" id="host" required placeholder="e.g., 192.168.1.100" />
</div>
<div class="form-group">
<label for="tcpPort">TCP Port *</label>
<input type="number" id="tcpPort" required value="2255" min="1" max="65535" />
</div>
<div class="form-group">
<label for="ftpPort">FTP Port</label>
<input type="number" id="ftpPort" value="21" min="1" max="65535" />
</div>
<div class="form-group">
<label class="checkbox-label">
<input type="checkbox" id="tcpEnabled" checked />
TCP Enabled (required for remote control)
</label>
</div>
<div class="form-group">
<label class="checkbox-label">
<input type="checkbox" id="ftpEnabled" onchange="toggleFtpCredentials()" />
FTP Enabled (for file downloads)
</label>
</div>
<div id="ftpCredentialsSection" style="display: none; padding: 12px; background: #f6f8fa; border-radius: 6px; margin-bottom: 16px;">
<div class="form-group">
<label for="ftpUsername">FTP Username</label>
<input type="text" id="ftpUsername" placeholder="Default: USER" />
</div>
<div class="form-group">
<label for="ftpPassword">FTP Password</label>
<input type="password" id="ftpPassword" placeholder="Default: 0000" />
</div>
</div>
<div class="form-group">
<label class="checkbox-label">
<input type="checkbox" id="pollEnabled" checked />
Enable background polling (status updates)
</label>
</div>
<div class="form-group">
<label for="pollInterval">Polling Interval (seconds)</label>
<input type="number" id="pollInterval" value="60" min="10" max="3600" />
</div>
<div class="form-actions">
<button type="button" class="btn" onclick="closeModal()">Cancel</button>
<button type="submit" class="btn btn-primary">Save Device</button>
</div>
</form>
</div>
</div>
<!-- Toast Notification -->
<div id="toast" class="toast"></div>
<script>
let devices = [];
let editingDeviceId = null;
// Load roster on page load
document.addEventListener('DOMContentLoaded', () => {
loadRoster();
});
async function loadRoster() {
try {
const res = await fetch('/api/nl43/roster');
const data = await res.json();
if (!res.ok) {
showToast('Failed to load roster', 'error');
return;
}
devices = data.devices || [];
renderRoster();
} catch (err) {
showToast('Error loading roster: ' + err.message, 'error');
console.error('Load roster error:', err);
}
}
function renderRoster() {
const tbody = document.getElementById('rosterBody');
if (devices.length === 0) {
tbody.innerHTML = `
<tr>
<td colspan="9" class="empty-state">
<div class="empty-state-icon">📭</div>
<div><strong>No devices configured</strong></div>
<div style="margin-top: 8px; font-size: 14px;">Click "Add Device" to configure your first sound level meter</div>
</td>
</tr>
`;
return;
}
tbody.innerHTML = devices.map(device => `
<tr>
<td><strong>${escapeHtml(device.unit_id)}</strong></td>
<td>${escapeHtml(device.host)}</td>
<td>${device.tcp_port}</td>
<td>${device.ftp_port || 21}</td>
<td class="checkbox-cell">
<input type="checkbox" ${device.tcp_enabled ? 'checked' : ''} disabled />
</td>
<td class="checkbox-cell">
<input type="checkbox" ${device.ftp_enabled ? 'checked' : ''} disabled />
</td>
<td class="checkbox-cell">
<input type="checkbox" ${device.poll_enabled ? 'checked' : ''} disabled />
</td>
<td>
${getStatusBadge(device)}
</td>
<td class="actions-cell">
<button class="btn btn-small" onclick="testDevice(${escapeHtml(JSON.stringify(device.unit_id))})">Test</button>
<button class="btn btn-small" onclick="openEditModal(${escapeHtml(JSON.stringify(device.unit_id))})">Edit</button>
<button class="btn btn-small btn-danger" onclick="deleteDevice(${escapeHtml(JSON.stringify(device.unit_id))})">Delete</button>
</td>
</tr>
`).join('');
}
function getStatusBadge(device) {
if (!device.status) {
return '<span class="status-badge status-unknown">Unknown</span>';
}
if (device.status.is_reachable === false) {
return '<span class="status-badge status-error">Offline</span>';
}
if (device.status.last_success) {
const lastSeen = new Date(device.status.last_success);
const ago = Math.floor((Date.now() - lastSeen) / 1000);
if (ago < 300) { // Less than 5 minutes
return '<span class="status-badge status-ok">Online</span>';
} else {
return `<span class="status-badge status-unknown">Stale (${Math.floor(ago / 60)}m ago)</span>`;
}
}
return '<span class="status-badge status-unknown">Unknown</span>';
}
function escapeHtml(text) {
const map = {
'&': '&amp;',
'<': '&lt;',
'>': '&gt;',
'"': '&quot;',
"'": '&#039;'
};
return String(text).replace(/[&<>"']/g, m => map[m]);
}
function openAddModal() {
editingDeviceId = null;
document.getElementById('modalTitle').textContent = 'Add Device';
document.getElementById('deviceForm').reset();
document.getElementById('unitId').disabled = false;
document.getElementById('tcpEnabled').checked = true;
document.getElementById('ftpEnabled').checked = false;
document.getElementById('pollEnabled').checked = true;
document.getElementById('tcpPort').value = 2255;
document.getElementById('ftpPort').value = 21;
document.getElementById('pollInterval').value = 60;
toggleFtpCredentials();
document.getElementById('deviceModal').classList.add('active');
}
function openEditModal(unitId) {
const device = devices.find(d => d.unit_id === unitId);
if (!device) {
showToast('Device not found', 'error');
return;
}
editingDeviceId = unitId;
document.getElementById('modalTitle').textContent = 'Edit Device';
document.getElementById('unitId').value = device.unit_id;
document.getElementById('unitId').disabled = true;
document.getElementById('host').value = device.host;
document.getElementById('tcpPort').value = device.tcp_port;
document.getElementById('ftpPort').value = device.ftp_port || 21;
document.getElementById('tcpEnabled').checked = device.tcp_enabled;
document.getElementById('ftpEnabled').checked = device.ftp_enabled;
document.getElementById('ftpUsername').value = device.ftp_username || '';
document.getElementById('ftpPassword').value = device.ftp_password || '';
document.getElementById('pollEnabled').checked = device.poll_enabled;
document.getElementById('pollInterval').value = device.poll_interval_seconds || 60;
toggleFtpCredentials();
document.getElementById('deviceModal').classList.add('active');
}
function closeModal() {
document.getElementById('deviceModal').classList.remove('active');
editingDeviceId = null;
}
function toggleFtpCredentials() {
const ftpEnabled = document.getElementById('ftpEnabled').checked;
document.getElementById('ftpCredentialsSection').style.display = ftpEnabled ? 'block' : 'none';
}
async function saveDevice(event) {
event.preventDefault();
const unitId = document.getElementById('unitId').value.trim();
const payload = {
host: document.getElementById('host').value.trim(),
tcp_port: parseInt(document.getElementById('tcpPort').value, 10),
ftp_port: parseInt(document.getElementById('ftpPort').value, 10),
tcp_enabled: document.getElementById('tcpEnabled').checked,
ftp_enabled: document.getElementById('ftpEnabled').checked,
poll_enabled: document.getElementById('pollEnabled').checked,
poll_interval_seconds: parseInt(document.getElementById('pollInterval').value, 10)
};
if (payload.ftp_enabled) {
const username = document.getElementById('ftpUsername').value.trim();
const password = document.getElementById('ftpPassword').value.trim();
if (username) payload.ftp_username = username;
if (password) payload.ftp_password = password;
}
try {
const url = editingDeviceId
? `/api/nl43/${encodeURIComponent(editingDeviceId)}/config`
: `/api/nl43/roster`;
const method = editingDeviceId ? 'PUT' : 'POST';
const body = editingDeviceId
? payload
: { unit_id: unitId, ...payload };
const res = await fetch(url, {
method,
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify(body)
});
const data = await res.json();
if (!res.ok) {
showToast(data.detail || 'Failed to save device', 'error');
return;
}
showToast(editingDeviceId ? 'Device updated successfully' : 'Device added successfully', 'success');
closeModal();
await loadRoster();
} catch (err) {
showToast('Error saving device: ' + err.message, 'error');
console.error('Save device error:', err);
}
}
async function deleteDevice(unitId) {
if (!confirm(`Are you sure you want to delete "${unitId}"?\n\nThis will remove the device configuration but will not affect the physical device.`)) {
return;
}
try {
const res = await fetch(`/api/nl43/${encodeURIComponent(unitId)}/config`, {
method: 'DELETE'
});
const data = await res.json();
if (!res.ok) {
showToast(data.detail || 'Failed to delete device', 'error');
return;
}
showToast('Device deleted successfully', 'success');
await loadRoster();
} catch (err) {
showToast('Error deleting device: ' + err.message, 'error');
console.error('Delete device error:', err);
}
}
async function testDevice(unitId) {
showToast('Testing device connection...', 'success');
try {
const res = await fetch(`/api/nl43/${encodeURIComponent(unitId)}/diagnostics`);
const data = await res.json();
if (!res.ok) {
showToast('Device test failed', 'error');
return;
}
const statusText = {
'pass': 'All systems operational ✓',
'fail': 'Connection failed ✗',
'degraded': 'Partial connectivity ⚠'
};
showToast(statusText[data.overall_status] || 'Test complete',
data.overall_status === 'pass' ? 'success' : 'error');
// Reload to update status
await loadRoster();
} catch (err) {
showToast('Error testing device: ' + err.message, 'error');
console.error('Test device error:', err);
}
}
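let toastTimer = null;
function showToast(message, type = 'success') {
const toast = document.getElementById('toast');
toast.textContent = message;
toast.className = `toast toast-${type} active`;
// Restart the hide timer so an earlier toast's timeout cannot hide this one early
clearTimeout(toastTimer);
toastTimer = setTimeout(() => {
toast.classList.remove('active');
}, 3000);
}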
// Close modal when clicking outside
document.getElementById('deviceModal').addEventListener('click', (e) => {
if (e.target.id === 'deviceModal') {
closeModal();
}
});
// ========== Tab Switching ==========
function switchTab(tabName) {
document.querySelectorAll('.tab-btn').forEach(btn => btn.classList.remove('active'));
document.querySelectorAll('.tab-panel').forEach(panel => panel.classList.remove('active'));
document.querySelector(`.tab-btn[onclick="switchTab('${tabName}')"]`).classList.add('active');
document.getElementById(`tab-${tabName}`).classList.add('active');
if (tabName === 'connections') {
loadConnections();
}
}
// ========== Connection Pool ==========
let connectionsRefreshTimer = null;
async function loadConnections() {
try {
const res = await fetch('/api/nl43/_connections/status');
const data = await res.json();
if (!res.ok) {
showToast('Failed to load connection pool status', 'error');
return;
}
const pool = data.pool;
renderPoolConfig(pool);
renderConnections(pool.connections);
// Auto-refresh while tab is active
clearTimeout(connectionsRefreshTimer);
if (document.getElementById('tab-connections').classList.contains('active')) {
connectionsRefreshTimer = setTimeout(loadConnections, 5000);
}
} catch (err) {
showToast('Error loading connections: ' + err.message, 'error');
console.error('Load connections error:', err);
}
}
function renderPoolConfig(pool) {
document.getElementById('poolConfig').innerHTML = `
<div class="pool-config-card">
<div class="label">Persistent</div>
<div class="value" style="color: ${pool.enabled ? '#1a7f37' : '#cf222e'}">${pool.enabled ? 'Enabled' : 'Disabled'}</div>
</div>
<div class="pool-config-card">
<div class="label">Active</div>
<div class="value">${pool.active_connections}</div>
</div>
<div class="pool-config-card">
<div class="label">Idle TTL</div>
<div class="value">${pool.idle_ttl}s</div>
</div>
<div class="pool-config-card">
<div class="label">Max Age</div>
<div class="value">${pool.max_age}s</div>
</div>
<div class="pool-config-card">
<div class="label">KA Idle</div>
<div class="value">${pool.keepalive_idle}s</div>
</div>
<div class="pool-config-card">
<div class="label">KA Interval</div>
<div class="value">${pool.keepalive_interval}s</div>
</div>
<div class="pool-config-card">
<div class="label">KA Probes</div>
<div class="value">${pool.keepalive_count}</div>
</div>
`;
}
function renderConnections(connections) {
const container = document.getElementById('connectionsList');
const keys = Object.keys(connections);
if (keys.length === 0) {
container.innerHTML = `
<div class="conn-empty">
<div style="font-size: 32px; margin-bottom: 8px;">~</div>
<div><strong>No active connections</strong></div>
<div style="margin-top: 4px; font-size: 13px;">
Connections appear here when devices are actively being polled and the connection is cached between commands.
</div>
</div>
`;
return;
}
container.innerHTML = keys.map(key => {
const conn = connections[key];
const aliveColor = conn.alive ? '#1a7f37' : '#cf222e';
const aliveText = conn.alive ? 'Alive' : 'Stale';
return `
<div class="conn-card">
<div class="conn-card-header">
<strong>${escapeHtml(key)}</strong>
<span class="status-badge ${conn.alive ? 'status-ok' : 'status-error'}">${aliveText}</span>
</div>
<div class="conn-card-grid">
<div class="conn-stat">
<div class="label">Host</div>
<div class="value">${escapeHtml(conn.host)}</div>
</div>
<div class="conn-stat">
<div class="label">Port</div>
<div class="value">${conn.port}</div>
</div>
<div class="conn-stat">
<div class="label">Age</div>
<div class="value">${formatSeconds(conn.age_seconds)}</div>
</div>
<div class="conn-stat">
<div class="label">Idle</div>
<div class="value">${formatSeconds(conn.idle_seconds)}</div>
</div>
</div>
</div>
`;
}).join('');
}
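function formatSeconds(s) {
// Round once up front so a value like 119.6 renders as "2m 0s", not "1m 60s"
const total = Math.round(s);
if (total < 60) return total + 's';
if (total < 3600) return Math.floor(total / 60) + 'm ' + (total % 60) + 's';
return Math.floor(total / 3600) + 'h ' + Math.floor((total % 3600) / 60) + 'm';
}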
async function flushConnections() {
if (!confirm('Close all cached TCP connections?\n\nDevices will reconnect on the next poll cycle.')) {
return;
}
try {
const res = await fetch('/api/nl43/_connections/flush', { method: 'POST' });
const data = await res.json();
if (!res.ok) {
showToast(data.detail || 'Failed to flush connections', 'error');
return;
}
showToast('All connections flushed', 'success');
await loadConnections();
} catch (err) {
showToast('Error flushing connections: ' + err.message, 'error');
}
}
</script>
</body>
</html>
+68
@@ -0,0 +1,68 @@
"""
Synthetic unit test for the alert state machine: no DB, no device.
Drives `_evaluate_step` with a fake clock + a level series and checks that
onset/clear fire with the right debounce + hysteresis. Run:
docker compose exec -T slmm python3 test_alert_evaluator.py
# or, if app.alerts imports cleanly standalone: python3 test_alert_evaluator.py
"""
from types import SimpleNamespace
from app.alerts import RuleState, _evaluate_step
def rule(**kw):
base = dict(threshold_db=85.0, duration_s=3, clear_margin_db=2.0, comparison="above")
base.update(kw)
return SimpleNamespace(**base)
def run(series, r):
st = RuleState()
events = [(now, a) for value, now in series
if (a := _evaluate_step(st, value, now, r))]
return events, st
def main():
failures = 0
def check(label, cond, detail=""):
nonlocal failures
print(("PASS" if cond else "FAIL"), label, detail)
if not cond:
failures += 1
# 1) sustained exceedance -> onset after duration; recovery -> clear after duration
r = rule(threshold_db=85, duration_s=3, clear_margin_db=2)
ev, _ = run([(80, 0), (86, 1), (87, 2), (88, 3), (88, 4),
(88, 5), (82, 6), (82, 7), (82, 8), (82, 9)], r)
onsets = [t for t, a in ev if a == "onset"]
clears = [t for t, a in ev if a == "clear"]
check("1 sustained onset@4 / clear@9", onsets == [4] and clears == [9], str(ev))
# 2) brief spike under duration -> no onset (debounce)
ev, _ = run([(80, 0), (90, 1), (90, 2), (80, 3), (80, 4)], rule(duration_s=3))
check("2 brief spike debounced", ev == [], str(ev))
# 3) hysteresis: a dip into the margin (below threshold, above threshold-margin)
# does NOT clear
r = rule(threshold_db=85, duration_s=0, clear_margin_db=3)
ev, st = run([(86, 0), (84, 1), (84, 2), (84, 3)], r)
check("3 hysteresis holds ACTIVE", ev == [(0, "onset")] and st.phase == "active",
f"{ev} phase={st.phase}")
# 4) 'below' comparison (device too quiet) -> onset when value < threshold
ev, _ = run([(30, 0), (15, 1)], rule(threshold_db=20, duration_s=0,
clear_margin_db=2, comparison="below"))
check("4 below-comparison onset@1", ev == [(1, "onset")], str(ev))
print()
print("ALL PASS" if failures == 0 else f"{failures} FAILURE(S)")
return failures
if __name__ == "__main__":
import sys
sys.exit(1 if main() else 0)
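
The test above exercises `_evaluate_step` without showing it. As a reading aid, here is a minimal sketch of a debounce-plus-hysteresis state machine that would satisfy those four scenarios. It is an assumption, not the code in `app.alerts`: the names `SketchState` and `evaluate_step_sketch`, the `"idle"` phase name, and the `since` field are all hypothetical, and the real evaluator also handles concerns such as cooldown that are omitted here.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class SketchState:
    # Hypothetical stand-in for RuleState; only "active" is named by the test.
    phase: str = "idle"
    since: Optional[float] = None  # time the pending transition condition first held


def evaluate_step_sketch(st, value, now, rule):
    """Return "onset", "clear", or None for one (value, now) sample."""
    above = rule.comparison == "above"
    if st.phase == "idle":
        # Waiting for a breach of the threshold itself.
        cond = value > rule.threshold_db if above else value < rule.threshold_db
        action, next_phase = "onset", "active"
    else:
        # Hysteresis: recovery must pass the threshold by clear_margin_db.
        if above:
            cond = value < rule.threshold_db - rule.clear_margin_db
        else:
            cond = value > rule.threshold_db + rule.clear_margin_db
        action, next_phase = "clear", "idle"
    if not cond:
        st.since = None  # condition broken, restart the debounce window
        return None
    if st.since is None:
        st.since = now
    # Debounce: the condition must hold for duration_s before firing.
    if now - st.since >= rule.duration_s:
        st.phase, st.since = next_phase, None
        return action
    return None


if __name__ == "__main__":
    r = SimpleNamespace(threshold_db=85.0, duration_s=3,
                        clear_margin_db=2.0, comparison="above")
    st = SketchState()
    series = [(80, 0), (86, 1), (87, 2), (88, 3), (88, 4),
              (88, 5), (82, 6), (82, 7), (82, 8), (82, 9)]
    events = []
    for value, now in series:
        a = evaluate_step_sketch(st, value, now, r)
        if a:
            events.append((now, a))
    print(events)  # [(4, 'onset'), (9, 'clear')]
```

Using one `since` timestamp for both directions keeps the sketch short: onset and clear are each "condition held continuously for `duration_s`", differing only in which level the value is compared against.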
+167
@@ -0,0 +1,167 @@
#!/bin/bash
# Manual test script for background polling functionality
# Usage: ./test_polling.sh [UNIT_ID]
BASE_URL="http://localhost:8100/api/nl43"
UNIT_ID="${1:-NL43-001}"
echo "=========================================="
echo "Background Polling Test Script"
echo "=========================================="
echo "Testing device: $UNIT_ID"
echo "Base URL: $BASE_URL"
echo ""
# Color codes for output
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
RED='\033[0;31m'
NC='\033[0m' # No Color
# Function to print test header
test_header() {
echo ""
echo "=========================================="
echo "$1"
echo "=========================================="
}
# Function to print success
success() {
echo -e "${GREEN}✓${NC} $1"
}
# Function to print warning
warning() {
echo -e "${YELLOW}⚠${NC} $1"
}
# Function to print error
error() {
echo -e "${RED}✗${NC} $1"
}
# Test 1: Get current polling configuration
test_header "Test 1: Get Current Polling Configuration"
RESPONSE=$(curl -s "$BASE_URL/$UNIT_ID/polling/config")
echo "$RESPONSE" | jq '.'
if echo "$RESPONSE" | jq -e '.status == "ok"' > /dev/null; then
success "Successfully retrieved polling configuration"
CURRENT_INTERVAL=$(echo "$RESPONSE" | jq -r '.data.poll_interval_seconds')
CURRENT_ENABLED=$(echo "$RESPONSE" | jq -r '.data.poll_enabled')
echo " Current interval: ${CURRENT_INTERVAL}s"
echo " Polling enabled: $CURRENT_ENABLED"
else
error "Failed to retrieve polling configuration"
exit 1
fi
# Test 2: Update polling interval to 30 seconds
test_header "Test 2: Update Polling Interval to 30 Seconds"
RESPONSE=$(curl -s -X PUT "$BASE_URL/$UNIT_ID/polling/config" \
-H "Content-Type: application/json" \
-d '{"poll_interval_seconds": 30}')
echo "$RESPONSE" | jq '.'
if echo "$RESPONSE" | jq -e '.status == "ok"' > /dev/null; then
success "Successfully updated polling interval to 30s"
else
error "Failed to update polling interval"
fi
# Test 3: Check global polling status
test_header "Test 3: Check Global Polling Status"
RESPONSE=$(curl -s "$BASE_URL/_polling/status")
echo "$RESPONSE" | jq '.'
if echo "$RESPONSE" | jq -e '.status == "ok"' > /dev/null; then
success "Successfully retrieved global polling status"
POLLER_RUNNING=$(echo "$RESPONSE" | jq -r '.data.poller_running')
TOTAL_DEVICES=$(echo "$RESPONSE" | jq -r '.data.total_devices')
echo " Poller running: $POLLER_RUNNING"
echo " Total devices: $TOTAL_DEVICES"
else
error "Failed to retrieve global polling status"
fi
# Test 4: Wait for automatic poll to occur
test_header "Test 4: Wait for Automatic Poll (35 seconds)"
warning "Waiting 35 seconds for automatic poll to occur..."
for i in {35..1}; do
echo -ne " ${i}s remaining...\r"
sleep 1
done
echo ""
success "Wait complete"
# Test 5: Check if status was updated by background poller
test_header "Test 5: Verify Background Poll Occurred"
RESPONSE=$(curl -s "$BASE_URL/$UNIT_ID/status")
echo "$RESPONSE" | jq '.data | {last_poll_attempt, last_success, is_reachable, consecutive_failures}'
if echo "$RESPONSE" | jq -e '.status == "ok"' > /dev/null; then
LAST_POLL=$(echo "$RESPONSE" | jq -r '.data.last_poll_attempt')
IS_REACHABLE=$(echo "$RESPONSE" | jq -r '.data.is_reachable')
FAILURES=$(echo "$RESPONSE" | jq -r '.data.consecutive_failures')
if [ "$LAST_POLL" != "null" ]; then
success "Device was polled by background poller"
echo " Last poll: $LAST_POLL"
echo " Reachable: $IS_REACHABLE"
echo " Failures: $FAILURES"
else
warning "No automatic poll detected yet"
fi
else
error "Failed to retrieve device status"
fi
# Test 6: Disable polling
test_header "Test 6: Disable Background Polling"
RESPONSE=$(curl -s -X PUT "$BASE_URL/$UNIT_ID/polling/config" \
-H "Content-Type: application/json" \
-d '{"poll_enabled": false}')
echo "$RESPONSE" | jq '.'
if echo "$RESPONSE" | jq -e '.status == "ok"' > /dev/null; then
success "Successfully disabled background polling"
else
error "Failed to disable polling"
fi
# Test 7: Verify polling is disabled
test_header "Test 7: Verify Polling Disabled in Global Status"
RESPONSE=$(curl -s "$BASE_URL/_polling/status")
DEVICE_ENABLED=$(echo "$RESPONSE" | jq --arg uid "$UNIT_ID" '.data.devices[] | select(.unit_id == $uid) | .poll_enabled')
if [ "$DEVICE_ENABLED" == "false" ]; then
success "Polling correctly shows as disabled for $UNIT_ID"
else
warning "Device still appears in polling list or shows as enabled"
fi
# Test 8: Re-enable polling with original interval
test_header "Test 8: Re-enable Polling with Original Interval"
RESPONSE=$(curl -s -X PUT "$BASE_URL/$UNIT_ID/polling/config" \
-H "Content-Type: application/json" \
-d "{\"poll_enabled\": true, \"poll_interval_seconds\": $CURRENT_INTERVAL}")
echo "$RESPONSE" | jq '.'
if echo "$RESPONSE" | jq -e '.status == "ok"' > /dev/null; then
success "Successfully re-enabled polling with ${CURRENT_INTERVAL}s interval"
else
error "Failed to re-enable polling"
fi
# Summary
test_header "Test Summary"
echo "All tests completed!"
echo ""
echo "Key endpoints tested:"
echo " GET $BASE_URL/{unit_id}/polling/config"
echo " PUT $BASE_URL/{unit_id}/polling/config"
echo " GET $BASE_URL/_polling/status"
echo " GET $BASE_URL/{unit_id}/status (with polling fields)"
echo ""
success "Background polling feature is working correctly"
-128
@@ -1,128 +0,0 @@
#!/usr/bin/env python3
"""
Test script to verify that sleep mode is automatically disabled when:
1. Device configuration is created/updated with TCP enabled
2. Measurements are started
This script tests the API endpoints, not the actual device communication.
"""
import requests
import json
BASE_URL = "http://localhost:8100/api/nl43"
UNIT_ID = "test-nl43-001"
def test_config_update():
"""Test that config update works (actual sleep mode disable requires real device)"""
print("\n=== Testing Config Update ===")
# Create/update a device config
config_data = {
"host": "192.168.1.100",
"tcp_port": 2255,
"tcp_enabled": True,
"ftp_enabled": False,
"ftp_username": "admin",
"ftp_password": "password"
}
print(f"Updating config for {UNIT_ID}...")
response = requests.put(f"{BASE_URL}/{UNIT_ID}/config", json=config_data)
if response.status_code == 200:
print("✓ Config updated successfully")
print(f"Response: {json.dumps(response.json(), indent=2)}")
print("\nNote: Sleep mode disable was attempted (will succeed if device is reachable)")
return True
else:
print(f"✗ Config update failed: {response.status_code}")
print(f"Error: {response.text}")
return False
def test_get_config():
"""Test retrieving the config"""
print("\n=== Testing Get Config ===")
response = requests.get(f"{BASE_URL}/{UNIT_ID}/config")
if response.status_code == 200:
print("✓ Config retrieved successfully")
print(f"Response: {json.dumps(response.json(), indent=2)}")
return True
elif response.status_code == 404:
print("✗ Config not found (create one first)")
return False
else:
print(f"✗ Request failed: {response.status_code}")
print(f"Error: {response.text}")
return False
def test_start_measurement():
"""Test that start measurement attempts to disable sleep mode"""
print("\n=== Testing Start Measurement ===")
print(f"Attempting to start measurement on {UNIT_ID}...")
response = requests.post(f"{BASE_URL}/{UNIT_ID}/start")
if response.status_code == 200:
print("✓ Start command accepted")
print(f"Response: {json.dumps(response.json(), indent=2)}")
print("\nNote: Sleep mode was disabled before starting measurement")
return True
elif response.status_code == 404:
print("✗ Device config not found (create config first)")
return False
elif response.status_code == 502:
print("✗ Device not reachable (expected if no physical device)")
print(f"Response: {response.text}")
print("\nNote: This is expected behavior when testing without a physical device")
return True # This is actually success - the endpoint tried to communicate
else:
print(f"✗ Request failed: {response.status_code}")
print(f"Error: {response.text}")
return False
def main():
print("=" * 60)
print("Sleep Mode Auto-Disable Test")
print("=" * 60)
print("\nThis test verifies that sleep mode is automatically disabled")
print("when device configs are updated or measurements are started.")
print("\nNote: Without a physical device, some operations will fail at")
print("the device communication level, but the API logic will execute.")
# Run tests
results = []
# Test 1: Update config (should attempt to disable sleep mode)
results.append(("Config Update", test_config_update()))
# Test 2: Get config
results.append(("Get Config", test_get_config()))
# Test 3: Start measurement (should attempt to disable sleep mode)
results.append(("Start Measurement", test_start_measurement()))
# Summary
print("\n" + "=" * 60)
print("Test Summary")
print("=" * 60)
for test_name, result in results:
status = "✓ PASS" if result else "✗ FAIL"
print(f"{status}: {test_name}")
print("\n" + "=" * 60)
print("Implementation Details:")
print("=" * 60)
print("1. Config endpoint is now async and calls ensure_sleep_mode_disabled()")
print(" when TCP is enabled")
print("2. Start measurement endpoint calls ensure_sleep_mode_disabled()")
print(" before starting the measurement")
print("3. Sleep mode check is non-blocking - config/start will succeed")
print(" even if the device is unreachable")
print("=" * 60)
if __name__ == "__main__":
main()