chore: version bump

fix: recognize 'Start' state when confirming measurement start
The /start handler waited for measurement_state == "Measure", but the device reports "Start" while measuring. The confirmation check therefore never matched, so the post-start status loop always ran its full 3x DOD retry cycle over cellular, pushing the call past ~10s.

That blew past the Terra-View proxy's request timeout and surfaced to users as a misleading "Unknown error" even though the unit had already started recording.

Match the device's actual reported state (and stay consistent with persist_snapshot's MEASURING_STATES handling) so /start confirms on the first attempt and returns promptly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
2026-06-22 20:54:43 +00:00 · 2026-06-21 20:22:39 +00:00 · 2026-06-11 23:40:52 +00:00 · 2026-06-11 22:47:39 +00:00 · 2026-06-11 19:36:16 +00:00 · 2026-06-11 03:29:16 +00:00
21 changed files with 5433 additions and 453 deletions
@@ -1,5 +1,8 @@
 /manuals/
 /data/
+/data-dev/
+/SLM-stress-test/stress_test_logs/
+/SLM-stress-test/tcpdump-runs/

 # Python cache
 __pycache__/
@@ -11,4 +14,6 @@ __pycache__/
 *.egg
 *.egg-info/
 dist/
-build/
+build/
+
+*.pcap
@@ -5,6 +5,116 @@ All notable changes to SLMM (Sound Level Meter Manager) will be documented in th
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

+## [0.4.0] - 2026-06-22
+
+### Added
+
+#### Live Monitor (fan-out feed)
+- **Per-device fan-out monitor** - one shared, cached live feed per device. Multiple clients (dashboards, portal, charts) subscribe to the same stream instead of each fighting for the NL-43's single TCP connection: one poller reads the device, all subscribers get the same frames.
+- **WebSocket monitor** - `WS /api/nl43/{unit_id}/monitor` delivers an instant first frame from cache, then live updates.
+- **Monitor control** - `POST /api/nl43/{unit_id}/monitor/{start|stop}`, `GET /api/nl43/_monitor/status`. A persistent `monitor_enabled` flag auto-starts the keepalive on boot.
+- **Adaptive polling** - poll rate adapts to demand; unreachable devices back off; a device-offline alert fires when a monitored unit drops.
+- **De-duplication** - the background poller skips units already covered by an active monitor (no double-polling); a heartbeat keeps the feed warm.
+- **Lower latency** - the monitor caches run state, roughly halving live-feed latency; fan-out emits an instant first frame + offline status to new clients.
+
+#### Alert Engine
+- **Threshold rules** - per-device alert rules (metric + threshold + cooldown) with full CRUD: `POST/GET/PUT/DELETE /api/nl43/{unit_id}/alerts/rules[/{rule_id}]`.
+- **Events + state machine** - onset/clear tracking via `GET /api/nl43/{unit_id}/alerts/events`; acknowledge with `POST .../events/{event_id}/ack`. A `cooldown_s` is enforced between onsets.
+- **24/7 evaluation** - enabled rules pin the monitor on, so rules evaluate continuously even with no UI client connected.
+- **Resilience** - editing or deleting a rule resets its state and closes any open event; device-offline events are raised when a monitored unit goes unreachable.
+
+#### Data & History
+- **Live-chart backfill** - a downsampled DOD trail is persisted to a new `nl43_readings` table, exposed via `GET /api/nl43/{unit_id}/history` so charts can backfill recent history on load.
+- **LN1/LN2 percentiles** - L1/L10 (configurable percentiles) surfaced through SLMM in the status and live-feed payloads.
+- **measurement_start_time** included in the cached `/status` response.
+
+#### Device control
+- **Per-device disconnect** - `POST /api/nl43/{unit_id}/disconnect` drops a device's pooled connection.
+- **Deactivate / standby** - `POST /api/nl43/{unit_id}/deactivate` and global `POST /api/nl43/_system/standby` to quiesce polling/monitoring.
+
+### Changed
+- **DRD streaming reuses the pooled connection** rather than opening a separate socket, avoiding contention with the persistent pool on a single-connection device.
+- **Connection pool** - idle-TTL / max-age checks can now be disabled; pool status is logged periodically.
+
+### Fixed
+- **Measurement-start confirmation** - `/start` now recognizes the device's `Start` state. It previously waited for `Measure`, which never matched, so the start cycle ran the full retry loop and Terra-View's proxy timed out with a misleading "Unknown error" even though the device had started.
+- **Garbled reads** - corrupted measurement-state reads that produced phantom STOPPED/STARTED transitions are now ignored.
+- **DOD parsing** - corrected field parsing and stopped spurious measurement-time resets.
+- **Monitor WebSocket** - quieted a send-after-close race on client disconnect.
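
The measurement-start fix in this list can be illustrated with a minimal sketch. It is not the SLMM handler: `confirm_started`, the `read_state` callable, and the contents of `MEASURING_STATES` are assumptions made for illustration (the changelog only says the device reports `Start` while measuring).

```python
from typing import Callable

# Assumption: state strings that mean "recording". The real MEASURING_STATES
# used by persist_snapshot may contain different values.
MEASURING_STATES = frozenset({"Start", "Measure"})


def confirm_started(read_state: Callable[[], str], attempts: int = 3) -> int:
    """Return the 1-based attempt on which a measuring state was seen, else 0."""
    for attempt in range(1, attempts + 1):
        if read_state().strip() in MEASURING_STATES:
            return attempt
    return 0


# A device that reports "Start" is confirmed on the first read, so the
# remaining status reads (slow over cellular) are skipped.
first_attempt = confirm_started(lambda: "Start")
```

Checking membership in a set rather than comparing against a single string keeps the confirmation aligned with whatever states the snapshot code treats as measuring.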
+
+### Database
+- **New tables** (auto-created on startup via `Base.metadata.create_all`): `alert_rules`, `alert_events`, `nl43_readings`.
+- **Migrations for existing tables** (run once per database): `migrate_add_ln_percentiles.py` (LN1/LN2 on `nl43_status`), `migrate_add_monitor_enabled.py` (`monitor_enabled` on `nl43_config`).
+
+### Notes
+- Pairs with the matching Terra-View `dev` build, which reads SLMM's `/monitor` fan-out feed for live SLM dashboards (L1/L10 lines, live-chart backfill). Ship the two together.
+
+---
+
+## [0.3.0] - 2026-02-17
+
+### Added
+
+#### Persistent TCP Connection Pool
+- **Connection reuse** - TCP connections are cached per device and reused across commands, eliminating repeated TCP handshakes over cellular modems
+- **OS-level TCP keepalive** - Configurable keepalive probes keep cellular NAT tables alive and detect dead connections early (default: probe after 15s idle, every 10s, 3 failures = dead)
+- **Transparent retry** - If a cached connection goes stale, the system automatically retries once with a fresh connection, so a stale socket alone does not surface as a failure to the caller
+- **Stale connection detection** - Multi-layer detection via idle TTL, max age, transport state, and reader EOF checks
+- **Background cleanup** - Periodic task (every 30s) evicts expired connections from the pool
+- **Master switch** - Set `TCP_PERSISTENT_ENABLED=false` to revert to per-request connection behavior
+
+#### Connection Pool Diagnostics
+- `GET /api/nl43/_connections/status` - View pool configuration, active connections, age/idle times, and keepalive settings
+- `POST /api/nl43/_connections/flush` - Force-close all cached connections (useful for debugging)
+- **Connections tab on roster page** - Live UI showing pool config, active connections with age/idle/alive status, auto-refreshes every 5s, and flush button
+
+#### Environment Variables
+- `TCP_PERSISTENT_ENABLED` (default: `true`) - Master switch for persistent connections
+- `TCP_IDLE_TTL` (default: `300`) - Close idle connections after N seconds
+- `TCP_MAX_AGE` (default: `1800`) - Force reconnect after N seconds
+- `TCP_KEEPALIVE_IDLE` (default: `15`) - Seconds idle before keepalive probes start
+- `TCP_KEEPALIVE_INTERVAL` (default: `10`) - Seconds between keepalive probes
+- `TCP_KEEPALIVE_COUNT` (default: `3`) - Failed probes before declaring connection dead
+
+### Changed
+- **Health check endpoint** (`/health/devices`) - Now uses connection pool instead of opening throwaway TCP connections; checks for existing live connections first (zero-cost), only opens new connection through pool if needed
+- **Diagnostics endpoint** - Removed the separate port 443 modem check (an unnecessary extra handshake); the TCP reachability test now uses the connection pool
+- **DRD streaming** - Streaming connections now get TCP keepalive options set; cached connections are evicted before opening dedicated streaming socket
+- **Default timeouts tuned for cellular** - Idle TTL raised to 300s (5 min), max age raised to 1800s (30 min) to survive typical polling intervals over cellular links
+
+### Technical Details
+
+#### Architecture
+- `ConnectionPool` class in `services.py` manages a single cached connection per device key (NL-43 only supports one TCP connection at a time)
+- Uses existing per-device asyncio locks and rate limiting — no changes to concurrency model
+- Pool is a module-level singleton initialized from environment variables at import time
+- Lifecycle managed via FastAPI lifespan: cleanup task starts on startup, all connections closed on shutdown
+- `_send_command_unlocked()` refactored to use acquire/release/discard pattern with single-retry fallback
+- Command parsing extracted to `_execute_command()` method for reuse between primary and retry paths
+
+#### Cellular Modem Optimizations
+- Keepalive probes start after 15s idle, which keeps cellular NAT entries from expiring (typically 30-60s timeout)
+- 300s idle TTL ensures connections survive between polling cycles (default 60s interval)
+- 1800s max age allows a single socket to serve ~30 minutes of polling before forced reconnect
+- Health checks and diagnostics produce zero additional TCP handshakes when a pooled connection exists
+- Stale `$` prompt bytes drained from idle connections before command reuse
+
+### Breaking Changes
+None. This release is fully backward-compatible with v0.2.x. Set `TCP_PERSISTENT_ENABLED=false` for identical behavior to previous versions.
+
+---
+
+## [0.2.1] - 2026-01-23
+
+### Added
+- **Roster management**: UI and API endpoints for managing device rosters.
+- **Delete config endpoint**: Remove device configuration alongside cached status data.
+- **Scheduler hooks**: `start_cycle` and `stop_cycle` helpers for Terra-View scheduling integration.
+
+### Changed
+- **FTP logging**: Connection, authentication, and transfer phases now log explicitly.
+- **Documentation**: Reorganized docs/scripts and updated API notes for FTP/TCP verification.
+
 ## [0.2.0] - 2026-01-15

 ### Added
@@ -135,5 +245,7 @@ None. This release is fully backward-compatible with v0.1.x. All existing endpoi

 ## Version History Summary

+- **v0.3.0** (2026-02-17) - Persistent TCP connections with keepalive for cellular modem reliability
+- **v0.2.1** (2026-01-23) - Roster management, scheduler hooks, FTP logging, doc cleanup
 - **v0.2.0** (2026-01-15) - Background Polling System
 - **v0.1.0** (2025-12-XX) - Initial Release
@@ -1,6 +1,6 @@
 # SLMM - Sound Level Meter Manager

-**Version 0.2.0**
+**Version 0.4.0**

 Backend API service for controlling and monitoring Rion NL-43/NL-53 Sound Level Meters via TCP and FTP protocols.

@@ -12,8 +12,12 @@ SLMM is a standalone backend module that provides REST API routing and command t

 ## Features

- **Background Polling** ⭐ NEW: Continuous automatic polling of devices with configurable intervals
- **Offline Detection** ⭐ NEW: Automatic device reachability tracking with failure counters
+- **Live Monitor (fan-out)**: One shared cached live feed per device — many clients subscribe to the same stream instead of fighting over the meter's single TCP connection
+- **Alert Engine**: Per-device threshold rules with onset/clear events, cooldowns, acks, and 24/7 evaluation
+- **History & Percentiles**: Downsampled DOD trail + history endpoint for live-chart backfill; LN1/LN2 (L1/L10) percentiles surfaced through the feed
+- **Persistent TCP Connections**: Cached per-device connections with OS-level keepalive, tuned for cellular modem reliability
+- **Background Polling**: Continuous automatic polling of devices with configurable intervals
+- **Offline Detection**: Automatic device reachability tracking with failure counters
 - **Device Management**: Configure and manage multiple NL43/NL53 devices
 - **Real-time Monitoring**: Stream live measurement data via WebSocket
 - **Measurement Control**: Start, stop, pause, resume, and reset measurements
@@ -22,6 +26,7 @@ SLMM is a standalone backend module that provides REST API routing and command t
 - **Device Configuration**: Manage frequency/time weighting, clock sync, and more
 - **Rate Limiting**: Automatic 1-second delay enforcement between device commands
 - **Persistent Storage**: SQLite database for device configs and measurement cache
+- **Connection Diagnostics**: Live UI and API endpoints for monitoring TCP connection pool status

 ## Architecture

@@ -29,29 +34,63 @@ SLMM is a standalone backend module that provides REST API routing and command t
 ┌─────────────────┐         ┌──────────────────────────────┐         ┌─────────────────┐
 │                 │◄───────►│  SLMM API                    │◄───────►│  NL43/NL53      │
 │  (Frontend)     │  HTTP   │  • REST Endpoints            │  TCP    │  Sound Meters   │
-└─────────────────┘         │  • WebSocket Streaming       │         └─────────────────┘
-                            │  • Background Poller ⭐ NEW  │                ▲
-                            └──────────────────────────────┘                │
-                                          │                         Continuous
-                                          ▼                          Polling
-                                  ┌──────────────┐                      │
-                                  │  SQLite DB   │◄─────────────────────┘
+└─────────────────┘         │  • WebSocket Streaming       │  (kept  │  (via cellular  │
+                            │  • Background Poller         │  alive) │   modem)        │
+                            │  • Connection Pool (v0.3)    │         └─────────────────┘
+                            └──────────────────────────────┘
+                                          │
+                                          ▼
+                                  ┌──────────────┐
+                                  │  SQLite DB   │
                                  │  • Config    │
                                  │  • Status    │
                                  └──────────────┘
 ```

+### Live Monitor — Fan-Out Feed (v0.4.0)
+
+The NL-43 allows only one TCP control connection at a time, so multiple clients
+polling the same device directly would contend for it. The monitor solves this
+with a single shared, cached feed per device:
+
+- **One reader, many subscribers**: a single poller reads the device; every
+  WebSocket subscriber (`WS /api/nl43/{unit_id}/monitor`) receives the same
+  frames — an instant first frame from cache, then live updates.
+- **Persistent + auto-start**: a `monitor_enabled` flag keeps the feed running
+  and auto-starts it on boot. Enabled alert rules pin the monitor on for 24/7
+  evaluation even with no UI connected.
+- **Adaptive & deduplicated**: poll rate adapts to demand, unreachable devices
+  back off, and the background poller skips units already covered by a monitor.
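
The fan-out idea described above can be sketched roughly as follows. This is a simplified illustration, not SLMM's monitor: the `FanOutFeed` class, its method names, and the bounded-queue policy are all assumptions.

```python
import asyncio
from typing import Optional


class FanOutFeed:
    """One cached feed per device; every subscriber sees the same frames."""

    def __init__(self) -> None:
        self._latest: Optional[dict] = None
        self._subscribers = set()

    def subscribe(self) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue(maxsize=16)
        if self._latest is not None:
            queue.put_nowait(self._latest)  # instant first frame from cache
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self._subscribers.discard(queue)

    def publish(self, frame: dict) -> None:
        """Called by the single poller after each device read."""
        self._latest = frame
        for queue in self._subscribers:
            if queue.full():
                queue.get_nowait()  # drop the oldest frame for a slow client
            queue.put_nowait(frame)
```

Only the poller touches the device; each WebSocket handler would call `subscribe()`, forward frames from its queue, and call `unsubscribe()` on disconnect.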
+
+### Alert Engine (v0.4.0)
+
+Per-device threshold alerting evaluated against the live feed:
+
+- **Rules**: metric + threshold + `cooldown_s`, full CRUD per device
+- **Events**: onset/clear state machine, acknowledgement, and a device-offline
+  alert when a monitored unit drops
+- **Robust**: editing/deleting a rule resets its state and closes open events
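
As a rough illustration of the onset/clear state machine, the sketch below evaluates one "above" rule against a stream of samples. The `RuleState` class and its `evaluate` method are hypothetical; the field names mirror the rule columns (`threshold_db`, `cooldown_s`, `clear_margin_db`), and sustained-duration and schedule handling are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleState:
    threshold_db: float
    cooldown_s: float
    clear_margin_db: float = 0.0
    active: bool = False
    last_onset: Optional[float] = None

    def evaluate(self, value_db: float, now: float) -> Optional[str]:
        """Return "onset", "clear", or None for one sample of an 'above' rule."""
        if not self.active:
            cooled = self.last_onset is None or now - self.last_onset >= self.cooldown_s
            if value_db > self.threshold_db and cooled:
                self.active = True
                self.last_onset = now
                return "onset"
            return None
        # Hysteresis: the level must fall below threshold minus the margin to clear.
        if value_db < self.threshold_db - self.clear_margin_db:
            self.active = False
            return "clear"
        return None
```

The clear margin stops a level hovering around the threshold from flapping between onset and clear, and the cooldown limits how often a new onset can fire.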
+
+### Persistent TCP Connection Pool (v0.3.0)
+
+SLMM maintains persistent TCP connections to devices with OS-level keepalive, designed for reliable operation over cellular modems:
+
+- **Connection Reuse**: One cached TCP socket per device, reused across all commands (no repeated handshakes)
+- **TCP Keepalive**: Probes keep cellular NAT tables alive and detect dead connections early
+- **Transparent Retry**: Stale cached connections automatically retry with a fresh socket
+- **Configurable**: Idle TTL (300s), max age (1800s), and keepalive timing via environment variables
+- **Diagnostics**: Live UI on the roster page and API endpoints for monitoring pool status
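
The reuse-then-retry behaviour can be sketched as below. This is a synchronous toy, not the `ConnectionPool` in `services.py`: `SingleConnectionPool`, `send_with_retry`, and the `connect`/`command` callables are hypothetical, and a real pool would also close discarded sockets and check idle TTL and max age.

```python
class SingleConnectionPool:
    """Caches at most one connection per device key."""

    def __init__(self, connect):
        self._connect = connect  # callable: key -> new connection
        self._cache = {}

    def acquire(self, key, fresh=False):
        cached = self._cache.pop(key, None)
        if cached is not None and not fresh:
            return cached
        return self._connect(key)

    def release(self, key, conn):
        self._cache[key] = conn


def send_with_retry(pool, key, command):
    """Run command(conn); if the cached socket is stale, retry once on a new one."""
    conn = pool.acquire(key)
    try:
        result = command(conn)
    except OSError:
        # Discard the stale connection and retry exactly once.
        conn = pool.acquire(key, fresh=True)
        result = command(conn)
    pool.release(key, conn)
    return result
```

If the retry also fails, the error propagates and nothing is returned to the cache, so the next command starts from a fresh connection.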
+
 ### Background Polling (v0.2.0)

-SLMM now includes a background polling service that continuously queries devices and updates the status cache:
+The background polling service continuously queries devices and updates the status cache:

 - **Automatic Updates**: Devices are polled at configurable intervals (10-3600 seconds)
 - **Offline Detection**: Devices marked unreachable after 3 consecutive failures
 - **Per-Device Configuration**: Each device can have a custom polling interval
 - **Resource Efficient**: Dynamic sleep intervals and smart scheduling
- **Graceful Shutdown**: Background task stops cleanly on service shutdown

-This makes Terra-View significantly more responsive - status requests return cached data instantly (<100ms) instead of waiting for device queries (1-2 seconds).
+Status requests return cached data instantly (<100ms) instead of waiting for device queries (1-2 seconds).

 ## Quick Start

@@ -96,9 +135,18 @@ Once running, visit:

 ### Environment Variables

+**Server:**
 - `PORT`: Server port (default: 8100)
 - `CORS_ORIGINS`: Comma-separated list of allowed origins (default: "*")

+**TCP Connection Pool:**
+- `TCP_PERSISTENT_ENABLED`: Enable persistent connections (default: "true")
+- `TCP_IDLE_TTL`: Close idle connections after N seconds (default: 300)
+- `TCP_MAX_AGE`: Force reconnect after N seconds (default: 1800)
+- `TCP_KEEPALIVE_IDLE`: Seconds idle before keepalive probes (default: 15)
+- `TCP_KEEPALIVE_INTERVAL`: Seconds between keepalive probes (default: 10)
+- `TCP_KEEPALIVE_COUNT`: Failed probes before declaring dead (default: 3)
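
For orientation, the keepalive variables above map onto standard socket options roughly as in this sketch. The `apply_keepalive` helper is hypothetical and not taken from SLMM; the `TCP_KEEP*` option names are platform-specific (all three exist on Linux), so each is applied only if the `socket` module exposes it.

```python
import os
import socket


def apply_keepalive(sock: socket.socket) -> None:
    """Enable OS-level keepalive using the TCP_KEEPALIVE_* environment variables."""
    idle = int(os.environ.get("TCP_KEEPALIVE_IDLE", "15"))
    interval = int(os.environ.get("TCP_KEEPALIVE_INTERVAL", "10"))
    count = int(os.environ.get("TCP_KEEPALIVE_COUNT", "3"))

    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    for name, value in (
        ("TCP_KEEPIDLE", idle),       # seconds idle before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the peer is dead
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With the defaults, an unresponsive peer is declared dead roughly 15 + 3 × 10 = 45 seconds after the last traffic.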
+
 ### Database

 The SQLite database is automatically created at [data/slmm.db](data/slmm.db) on first run.
@@ -124,9 +172,33 @@ Logs are written to:
 |--------|----------|-------------|
 | GET | `/api/nl43/{unit_id}/status` | Get cached measurement snapshot (updated by background poller) |
 | GET | `/api/nl43/{unit_id}/live` | Request fresh DOD data from device (bypasses cache) |
+| GET | `/api/nl43/{unit_id}/history` | Downsampled DOD trail for live-chart backfill |
 | WS | `/api/nl43/{unit_id}/stream` | WebSocket stream for real-time DRD data |

-### Background Polling Configuration ⭐ NEW
+### Live Monitor (fan-out feed)
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| WS | `/api/nl43/{unit_id}/monitor` | Subscribe to the shared cached live feed (instant first frame) |
+| POST | `/api/nl43/{unit_id}/monitor/start` | Start the device's monitor feed |
+| POST | `/api/nl43/{unit_id}/monitor/stop` | Stop the device's monitor feed |
+| GET | `/api/nl43/_monitor/status` | Global monitor status across devices |
+| POST | `/api/nl43/{unit_id}/disconnect` | Drop the device's pooled TCP connection |
+| POST | `/api/nl43/{unit_id}/deactivate` | Quiesce polling/monitoring for one device |
+| POST | `/api/nl43/_system/standby` | Global standby — quiesce all polling/monitoring |
+
+### Alerts
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/nl43/{unit_id}/alerts/rules` | List alert rules for a device |
+| POST | `/api/nl43/{unit_id}/alerts/rules` | Create an alert rule (metric, threshold, cooldown) |
+| PUT | `/api/nl43/{unit_id}/alerts/rules/{rule_id}` | Update a rule (resets its state, closes open events) |
+| DELETE | `/api/nl43/{unit_id}/alerts/rules/{rule_id}` | Delete a rule |
+| GET | `/api/nl43/{unit_id}/alerts/events` | List alert events (onset/clear) |
+| POST | `/api/nl43/{unit_id}/alerts/events/{event_id}/ack` | Acknowledge an event |
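
A hedged sketch of creating a rule from Python follows. The endpoint path comes from the table above, but the JSON body shape is an assumption based on the rule column names (`name`, `metric`, `comparison`, `threshold_db`, `cooldown_s`); check the API schema before relying on it. The helper name, base URL, and unit ID are placeholders.

```python
import json
import urllib.request


def build_create_rule_request(base_url: str, unit_id: str, rule: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request that creates an alert rule."""
    url = f"{base_url}/api/nl43/{unit_id}/alerts/rules"
    return urllib.request.Request(
        url,
        data=json.dumps(rule).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_create_rule_request(
    "http://localhost:8100",
    "meter-001",
    {
        "name": "Site limit",
        "metric": "leq",
        "comparison": "above",
        "threshold_db": 85.0,
        "cooldown_s": 300,
    },
)
# urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```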
+
+### Background Polling

 | Method | Endpoint | Description |
 |--------|----------|-------------|
@@ -134,6 +206,13 @@ Logs are written to:
 | PUT | `/api/nl43/{unit_id}/polling/config` | Update polling interval and enable/disable polling |
 | GET | `/api/nl43/_polling/status` | Get global polling status for all devices |

+### Connection Pool
+
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/nl43/_connections/status` | Get pool config, active connections, age/idle times |
+| POST | `/api/nl43/_connections/flush` | Force-close all cached TCP connections |
+
 ### Measurement Control

 | Method | Endpoint | Description |
@@ -245,16 +324,43 @@ Caches latest measurement snapshot:
 - `sd_remaining_mb`: Free SD card space (MB)
 - `sd_free_ratio`: SD card free space ratio
 - `raw_payload`: Raw device response data
- `is_reachable`: Device reachability status (Boolean) ⭐ NEW
- `consecutive_failures`: Count of consecutive poll failures ⭐ NEW
- `last_poll_attempt`: Last time background poller attempted to poll ⭐ NEW
- `last_success`: Last successful poll timestamp ⭐ NEW
- `last_error`: Last error message (truncated to 500 chars) ⭐ NEW
+- `is_reachable`: Device reachability status (Boolean)
+- `consecutive_failures`: Count of consecutive poll failures
+- `last_poll_attempt`: Last time background poller attempted to poll
+- `last_success`: Last successful poll timestamp
+- `last_error`: Last error message (truncated to 500 chars)
+- `ln1` / `ln2`: LN1/LN2 (L1/L10) percentile levels ⭐ v0.4.0
+
+### NL43Readings Table ⭐ v0.4.0
+Downsampled DOD trail backing the live-chart history endpoint (one row/minute,
+pruned to a retention window — viewing only, not the report source):
+- `id` (PK), `unit_id`, `timestamp`
+- `lp` / `leq` / `lmax` / `ln1` / `ln2`: cached level samples
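
The one-row-per-minute downsampling could look something like this sketch. `MinuteDownsampler` and `should_persist` are hypothetical names; how SLMM actually chooses which sample to keep and how it prunes old rows is not specified here.

```python
from datetime import datetime
from typing import Dict


class MinuteDownsampler:
    """Lets through at most one reading per unit per wall-clock minute."""

    def __init__(self) -> None:
        self._last_minute: Dict[str, datetime] = {}

    def should_persist(self, unit_id: str, timestamp: datetime) -> bool:
        minute = timestamp.replace(second=0, microsecond=0)
        if self._last_minute.get(unit_id) == minute:
            return False  # a row for this unit and minute already exists
        self._last_minute[unit_id] = minute
        return True
```

This keeps the first sample of each minute. Averaging or keeping the last sample would be equally plausible choices for a viewing-only trail.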
+
+### AlertRule Table ⭐ v0.4.0
+Per-device threshold alert rules:
+- `id` (PK), `unit_id`, `name`, `enabled`
+- `metric`, `comparison` (above/below), `threshold_db`, `clear_margin_db` (hysteresis)
+- `duration_s` (sustained), `cooldown_s` (min seconds between onsets)
+- `channels` / `recipients`, optional `schedule_start`/`schedule_end`/`schedule_days`
+
+### AlertEvent Table ⭐ v0.4.0
+Alert onset/clear events for history, inbox, and acknowledgement:
+- `id` (PK), `unit_id`, `rule_id`, `rule_name`, `metric`, `threshold_db`
+- `onset_at` / `onset_value`, `peak_value`, `clear_at`, `status` (active/cleared)
+- `acknowledged_at` / `acknowledged_by`, `notes`
+
+> New tables (`alert_rules`, `alert_events`, `nl43_readings`) auto-create on
+> startup. Existing-table columns ship with migrations:
+> `migrate_add_ln_percentiles.py`, `migrate_add_monitor_enabled.py`.

 ## Protocol Details

 ### TCP Communication
 - Uses ASCII command protocol over TCP
+- Persistent connections with OS-level keepalive (tuned for cellular modems)
+- Connections cached per device and reused across commands
+- Transparent retry on stale connections
 - Enforces ≥1 second delay between commands to same device
 - Two-line response format:
  - Line 1: Result code (R+0000 for success)
@@ -320,6 +426,16 @@ curl http://localhost:8100/api/nl43/meter-001/polling/config
 curl http://localhost:8100/api/nl43/_polling/status
 ```

+### Check Connection Pool Status
+```bash
+curl http://localhost:8100/api/nl43/_connections/status | jq '.'
+```
+
+### Flush All Cached Connections
+```bash
+curl -X POST http://localhost:8100/api/nl43/_connections/flush
+```
+
 ### Verify Device Settings
 ```bash
 curl http://localhost:8100/api/nl43/meter-001/settings
@@ -388,11 +504,19 @@ See [API.md](API.md) for detailed integration examples.
 ## Troubleshooting

 ### Connection Issues
+- Check connection pool status: `curl http://localhost:8100/api/nl43/_connections/status`
+- Flush stale connections: `curl -X POST http://localhost:8100/api/nl43/_connections/flush`
 - Verify device IP address and port in configuration
 - Ensure device is on the same network
 - Check firewall rules allow TCP/FTP connections
 - Verify RX55 network adapter is properly configured on device

+### Cellular Modem Issues
+- If the modem wedges from too many handshakes, ensure `TCP_PERSISTENT_ENABLED=true` (default)
+- Increase `TCP_IDLE_TTL` if connections expire between poll cycles
+- Keepalive probes (default: start after 15s idle, then every 10s) keep NAT tables alive; adjust `TCP_KEEPALIVE_IDLE` or `TCP_KEEPALIVE_INTERVAL` if needed
+- Set `TCP_PERSISTENT_ENABLED=false` to disable pooling for debugging
+
 ### Rate Limiting
 - API automatically enforces 1-second delay between commands
 - If experiencing delays, this is normal device behavior
@@ -0,0 +1,403 @@
+# NL-43 + RX55 TCP “Wedge” Investigation (2255 Refusal) — Full Log & Next Steps
+**Last updated:** 2026-02-18  
+**Owner:** Brian / serversdown  
+**Context:** Terra-View / SLMM / field-deployed Rion NL-43 behind Sierra Wireless RX55
+
+---
+
+## 0) What this document is
+This is a **comprehensive, chronological** record of the debugging we did to isolate a failure where the **NL-43’s TCP control port (2255) eventually stops accepting connections** (“wedges”), while other services (notably FTP/21) remain reachable.
+
+This is written to be fed back into future troubleshooting, so it intentionally includes the **full reasoning chain, experiments, commands, packet evidence, and conclusions**.
+
+---
+
+## 1) Architecture (as tested)
+### Network path
+- **Server (SLMM host):** `10.0.0.40`
+- **RX55 WAN IP:** `63.45.161.30`
+- **RX55 LAN subnet:** `192.168.1.0/24`
+- **RX55 LAN gateway:** `192.168.1.1`
+- **NL-43 LAN IP:** `192.168.1.10` (confirmed via ARP OUI + ping; see LAN validation)
+
+### RX55 details
+- **Sierra Wireless RX55**
+- **OS:** 5.2
+- **Firmware:** `01.14.24.00`
+- **Carrier:** Verizon LTE (Band 66)
+
+### Port forwarding rules (RX55)
+- **WAN:2255 → NL-43:2255**  (NL-43 TCP control)
+- **WAN:21   → NL-43:21**    (NL-43 FTP control)
+
+You also experimented with additional forwards:
+- **WAN:2253 → NL-43:2255** (test)
+- **WAN:2253 → NL-43:2253** (test)
+- **WAN:4450 → NL-43:4450** (test)
+
+**Important:** The rule's “Input zone / interface” was set to **WAN-NAT**, and Source IP was left as **Any IPv4**. This is correct for inbound port-forward behavior on Sierra OS 5.x.
+
+---
+
+## 2) Original problem statement (the “wedge”)
+After running for hours, the NL-43 becomes unreachable over TCP control.
+
+### Symptom signature (WAN-side)
+- Client attempts to connect to `63.45.161.30:2255`
+- Instead of timing out, the client gets **connection refused** quickly.
+- Packet-level: SYN from client → **RST,ACK** back (meaning active refusal vs silent drop)
+
+### Critical operational behavior
+- **Power cycling the NL-43 fixes it.**
+- **Power cycling the RX55 does NOT fix it.**
+- FTP sometimes remains available even while TCP control (2255) is dead.
+
+This combination is what forced us to determine whether:
+- The RX55 is rejecting connections, OR
+- The NL-43 is no longer listening on 2255, OR
+- Something about the RX55 path triggers the NL-43’s control listener to die.
+
+---
+
+## 3) Event timeline evidence (SLMM logs)
+A concrete wedge window was observed on **2026-02-18**:
+
+- 10:55:46 AM — Poll success (Start)
+- 11:00:28 AM — Measurement STOPPED (scheduled stop/download cycle succeeded)
+- 11:55:50 AM — Poll success (Stop)
+- 12:55:55 PM — Poll success (Stop)
+- **1:55:58 PM — Poll failed (attempt 1/3): Errno 111 (connection refused)**
+- 2:56:02 PM — Poll failed (attempt 2/3): Errno 111 (connection refused)
+
+Key interpretation:
+- The wedge occurred sometime between **12:55 and 1:55**.
+- The failure type is **refused**, not timeout.
+
+---
+
+## 4) Early hypotheses (before proof)
+We considered two main buckets:
+
+### A) NL-43-side failure (most suspicious)
+- NL-43 TCP control service crashes / exits / unbinds from 2255
+- socket leak / accept backlog exhaustion
+- “single control session allowed” and it gets stuck thinking a session is active
+- mode/service manager bug (service restart fails after other activities)
+- firmware bug in TCP daemon
+
+### B) RX55-side failure (possible trigger / less likely once FTP works)
+- NAT/forwarding table corruption
+- firewall behavior
+- helper/ALG interference
+- MSS/MTU weirdness causing edge-case behavior
+- session churn behavior causing downstream issues
+
+---
+
+## 5) Key experiments and what they proved
+
+### 5.1) LAN-only stability test (No RX55 path)
+**Test:** NL-43 tested directly on LAN (no modem path involved).
+- Ran **24+ hours**
+- Scheduler start/stop cycles worked
+- Stress test: **500 commands @ 1/sec** → no failure
+- Response times trended downward over the run (no degradation)
+
+**Result:** The NL-43 appears stable in a “pure LAN” environment.
+
+**Interpretation:** The trigger is likely related to the RX55/WAN environment, connection patterns, or service switching patterns—not just simple uptime.
+
+---
+
+### 5.2) Port-forward behavior: timeout vs refused (RX55 behavior characterization)
+You observed:
+
+- **If a WAN port is NOT forwarded (no rule):** connecting to that port **times out** (silent drop)
+- **If a WAN port IS forwarded to NL-43 but nothing listens:** it **actively refuses** (RST)
+
+Concrete example:
+- Port **4450** with no rule → timeout
+- Port **4450 → NL-43:4450** rule created → connection refused
+
+**Interpretation:** This confirms the RX55 is actually forwarding packets to the NL-43 when a rule exists. “Refused” is consistent with the NL-43 (or RX55 relay behavior) responding quickly because the packet reached the target.
+
+Important nuance:
+- A “refused” on forwarded ports does **not** automatically prove the NL-43 is the one generating RST, because NAT hides the inside host and the RX55 could reject on behalf of an unreachable target. We needed a LAN-side proof test to close the loop.
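
The timeout-versus-refused distinction above can be checked programmatically with a small probe like the one below. It is a sketch: `probe_tcp` is a hypothetical helper, not part of SLMM, and like the manual tests it cannot tell which hop generated a RST.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP connect attempt as "open", "refused", or "timeout"."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # A RST came back: the SYN reached something, but nothing is listening.
        return "refused"
    except socket.timeout:
        # No reply at all: consistent with a silent drop (e.g. no forward rule).
        return "timeout"
```

Run against the WAN address, `probe_tcp("63.45.161.30", 2255)` returning `"refused"` while port 21 returns `"open"` would reproduce the signature described in this log. Other errors, such as host unreachable, are deliberately left to propagate.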
+
+---
+
+### 5.3) UDP test confusion (and resolution)
+You ran:
+
+```bash
+nc -vzu 63.45.161.30 2255
+nc -vz  63.45.161.30 2255
+```
+
+Observed:
+- UDP: “succeeded”
+- TCP: “connection refused”
+
+Resolution:
+- UDP has **no handshake**. netcat prints “succeeded” if it doesn’t immediately receive an ICMP unreachable. It does **not** mean a UDP service exists.
+- TCP refused is meaningful: a RST implies “no listener” or “actively rejected.”
+
+**Net effect:** UDP test did not change the diagnosis.
+
+---
+
+### 5.4) Packet capture proof (WAN-side)
+You captured a Wireshark/tcpdump summary with these key patterns:
+
+#### Port 2255 (TCP control)
+Example:
+- `10.0.0.40 → 63.45.161.30:2255` SYN
+- `63.45.161.30 → 10.0.0.40` **RST, ACK** within ~50ms
+
+This happened repeatedly.
+
+#### Port 2253 (test port)
+Multiple SYN attempts to 2253 showed **retransmissions and no response**, i.e., **silent drop** (consistent with no rule or not forwarded at that moment).
+
+#### Port 21 (FTP)
+Clean 3-way handshake:
+- SYN → SYN/ACK → ACK
+Then:
+- FTP server banner: `220 Connection Ready`
+Then:
+- `530 Not logged in` (because SLMM was sending non-FTP “requests” as an experiment)
+Session closes cleanly.
+
+**Key takeaway from capture:**
+- TCP transport to NL-43 via RX55 is definitely working (port 21 proves it).
+- Port 2255 is being actively refused.
+
+This strongly suggested “2255 listener is gone,” but still didn’t fully prove whether the refusal was generated internally by NL-43 or by RX55 on behalf of NL-43.
+
+---
+
+## 6) The decisive experiment: LAN-side test while wedged (final proof)
+Because the RX55 does not offer SSH, the plan was to test from **inside the LAN behind the RX55**.
+
+### 6.1) Physical LAN tap setup
+Constraint:
+- NL-43 has only one Ethernet port.
+
+Solution:
+- Insert an unmanaged switch:
+  - RX55 LAN → switch
+  - NL-43 → switch
+  - Windows 10 laptop → switch
+
+This creates a shared L2 segment where the laptop can test NL-43 directly.
+
+### 6.2) Windows LAN validation
+On the Windows laptop:
+
+- `ipconfig` showed:
+  - IP: `192.168.1.100`
+  - Gateway: `192.168.1.1` (RX55)
+- Initial `arp -a` only showed RX55, not NL-43.
+
+You then:
+- pinged likely host addresses and discovered NL-43 responds on **192.168.1.10**
+- `arp -a` then showed:
+  - `192.168.1.10 → 00-10-50-14-0a-d8`
+  - OUI `00-10-50` recognized as **Rion** (matches NL-43)
+
+So LAN identities were confirmed:
+- RX55: `192.168.1.1`
+- NL-43: `192.168.1.10`
+
+### 6.3) The LAN port tests (the smoking gun)
+From Windows:
+
+```powershell
+Test-NetConnection -ComputerName 192.168.1.10 -Port 2255
+Test-NetConnection -ComputerName 192.168.1.10 -Port 21
+```
+
+Results (while the unit was “wedged” from the WAN perspective):
+- **2255:** `TcpTestSucceeded : False`
+- **21:**   `TcpTestSucceeded : True`
+
+**Conclusion (PROVEN):**
+- The NL-43 is reachable on the LAN
+- FTP port 21 is alive
+- **The NL-43 is NOT listening on TCP port 2255**
+- Therefore the RX55 is not the root cause of the refusal. The WAN refusal is consistent with the NL-43 having no listener on 2255.
+
+This is now settled.
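
The same two-port check can be scripted so it is repeatable from any machine on the LAN segment. The following is a minimal sketch, not part of SLMM: `probe_tcp` is a hypothetical helper name, and the outcome labels (`open`, `refused`, `timeout`, `unreachable`) are assumptions chosen for this illustration.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connect and classify the outcome.

    Returns 'open', 'refused', 'timeout', or 'unreachable'.
    """
    try:
        # create_connection performs the full three-way handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # The peer answered with RST: the host is up but nothing is listening.
        return "refused"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout (filtered or host down).
        return "timeout"
    except OSError:
        # Any other socket-level failure (no route, network down, ...).
        return "unreachable"
```

Under these assumptions, the wedged state described above would show up as `probe_tcp("192.168.1.10", 2255)` returning `refused` while `probe_tcp("192.168.1.10", 21)` returns `open`.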
+
+---
+
+## 7) What we learned (final conclusions)
+### 7.1) RX55 innocence (for this failure mode)
+The RX55 is not “randomly rejecting” or “breaking TCP” in the way originally feared.
+
+It successfully forwards and supports TCP to the NL-43 on port 21, and the LAN-side test proves the 2255 failure exists *even without NAT/WAN involvement*.
+
+### 7.2) NL-43 control listener failure
+The NL-43’s TCP control service (port 2255) stops listening while:
+- the device remains alive
+- the LAN stack remains alive (ping)
+- FTP remains alive (port 21)
+
+This looks like one of:
+- control daemon crash/exit
+- service unbind
+- stuck service state (e.g., “busy” / “session active forever”)
+- resource leak (sockets/file descriptors) specific to the control service
+- firmware service manager bug (start/stop of services fails after certain sequences)
+
+---
+
+## 8) Additional constraint discovered: “Web App mode” conflicts
+You noted an important operational constraint:
+
+> Turning on the web app disables other interfaces like TCP and FTP.
+
+This means the NL-43 appears to have mutually exclusive service modes (or at least serious conflicts between them). That matters because:
+- If any workflow toggles modes (explicitly or implicitly), it could destabilize the service lifecycle.
+- It reduces the possibility of using “web UI toggle” as an easy remote recovery mechanism **if** it disables the services needed.
+
+We have not yet run a controlled long test to determine whether:
+- mode switching contributes directly to the 2255 listener dying, OR
+- it happens even in a pure TCP-only mode with no switching.
+
+---
+
+## 9) Immediate operational decision (field tomorrow)
+Because the device is needed in the field immediately, you chose:
+- **Old-school manual deployment**
+- **Manual SD card downloads**
+- Avoid reliance on 2255/TCP control and remote workflows for now.
+
+**Important operational note:**
+The 2255 listener dying does not necessarily stop the NL-43 from measuring; it primarily breaks remote control/polling. Manual SD workflow sidesteps the entire remote control dependency.
+
+---
+
+## 10) What’s next (future work — when the unit is back)
+Because long tests can’t be run before tomorrow, the plan is to resume in a few weeks with controlled experiments designed to isolate the trigger and develop an operational mitigation.
+
+### 10.1) Controlled experiment matrix (recommended)
+Run each test for 24–72 hours, or until wedge occurs, and record:
+- number of TCP connects
+- whether connections are persistent
+- whether FTP is used
+- whether any mode toggling is performed
+- time-to-wedge
+
+#### Test A — TCP-only (ideal baseline)
+- TCP control only (2255)
+- **True persistent connection** (open once, keep forever)
+- No FTP
+- No web mode toggling
+
+Outcome interpretation:
+- If stable: connection churn and/or FTP/mode switching is the trigger.
+- If it wedges anyway: pure 2255 daemon leak/bug.
+
+#### Test B — TCP with connection churn
+- Same as A but intentionally reconnect on a schedule (current SLMM behavior)
+- No FTP
+
+Outcome:
+- If this wedges but A doesn’t: churn is the trigger.
+
+#### Test C — FTP activity + TCP
+- Introduce scheduled FTP sessions (downloads) while using TCP control
+- Observe whether wedge correlates with FTP use or with post-download periods.
+
+Outcome:
+- If wedge correlates with FTP, suspect internal service lifecycle conflict.
+
+#### Test D — Web mode interaction (only if safe/possible)
+- Evaluate what toggling web mode does to TCP/FTP services.
+- Determine if any remote-safe “soft reset” exists.
+
+---
+
+## 11) Mitigation options (ranked)
+### Option 1 — Make SLMM truly persistent (highest probability of success)
+If the NL-43 wedges due to session churn or leaked socket states, the best mitigation is:
+- Open one TCP socket per device
+- Keep it open indefinitely
+- Use OS keepalive
+- Do **not** rotate connections on timers
+- Reconnect only when the socket actually dies
+
+This reduces:
+- connect/close cycles
+- NAT edge-case exposure
+- resource churn inside NL-43
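
As a rough illustration of the "one socket, OS keepalive" idea, here is a sketch using only the Python standard library. The helper names (`enable_keepalive`, `open_persistent`) and the timing values are assumptions for illustration, not SLMM's actual connection code. The fine-grained keepalive options are platform-specific, so they are applied only where the `socket` module exposes them.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 15, probes: int = 4) -> None:
    """Turn on TCP keepalive and, where supported, tune its timing."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These constants are not available on every platform, so guard each one.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # idle seconds before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the socket dies
    )
    for name, value in tuning:
        opt = getattr(socket, name, None)
        if opt is not None:
            sock.setsockopt(socket.IPPROTO_TCP, opt, value)


def open_persistent(host: str, port: int, timeout: float = 10.0) -> socket.socket:
    """Open one long-lived control connection with keepalive enabled."""
    sock = socket.create_connection((host, port), timeout=timeout)
    enable_keepalive(sock)
    return sock
```

A caller would keep the returned socket for the life of the process and call `open_persistent` again only after a send or receive fails, rather than on a timer.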
+
+### Option 2 — Service “soft reset” (if possible without disabling required services)
+If there is any way to restart the 2255 service without power cycling, such as:
+- LAN TCP toggle (if it doesn’t require web mode)
+- any “restart comms” command (unknown)
+- any maintenance menu sequence
+
+then SLMM could:
+- detect wedge
+- trigger soft reset
+- recover automatically
+
+Current constraint: web app mode appears to disable other services, so this may not be viable.
+
+### Option 3 — Hardware watchdog power cycle (industrial but reliable)
+If this is a firmware bug with no clean workaround:
+- Add a remotely controlled relay/power switch
+- On wedge detection, power-cycle NL-43 automatically
+- Optionally schedule a nightly power cycle to prevent leak accumulation
+
+This is “field reality” and often the only dependable long-term fix for embedded devices.
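
One way the wedge detection for such a watchdog might look is sketched below. Everything here is hypothetical: `WedgeWatchdog`, the `power_cycle` callback (standing in for whatever drives the relay), the `"refused"`/`"open"` state labels, and the three-hit debounce are assumptions for illustration. The wedge signature itself (control port refused while FTP still answers) is the one established in this report.

```python
from dataclasses import dataclass
from typing import Callable


def is_wedged(control_state: str, ftp_state: str) -> bool:
    """Wedge signature: control port (2255) refused while FTP (21) is open."""
    return control_state == "refused" and ftp_state == "open"


@dataclass
class WedgeWatchdog:
    power_cycle: Callable[[], None]  # hypothetical hook that toggles the relay
    required_hits: int = 3           # consecutive wedged probes before acting
    hits: int = 0

    def observe(self, control_state: str, ftp_state: str) -> bool:
        """Feed one probe result; return True if a power cycle was triggered."""
        if is_wedged(control_state, ftp_state):
            self.hits += 1
        else:
            self.hits = 0  # any non-wedged observation resets the debounce
        if self.hits >= self.required_hits:
            self.hits = 0
            self.power_cycle()
            return True
        return False
```

Requiring several consecutive wedged observations avoids power-cycling the unit on a single transient refusal, for example while the device is still booting.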
+
+### Option 4 — Vendor escalation (Rion)
+You now have excellent evidence:
+- LAN-side proof: 2255 dead while 21 alive
+- WAN packet evidence
+- clear isolation of RX55 innocence
+
+This is strong enough to send to Rion support as a firmware defect report.
+
+---
+
+## 12) Repro “wedge bundle” checklist (for future captures)
+When the wedge happens again, capture these before power cycling:
+
+1) From server:
+- `nc -vz 63.45.161.30 2255` (expect refused)
+- `nc -vz 63.45.161.30 21`   (expect success if FTP alive)
+
+2) From LAN side (via switch/laptop):
+- `Test-NetConnection 192.168.1.10 -Port 2255`
+- `Test-NetConnection 192.168.1.10 -Port 21`
+
+3) Optional: packet capture around the refused attempt.
+
+4) Record:
+- last successful poll timestamp
+- last FTP session timestamp
+- any scheduled start/stop/download cycles near wedge time
+- SLMM connection reuse/rotation settings in effect
+
+---
+
+## 13) Final, current-state summary (as of 2026-02-18)
+- The issue is **NOT** the RX55 rejecting inbound connections.
+- The NL-43 is **alive**, reachable on LAN, and FTP works.
+- The NL-43’s **TCP control listener on 2255 stops listening** while the device remains otherwise healthy.
+- The wedge can occur hours after successful operations.
+- The unit is needed in the field immediately, so investigation pauses.
+- Next phase: controlled tests to isolate trigger + implement mitigation (persistent socket or watchdog reset).
+
+---
+
+## 14) Notes / misc observations
+- The Wireshark trace showed repeated FTP sessions were opened and closed cleanly, but SLMM’s “FTP requests” were not valid FTP (causing `530 Not logged in`). That was part of experimentation, not a normal workflow.
+- UDP “success” via netcat is not meaningful because UDP has no handshake; it simply indicates no ICMP unreachable was returned.
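
The UDP point can be demonstrated with a short sketch. `udp_send_succeeds` is a hypothetical helper written for this note. It sends one datagram from an unconnected socket and reports only whether the local send call succeeded, which says nothing about whether anything is listening on the far side.

```python
import socket


def udp_send_succeeds(host: str, port: int) -> bool:
    """Send one datagram and report whether the local send call succeeded."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        try:
            # No handshake happens here: the datagram is simply handed to the
            # network stack, so success does not imply a listener exists.
            sent = sock.sendto(b"ping", (host, port))
        except OSError:
            return False
        return sent == len(b"ping")
```

Pointing this at a port with no listener will normally still return `True`, which is why a UDP "success" from netcat carries no diagnostic weight here.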
+
+---
+
+**End of document.**
@@ -0,0 +1,322 @@
+"""
+Threshold alert engine.
+
+Each unit can have any number of AlertRules. A rule is evaluated against the
+unit's live monitor snapshots via a small per-(unit, rule) state machine:
+
+    IDLE  --(metric exceeds threshold for duration_s)-->  ACTIVE   (fire ONSET)
+    ACTIVE --(metric recovers past hysteresis for duration_s)--> IDLE (fire CLEAR)
+
+duration_s debounces both edges; clear_margin_db adds hysteresis so a level
+hovering at the threshold doesn't flap. Onset and clear are distinct events.
+
+The state-machine logic (`_evaluate_step`) is intentionally pure — no DB, no
+real clock — so it can be unit-tested with a synthetic level series and a fake
+clock. The AlertEvaluator wraps it with rule loading, scheduling, persistence,
+and dispatch. Dispatch is a server log for now (POC); the seam to POST events to
+a Terra-View webhook (email/SMS) is _dispatch().
+"""
+
+import asyncio
+import logging
+import os
+from dataclasses import dataclass
+from datetime import datetime, timedelta
+from typing import Dict, List, Optional, Tuple
+
+logger = logging.getLogger(__name__)
+
+# Local timezone offset for schedule windows (same env var services.py uses).
+_TZ_OFFSET_HOURS = float(os.getenv("TIMEZONE_OFFSET", "-5"))
+
+# How long to cache a unit's rules before re-querying the DB (rules change rarely).
+_RULE_CACHE_TTL_S = 15.0
+
+
+@dataclass
+class RuleState:
+    """In-memory runtime state for one (unit, rule)."""
+    phase: str = "idle"                 # "idle" | "active"
+    edge_since: Optional[float] = None  # when the current edge condition began (clock time)
+    peak: float = 0.0
+    event_id: Optional[int] = None      # the open AlertEvent row (for the clear update)
+    last_onset: Optional[float] = None  # time of the last onset (for cooldown)
+
+
+def _exceeds(value: float, rule) -> bool:
+    if rule.comparison == "below":
+        return value < rule.threshold_db
+    return value > rule.threshold_db
+
+
+def _recovered(value: float, rule) -> bool:
+    margin = rule.clear_margin_db or 0.0
+    if rule.comparison == "below":
+        return value > rule.threshold_db + margin
+    return value < rule.threshold_db - margin
+
+
+def _evaluate_step(state: RuleState, value: float, now: float, rule) -> Optional[str]:
+    """Advance the state machine by one reading.
+
+    No I/O and no real clock: the only side effect is mutating `state` in place.
+    Returns 'onset' | 'clear' | None. `now` is injected so tests can drive a fake
+    clock.
+    """
+    duration = rule.duration_s or 0
+
+    if state.phase == "idle":
+        if _exceeds(value, rule):
+            if state.edge_since is None:
+                state.edge_since = now
+            if now - state.edge_since >= duration:
+                # Cooldown: suppress a new onset within cooldown_s of the last one
+                # (stops a repeatedly-breaching signal from flooding the history).
+                # Hold edge_since so it fires the moment cooldown lapses if still
+                # breaching — don't reset it here.
+                cooldown = getattr(rule, "cooldown_s", 0) or 0
+                if state.last_onset is not None and (now - state.last_onset) < cooldown:
+                    return None
+                state.phase = "active"
+                state.edge_since = None
+                state.peak = value
+                state.last_onset = now
+                return "onset"
+        else:
+            state.edge_since = None
+        return None
+
+    # active
+    if rule.comparison == "below":
+        state.peak = min(state.peak, value)
+    else:
+        state.peak = max(state.peak, value)
+
+    if _recovered(value, rule):
+        if state.edge_since is None:
+            state.edge_since = now
+        if now - state.edge_since >= duration:
+            state.phase = "idle"
+            state.edge_since = None
+            return "clear"
+    else:
+        state.edge_since = None
+    return None
+
+
+def _in_window(now_minutes: int, start: str, end: str) -> bool:
+    """Is now_minutes (minutes since local midnight) within [start, end)?
+    Handles wraparound windows like 22:00–07:00; start == end means all day."""
+    def _m(s: str) -> int:
+        h, m = s.split(":")
+        return int(h) * 60 + int(m)
+    s, e = _m(start), _m(end)
+    if s == e:
+        return True
+    if s < e:
+        return s <= now_minutes < e
+    return now_minutes >= s or now_minutes < e  # wraparound
+
+
+class AlertEvaluator:
+    def __init__(self):
+        self._states: Dict[Tuple[str, int], RuleState] = {}
+        self._rule_cache: Dict[str, Tuple[float, list]] = {}  # unit_id -> (fetched_at, rules)
+        self._offline_events: Dict[str, int] = {}  # unit_id -> open connectivity AlertEvent id
+        logger.info("[ALERT] rule-based evaluator ready")
+
+    async def evaluate(self, unit_id: str, snap) -> None:
+        """Evaluate every enabled rule for this unit against one snapshot."""
+        rules = self._get_rules(unit_id)
+        if not rules:
+            return
+        now = asyncio.get_running_loop().time()
+        for rule in rules:
+            if not self._in_schedule(rule):
+                continue
+            raw = getattr(snap, rule.metric, None)
+            try:
+                value = float(raw)
+            except (TypeError, ValueError):
+                continue  # missing / non-numeric ("-.-")
+            state = self._states.setdefault((unit_id, rule.id), RuleState())
+            action = _evaluate_step(state, value, now, rule)
+            if action == "onset":
+                await self._on_onset(unit_id, rule, value, state)
+            elif action == "clear":
+                await self._on_clear(unit_id, rule, value, state)
+
+    # -- rule loading (cached) ----------------------------------------------
+
+    def _get_rules(self, unit_id: str) -> list:
+        loop_now = asyncio.get_running_loop().time()
+        cached = self._rule_cache.get(unit_id)
+        if cached and loop_now - cached[0] < _RULE_CACHE_TTL_S:
+            return cached[1]
+        rules = self._load_rules(unit_id)
+        self._rule_cache[unit_id] = (loop_now, rules)
+        return rules
+
+    def _load_rules(self, unit_id: str) -> list:
+        from app.database import SessionLocal
+        from app.models import AlertRule
+        db = SessionLocal()
+        try:
+            return db.query(AlertRule).filter_by(unit_id=unit_id, enabled=True).all()
+        except Exception as e:
+            logger.warning(f"[ALERT] failed to load rules for {unit_id}: {e}")
+            return []
+        finally:
+            db.close()
+
+    def invalidate(self, unit_id: Optional[str] = None) -> None:
+        """Drop cached rules so a change is picked up immediately."""
+        if unit_id is None:
+            self._rule_cache.clear()
+        else:
+            self._rule_cache.pop(unit_id, None)
+
+    def forget_rule(self, unit_id: str, rule_id: int) -> None:
+        """Drop a rule's per-(unit, rule) state machine after the rule is edited or
+        deleted, so a stale 'active' phase / open event_id from the old config
+        doesn't bleed into the new one (mis-firing a clear or suppressing an onset)."""
+        self._states.pop((unit_id, rule_id), None)
+
+    # -- scheduling ----------------------------------------------------------
+
+    def _in_schedule(self, rule) -> bool:
+        if not rule.schedule_start or not rule.schedule_end:
+            # No time window configured: only the day-of-week filter applies.
+            return self._day_ok(rule)
+        local = datetime.utcnow() + timedelta(hours=_TZ_OFFSET_HOURS)
+        if not self._day_ok(rule, local):
+            return False
+        return _in_window(local.hour * 60 + local.minute, rule.schedule_start, rule.schedule_end)
+
+    @staticmethod
+    def _day_ok(rule, local: Optional[datetime] = None) -> bool:
+        if not rule.schedule_days:
+            return True
+        if local is None:
+            local = datetime.utcnow() + timedelta(hours=_TZ_OFFSET_HOURS)
+        allowed = {int(d) for d in str(rule.schedule_days).split(",") if d.strip() != ""}
+        return local.weekday() in allowed  # Mon=0
+
+    # -- event persistence + dispatch ---------------------------------------
+
+    async def _on_onset(self, unit_id: str, rule, value: float, state: RuleState) -> None:
+        from app.database import SessionLocal
+        from app.models import AlertEvent
+        db = SessionLocal()
+        try:
+            evt = AlertEvent(
+                rule_id=rule.id, unit_id=unit_id, rule_name=rule.name,
+                metric=rule.metric, threshold_db=rule.threshold_db,
+                onset_value=value, peak_value=value, status="active",
+            )
+            db.add(evt)
+            db.commit()
+            db.refresh(evt)
+            state.event_id = evt.id
+        except Exception as e:
+            logger.warning(f"[ALERT] failed to record onset for {unit_id}: {e}")
+        finally:
+            db.close()
+        await self._dispatch(
+            "ONSET", unit_id, rule,
+            f"{rule.metric.upper()}={value:.1f} dB "
+            f"{'<' if rule.comparison == 'below' else '>'} {rule.threshold_db:.1f} dB"
+            f"{f' for {rule.duration_s}s' if rule.duration_s else ''}",
+        )
+
+    async def _on_clear(self, unit_id: str, rule, value: float, state: RuleState) -> None:
+        peak = state.peak
+        from app.database import SessionLocal
+        from app.models import AlertEvent
+        db = SessionLocal()
+        try:
+            if state.event_id is not None:
+                evt = db.query(AlertEvent).filter_by(id=state.event_id).first()
+                if evt:
+                    evt.clear_at = datetime.utcnow()
+                    evt.peak_value = peak
+                    evt.status = "cleared"
+                    db.commit()
+        except Exception as e:
+            logger.warning(f"[ALERT] failed to record clear for {unit_id}: {e}")
+        finally:
+            db.close()
+        state.event_id = None
+        await self._dispatch(
+            "CLEAR", unit_id, rule,
+            f"recovered to {value:.1f} dB (peak {peak:.1f} dB)",
+        )
+
+    # -- connectivity (device offline/online) -------------------------------
+    #
+    # Raised by the live monitor when it loses / regains contact with a device.
+    # Persisted as an AlertEvent (sentinel rule_id=0, metric="connectivity") so it
+    # lands in the same events/inbox/ack pipeline as threshold alerts. The in-memory
+    # map dedupes; the DB query also dedupes across a process restart.
+
+    async def device_offline(self, unit_id: str) -> None:
+        if unit_id in self._offline_events:
+            return  # already flagged offline
+        from app.database import SessionLocal
+        from app.models import AlertEvent
+        db = SessionLocal()
+        try:
+            existing = db.query(AlertEvent).filter_by(
+                unit_id=unit_id, metric="connectivity", status="active").first()
+            if existing:  # already open in the DB (e.g. carried across a restart)
+                self._offline_events[unit_id] = existing.id
+                return
+            evt = AlertEvent(
+                rule_id=0, unit_id=unit_id, rule_name="Device unreachable",
+                metric="connectivity", threshold_db=0.0, status="active",
+            )
+            db.add(evt)
+            db.commit()
+            db.refresh(evt)
+            self._offline_events[unit_id] = evt.id
+        except Exception as e:
+            logger.warning(f"[ALERT] failed to record offline for {unit_id}: {e}")
+        finally:
+            db.close()
+        await self._dispatch_raw("OFFLINE", unit_id, "Device unreachable",
+                                 "live monitor lost contact with the device")
+
+    async def device_online(self, unit_id: str) -> None:
+        self._offline_events.pop(unit_id, None)
+        from app.database import SessionLocal
+        from app.models import AlertEvent
+        db = SessionLocal()
+        cleared = 0
+        try:
+            opened = db.query(AlertEvent).filter_by(
+                unit_id=unit_id, metric="connectivity", status="active").all()
+            for evt in opened:
+                evt.clear_at = datetime.utcnow()
+                evt.status = "cleared"
+                cleared += 1
+            if cleared:
+                db.commit()
+        except Exception as e:
+            logger.warning(f"[ALERT] failed to record online for {unit_id}: {e}")
+        finally:
+            db.close()
+        if cleared:  # only announce recovery if it was actually flagged offline
+            await self._dispatch_raw("ONLINE", unit_id, "Device recovered",
+                                     "live monitor regained contact with the device")
+
+    # -- dispatch ------------------------------------------------------------
+
+    async def _dispatch(self, kind: str, unit_id: str, rule, detail: str) -> None:
+        await self._dispatch_raw(kind, unit_id, rule.name, detail)
+
+    async def _dispatch_raw(self, kind: str, unit_id: str, name: str, detail: str) -> None:
+        """POC dispatch: server log. Swap in a Terra-View webhook (email/SMS) here."""
+        logger.warning(f"[ALERT:{kind}] {unit_id} '{name}': {detail}")
+
+
+# Module-level singleton (the monitor calls alert_evaluator.evaluate per snapshot)
+alert_evaluator = AlertEvaluator()
@@ -8,6 +8,7 @@ for fast API access without querying devices on every request.

 import asyncio
 import logging
+import os
 from datetime import datetime, timedelta
 from typing import Optional

@@ -15,17 +16,23 @@ from sqlalchemy.orm import Session

 from app.database import SessionLocal
 from app.models import NL43Config, NL43Status
-from app.services import NL43Client, persist_snapshot
+from app.services import NL43Client, persist_snapshot, sync_measurement_start_time_from_ftp
+from app.device_logger import log_device_event, cleanup_old_logs

 logger = logging.getLogger(__name__)

+# Global polling default. Set SLMM_POLLING_ENABLED=false to start an instance in
+# standby (running but not polling and not holding device connections) — e.g. a
+# dev box that must not latch onto a device that a prod instance owns.
+POLLING_ENABLED_DEFAULT = os.getenv("SLMM_POLLING_ENABLED", "true").lower() == "true"
+

 class BackgroundPoller:
    """
    Background task that continuously polls NL43 devices and updates status cache.

    Features:
-    - Per-device configurable poll intervals (10-3600 seconds)
+    - Per-device configurable poll intervals (30 seconds to 6 hours)
    - Automatic offline detection (marks unreachable after 3 consecutive failures)
    - Dynamic sleep intervals based on device configurations
    - Graceful shutdown on application stop
@@ -36,6 +43,9 @@ class BackgroundPoller:
        self._task: Optional[asyncio.Task] = None
        self._running = False
        self._logger = logger
+        self._last_cleanup = None  # Track last log cleanup time
+        self._last_pool_log = None  # Track last connection pool heartbeat log
+        self._active = POLLING_ENABLED_DEFAULT  # Global polling on/off (standby toggle)

    async def start(self):
        """Start the background polling task."""
@@ -68,15 +78,75 @@ class BackgroundPoller:

        self._logger.info("Background poller stopped")

+    def is_active(self) -> bool:
+        """Whether background polling is currently active (vs standby)."""
+        return self._active
+
+    async def set_active(self, active: bool):
+        """Globally enable/disable polling at runtime.
+
+        When deactivated, the loop stays alive but polls nothing and releases all
+        device connections, so this SLMM instance stops occupying the devices'
+        single connection slots (e.g. so a prod instance can take over). Runtime
+        state only — on restart the instance returns to SLMM_POLLING_ENABLED.
+        """
+        self._active = active
+        if active:
+            self._logger.info("[SYSTEM] Background polling ACTIVATED")
+        else:
+            self._logger.info("[SYSTEM] Background polling DEACTIVATED (standby) — releasing connections")
+            await self._release_all_connections()
+
+    async def _release_all_connections(self):
+        """Gracefully close every pooled device connection (no-op if none)."""
+        from app.services import _connection_pool
+        for device_key in list(_connection_pool.get_stats().get("connections", {})):
+            await _connection_pool.discard(device_key)
+
    async def _poll_loop(self):
        """Main polling loop that runs continuously."""
        self._logger.info("Background polling loop started")

        while self._running:
+            if self._active:
+                try:
+                    await self._poll_all_devices()
+                except Exception as e:
+                    self._logger.error(f"Error in poll loop: {e}", exc_info=True)
+            else:
+                # Standby: poll nothing and hold no device connection slots,
+                # so another SLMM instance (e.g. prod) can talk to the devices.
+                try:
+                    await self._release_all_connections()
+                except Exception as e:
+                    self._logger.warning(f"Standby connection release failed: {e}")
+
+            # Run log cleanup once per hour
            try:
-                await self._poll_all_devices()
+                now = datetime.utcnow()
+                if self._last_cleanup is None or (now - self._last_cleanup).total_seconds() > 3600:
+                    cleanup_old_logs()
+                    self._last_cleanup = now
            except Exception as e:
-                self._logger.error(f"Error in poll loop: {e}", exc_info=True)
+                self._logger.warning(f"Log cleanup failed: {e}")
+
+            # Log connection pool status every 15 minutes
+            try:
+                now = datetime.utcnow()
+                if self._last_pool_log is None or (now - self._last_pool_log).total_seconds() > 900:
+                    from app.services import _connection_pool
+                    stats = _connection_pool.get_stats()
+                    conns = stats.get("connections", {})
+                    if conns:
+                        for key, c in conns.items():
+                            self._logger.info(
+                                f"[POOL] {key} — age={c['age_seconds']}s idle={c['idle_seconds']}s alive={c['alive']}"
+                            )
+                    else:
+                        self._logger.info("[POOL] No active connections in pool")
+                    self._last_pool_log = now
+            except Exception as e:
+                self._logger.warning(f"Pool status log failed: {e}")

            # Calculate dynamic sleep interval
            sleep_time = self._calculate_sleep_interval()
@@ -108,10 +178,19 @@ class BackgroundPoller:
            now = datetime.utcnow()
            polled_count = 0

+            from app.monitor import monitor_manager
+
            for cfg in configs:
                if not self._running:
                    break

+                # Skip units with an active live monitor: it polls them at ~1Hz and
+                # keeps the status cache fresh, so a redundant background poll would just
+                # add load/lock-contention on the device's single connection.
+                if monitor_manager.is_active(cfg.unit_id):
+                    self._logger.debug(f"Skipping {cfg.unit_id} — live monitor active")
+                    continue
+
                # Get current status
                status = db.query(NL43Status).filter_by(unit_id=cfg.unit_id).first()

@@ -205,6 +284,71 @@ class BackgroundPoller:
            db.commit()
            self._logger.info(f"✓ Successfully polled {unit_id}")

+            # Log to device log
+            log_device_event(
+                unit_id, "INFO", "POLL",
+                f"Poll success: state={snap.measurement_state}, Leq={snap.leq}, Lp={snap.lp}",
+                db
+            )
+
+            # Check if device is measuring but has no start time recorded
+            # This happens if measurement was started before SLMM began polling
+            # or after a service restart
+            status = db.query(NL43Status).filter_by(unit_id=unit_id).first()
+
+            # Reset the sync flag when measurement stops (so next measurement can sync)
+            if status and status.measurement_state != "Start":
+                if status.start_time_sync_attempted:
+                    status.start_time_sync_attempted = False
+                    db.commit()
+                    self._logger.debug(f"Reset FTP sync flag for {unit_id} (measurement stopped)")
+                    log_device_event(unit_id, "DEBUG", "STATE", "Measurement stopped, reset FTP sync flag", db)
+
+            # Attempt FTP sync if:
+            # - Device is measuring
+            # - No start time recorded
+            # - FTP sync not already attempted for this measurement
+            # - FTP is configured
+            if (status and
+                status.measurement_state == "Start" and
+                status.measurement_start_time is None and
+                not status.start_time_sync_attempted and
+                cfg.ftp_enabled and
+                cfg.ftp_username and
+                cfg.ftp_password):
+
+                self._logger.info(
+                    f"Device {unit_id} is measuring but has no start time - "
+                    f"attempting FTP sync"
+                )
+                log_device_event(unit_id, "INFO", "SYNC", "Attempting FTP sync for measurement start time", db)
+
+                # Mark that we attempted sync (prevents repeated attempts on failure)
+                status.start_time_sync_attempted = True
+                db.commit()
+
+                try:
+                    synced = await sync_measurement_start_time_from_ftp(
+                        unit_id=unit_id,
+                        host=cfg.host,
+                        tcp_port=cfg.tcp_port,
+                        ftp_port=cfg.ftp_port or 21,
+                        ftp_username=cfg.ftp_username,
+                        ftp_password=cfg.ftp_password,
+                        db=db
+                    )
+                    if synced:
+                        self._logger.info(f"✓ FTP sync succeeded for {unit_id}")
+                        log_device_event(unit_id, "INFO", "SYNC", "FTP sync succeeded - measurement start time updated", db)
+                    else:
+                        self._logger.warning(f"FTP sync returned False for {unit_id}")
+                        log_device_event(unit_id, "WARNING", "SYNC", "FTP sync returned False", db)
+                except Exception as sync_err:
+                    self._logger.warning(
+                        f"FTP sync failed for {unit_id}: {sync_err}"
+                    )
+                    log_device_event(unit_id, "ERROR", "SYNC", f"FTP sync failed: {sync_err}", db)
+
        except Exception as e:
            # Failure - increment counter and potentially mark offline
            status.consecutive_failures += 1
@@ -217,11 +361,13 @@ class BackgroundPoller:
                    self._logger.warning(
                        f"Device {unit_id} marked unreachable after {status.consecutive_failures} failures: {error_msg}"
                    )
+                    log_device_event(unit_id, "ERROR", "POLL", f"Device marked UNREACHABLE after {status.consecutive_failures} failures: {error_msg}", db)
                status.is_reachable = False
            else:
                self._logger.warning(
                    f"Poll failed for {unit_id} (attempt {status.consecutive_failures}/3): {error_msg}"
                )
+                log_device_event(unit_id, "WARNING", "POLL", f"Poll failed (attempt {status.consecutive_failures}/3): {error_msg}", db)

            db.commit()

@@ -230,8 +376,8 @@ class BackgroundPoller:
        Calculate the next sleep interval based on all device poll intervals.

        Returns a dynamic sleep time that ensures responsive polling:
-        - Minimum 10 seconds (prevents tight loops)
-        - Maximum 30 seconds (ensures responsiveness)
+        - Minimum 30 seconds (prevents tight loops)
+        - Maximum 300 seconds / 5 minutes (ensures reasonable responsiveness for long intervals)
        - Generally half the minimum device interval

        Returns:
@@ -245,14 +391,15 @@ class BackgroundPoller:
            ).all()

            if not configs:
-                return 30  # Default sleep when no devices configured
+                return 60  # Default sleep when no devices configured

            # Get all intervals
            intervals = [cfg.poll_interval_seconds or 60 for cfg in configs]
            min_interval = min(intervals)

-            # Use half the minimum interval, but cap between 10-30 seconds
-            sleep_time = max(10, min(30, min_interval // 2))
+            # Use half the minimum interval, clamped to 30-300 seconds.
+            # This allows longer sleep times when polling intervals are long (e.g., hourly)
+            sleep_time = max(30, min(300, min_interval // 2))

            return sleep_time

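Review note: the new clamp is easy to check in isolation. The sketch below is a standalone restatement rather than the poller's own method: `next_sleep` is a hypothetical helper name, and it takes a plain list of intervals instead of querying `NL43Config`.

```python
from typing import List, Optional


def next_sleep(intervals: List[Optional[int]]) -> int:
    """Hypothetical standalone version of the poller's sleep calculation."""
    if not intervals:
        return 60  # no devices configured
    # A missing (None or 0) interval falls back to 60 s, as in the diff
    effective = [i or 60 for i in intervals]
    # Half the shortest interval, clamped to the 30-300 s window
    return max(30, min(300, min(effective) // 2))


print(next_sleep([]))            # 60
print(next_sleep([10, 3600]))    # 30  (10 // 2 = 5, raised to the floor)
print(next_sleep([120]))         # 60
print(next_sleep([3600, 1800]))  # 300 (1800 // 2 = 900, capped)
```

With the 30 s floor, a device configured near the documented 10 s minimum would apparently be visited no more often than every 30 s, assuming the loop polls due devices once per wake-up.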
@@ -0,0 +1,277 @@
+"""
+Per-device logging system.
+
+Provides dual output: database entries for structured queries and file logs for backup.
+Each device gets its own log file in data/logs/{unit_id}.log with rotation.
+"""
+
+import logging
+import os
+from datetime import datetime, timedelta
+from logging.handlers import RotatingFileHandler
+from pathlib import Path
+from typing import Optional
+
+from sqlalchemy.orm import Session
+
+from app.database import SessionLocal
+from app.models import DeviceLog
+
+# Configure base logger
+logger = logging.getLogger(__name__)
+
+# Log directory (persisted in Docker volume)
+LOG_DIR = Path(os.path.dirname(os.path.dirname(__file__))) / "data" / "logs"
+LOG_DIR.mkdir(parents=True, exist_ok=True)
+
+# Per-device file loggers (cached)
+_device_file_loggers: dict = {}
+
+# Log retention (days)
+LOG_RETENTION_DAYS = int(os.getenv("LOG_RETENTION_DAYS", "7"))
+
+
+def _get_file_logger(unit_id: str) -> logging.Logger:
+    """Get or create a file logger for a specific device."""
+    if unit_id in _device_file_loggers:
+        return _device_file_loggers[unit_id]
+
+    # Create device-specific logger
+    device_logger = logging.getLogger(f"device.{unit_id}")
+    device_logger.setLevel(logging.DEBUG)
+
+    # Avoid duplicate handlers
+    if not device_logger.handlers:
+        # Create rotating file handler (5 MB max, keep 3 backups)
+        log_file = LOG_DIR / f"{unit_id}.log"
+        handler = RotatingFileHandler(
+            log_file,
+            maxBytes=5 * 1024 * 1024,  # 5 MB
+            backupCount=3,
+            encoding="utf-8"
+        )
+        handler.setLevel(logging.DEBUG)
+
+        # Format: timestamp [LEVEL] [CATEGORY] message
+        formatter = logging.Formatter(
+            "%(asctime)s [%(levelname)s] [%(category)s] %(message)s",
+            datefmt="%Y-%m-%d %H:%M:%S"
+        )
+        handler.setFormatter(formatter)
+        device_logger.addHandler(handler)
+
+        # Don't propagate to root logger
+        device_logger.propagate = False
+
+    _device_file_loggers[unit_id] = device_logger
+    return device_logger
+
+
+def log_device_event(
+    unit_id: str,
+    level: str,
+    category: str,
+    message: str,
+    db: Optional[Session] = None
+):
+    """
+    Log an event for a specific device.
+
+    Writes to both:
+    1. Database (DeviceLog table) for structured queries
+    2. File (data/logs/{unit_id}.log) for backup/debugging
+
+    Args:
+        unit_id: Device identifier
+        level: Log level (DEBUG, INFO, WARNING, ERROR)
+        category: Event category (TCP, FTP, POLL, COMMAND, STATE, SYNC, GENERAL)
+        message: Log message
+        db: Optional database session (creates one if not provided)
+    """
+    timestamp = datetime.utcnow()
+
+    # Write to file log
+    try:
+        file_logger = _get_file_logger(unit_id)
+        log_func = getattr(file_logger, level.lower(), file_logger.info)
+        # Pass category as extra for formatter
+        log_func(message, extra={"category": category})
+    except Exception as e:
+        logger.warning(f"Failed to write file log for {unit_id}: {e}")
+
+    # Write to database
+    close_db = False
+    try:
+        if db is None:
+            db = SessionLocal()
+            close_db = True
+
+        log_entry = DeviceLog(
+            unit_id=unit_id,
+            timestamp=timestamp,
+            level=level.upper(),
+            category=category.upper(),
+            message=message
+        )
+        db.add(log_entry)
+        db.commit()
+
+    except Exception as e:
+        logger.warning(f"Failed to write DB log for {unit_id}: {e}")
+        if db:
+            db.rollback()
+    finally:
+        if close_db and db:
+            db.close()
+
+
+def cleanup_old_logs(retention_days: Optional[int] = None, db: Optional[Session] = None):
+    """
+    Delete log entries older than retention period.
+
+    Args:
+        retention_days: Days to retain (default: LOG_RETENTION_DAYS env var or 7)
+        db: Optional database session
+    """
+    if retention_days is None:
+        retention_days = LOG_RETENTION_DAYS
+
+    cutoff = datetime.utcnow() - timedelta(days=retention_days)
+
+    close_db = False
+    try:
+        if db is None:
+            db = SessionLocal()
+            close_db = True
+
+        deleted = db.query(DeviceLog).filter(DeviceLog.timestamp < cutoff).delete()
+        db.commit()
+
+        if deleted > 0:
+            logger.info(f"Cleaned up {deleted} log entries older than {retention_days} days")
+
+    except Exception as e:
+        logger.error(f"Failed to cleanup old logs: {e}")
+        if db:
+            db.rollback()
+    finally:
+        if close_db and db:
+            db.close()
+
+
+def get_device_logs(
+    unit_id: str,
+    limit: int = 100,
+    offset: int = 0,
+    level: Optional[str] = None,
+    category: Optional[str] = None,
+    since: Optional[datetime] = None,
+    db: Optional[Session] = None
+) -> list:
+    """
+    Query log entries for a specific device.
+
+    Args:
+        unit_id: Device identifier
+        limit: Max entries to return (default: 100)
+        offset: Number of entries to skip (default: 0)
+        level: Filter by level (DEBUG, INFO, WARNING, ERROR)
+        category: Filter by category (TCP, FTP, POLL, COMMAND, STATE, SYNC, GENERAL)
+        since: Filter entries after this timestamp
+        db: Optional database session
+
+    Returns:
+        List of log entries as dicts
+    """
+    close_db = False
+    try:
+        if db is None:
+            db = SessionLocal()
+            close_db = True
+
+        query = db.query(DeviceLog).filter(DeviceLog.unit_id == unit_id)
+
+        if level:
+            query = query.filter(DeviceLog.level == level.upper())
+        if category:
+            query = query.filter(DeviceLog.category == category.upper())
+        if since:
+            query = query.filter(DeviceLog.timestamp >= since)
+
+        # Order by newest first
+        query = query.order_by(DeviceLog.timestamp.desc())
+
+        # Apply pagination
+        entries = query.offset(offset).limit(limit).all()
+
+        return [
+            {
+                "id": e.id,
+                "timestamp": e.timestamp.isoformat() + "Z",
+                "level": e.level,
+                "category": e.category,
+                "message": e.message
+            }
+            for e in entries
+        ]
+
+    finally:
+        if close_db and db:
+            db.close()
+
+
+def get_log_stats(unit_id: str, db: Optional[Session] = None) -> dict:
+    """
+    Get log statistics for a device.
+
+    Returns:
+        Dict with counts by level and category
+    """
+    close_db = False
+    try:
+        if db is None:
+            db = SessionLocal()
+            close_db = True
+
+        total = db.query(DeviceLog).filter(DeviceLog.unit_id == unit_id).count()
+
+        # Count by level
+        level_counts = {}
+        for level in ["DEBUG", "INFO", "WARNING", "ERROR"]:
+            count = db.query(DeviceLog).filter(
+                DeviceLog.unit_id == unit_id,
+                DeviceLog.level == level
+            ).count()
+            if count > 0:
+                level_counts[level] = count
+
+        # Count by category
+        category_counts = {}
+        for category in ["TCP", "FTP", "POLL", "COMMAND", "STATE", "SYNC", "GENERAL"]:
+            count = db.query(DeviceLog).filter(
+                DeviceLog.unit_id == unit_id,
+                DeviceLog.category == category
+            ).count()
+            if count > 0:
+                category_counts[category] = count
+
+        # Get oldest and newest
+        oldest = db.query(DeviceLog).filter(
+            DeviceLog.unit_id == unit_id
+        ).order_by(DeviceLog.timestamp.asc()).first()
+
+        newest = db.query(DeviceLog).filter(
+            DeviceLog.unit_id == unit_id
+        ).order_by(DeviceLog.timestamp.desc()).first()
+
+        return {
+            "total": total,
+            "by_level": level_counts,
+            "by_category": category_counts,
+            "oldest": oldest.timestamp.isoformat() + "Z" if oldest else None,
+            "newest": newest.timestamp.isoformat() + "Z" if newest else None
+        }
+
+    finally:
+        if close_db and db:
+            db.close()
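Review note: the file logger relies on a custom `%(category)s` field that is supplied per call through `extra`. The sketch below reproduces just that mechanism with the standard library. An in-memory stream stands in for the `RotatingFileHandler`, and `emit` and the `demo-unit` id are hypothetical names used only for illustration.

```python
import io
import logging

# In-memory stream stands in for the RotatingFileHandler used in the diff
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("[%(levelname)s] [%(category)s] %(message)s"))

demo_logger = logging.getLogger("device.demo-unit")  # hypothetical unit id
demo_logger.setLevel(logging.DEBUG)
demo_logger.addHandler(handler)
demo_logger.propagate = False


def emit(level: str, category: str, message: str) -> None:
    # Unknown level names fall back to info(), mirroring log_device_event
    log_func = getattr(demo_logger, level.lower(), demo_logger.info)
    log_func(message, extra={"category": category})


emit("WARNING", "POLL", "Poll failed (attempt 1/3): timeout")
emit("NOTICE", "TCP", "connected")  # no Logger.notice, so logged at INFO
output = stream.getvalue()
print(output)
```

Because the format string names `category`, a record emitted on such a logger without that `extra` key cannot be formatted, so every call path needs to pass it.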
@@ -29,23 +29,49 @@ logger.info("Database tables initialized")
@asynccontextmanager
 async def lifespan(app: FastAPI):
    """Manage application lifecycle - startup and shutdown events."""
+    from app.services import _connection_pool
+
    # Startup
+    logger.info("Starting TCP connection pool cleanup task...")
+    _connection_pool.start_cleanup()
    logger.info("Starting background poller...")
    await poller.start()
    logger.info("Background poller started")

+    # Auto-start keepalive live monitors for units configured for 24/7 monitoring
+    # (monitor_enabled). This is what keeps alerting running unattended across
+    # restarts — without it a feed only runs while someone has the live view open.
+    try:
+        from app.monitor import monitor_manager
+        from app.database import SessionLocal
+        from app.models import NL43Config
+        db = SessionLocal()
+        try:
+            units = db.query(NL43Config).filter_by(monitor_enabled=True, tcp_enabled=True).all()
+            for cfg in units:
+                m = await monitor_manager.get(cfg.unit_id)
+                await m.set_keepalive(True)
+                logger.info(f"Auto-started keepalive monitor for {cfg.unit_id}")
+        finally:
+            db.close()
+    except Exception as e:
+        logger.error(f"Failed to auto-start monitors: {e}")
+
    yield  # Application runs

    # Shutdown
    logger.info("Stopping background poller...")
    await poller.stop()
    logger.info("Background poller stopped")
+    logger.info("Closing TCP connection pool...")
+    await _connection_pool.close_all()
+    logger.info("TCP connection pool closed")


 app = FastAPI(
    title="SLMM NL43 Addon",
    description="Standalone module for NL43 configuration and status APIs with background polling",
-    version="0.2.0",
+    version="0.4.0",
    lifespan=lifespan,
 )

@@ -69,12 +95,12 @@ app.include_router(routers.router)

@app.get("/", response_class=HTMLResponse)
 def index(request: Request):
-    return templates.TemplateResponse("index.html", {"request": request})
+    return templates.TemplateResponse(request, "index.html")


@app.get("/roster", response_class=HTMLResponse)
 def roster(request: Request):
-    return templates.TemplateResponse("roster.html", {"request": request})
+    return templates.TemplateResponse(request, "roster.html")


@app.get("/health")
@@ -85,10 +111,14 @@ async def health():

@app.get("/health/devices")
 async def health_devices():
-    """Enhanced health check that tests device connectivity."""
+    """Enhanced health check that tests device connectivity.
+
+    Uses the connection pool to avoid unnecessary TCP handshakes — if a
+    cached connection exists and is alive, the device is reachable.
+    """
    from sqlalchemy.orm import Session
    from app.database import SessionLocal
-    from app.services import NL43Client
+    from app.services import _connection_pool
    from app.models import NL43Config

    db: Session = SessionLocal()
@@ -98,7 +128,7 @@ async def health_devices():
        configs = db.query(NL43Config).filter_by(tcp_enabled=True).all()

        for cfg in configs:
-            client = NL43Client(cfg.host, cfg.tcp_port, timeout=2.0, ftp_username=cfg.ftp_username, ftp_password=cfg.ftp_password)
+            device_key = f"{cfg.host}:{cfg.tcp_port}"
            status = {
                "unit_id": cfg.unit_id,
                "host": cfg.host,
@@ -108,14 +138,22 @@ async def health_devices():
            }

            try:
-                # Try to connect (don't send command to avoid rate limiting issues)
-                import asyncio
-                reader, writer = await asyncio.wait_for(
-                    asyncio.open_connection(cfg.host, cfg.tcp_port), timeout=2.0
-                )
-                writer.close()
-                await writer.wait_closed()
-                status["reachable"] = True
+                # Check if pool already has a live connection (zero-cost check)
+                pool_stats = _connection_pool.get_stats()
+                conn_info = pool_stats["connections"].get(device_key)
+                if conn_info and conn_info["alive"]:
+                    status["reachable"] = True
+                    status["source"] = "pool"
+                else:
+                    # No cached connection — do a lightweight acquire/release
+                    # This opens a connection if needed but keeps it in the pool
+                    import asyncio
+                    reader, writer, from_cache = await _connection_pool.acquire(
+                        device_key, cfg.host, cfg.tcp_port, timeout=2.0
+                    )
+                    await _connection_pool.release(device_key, reader, writer, cfg.host, cfg.tcp_port)
+                    status["reachable"] = True
+                    status["source"] = "cached" if from_cache else "new"
            except Exception as e:
                status["error"] = str(type(e).__name__)
                logger.warning(f"Device {cfg.unit_id} health check failed: {e}")
@@ -1,4 +1,4 @@
-from sqlalchemy import Column, String, DateTime, Boolean, Integer, Text, func
+from sqlalchemy import Column, String, DateTime, Boolean, Integer, Float, Text, func
 from app.database import Base


@@ -23,6 +23,10 @@ class NL43Config(Base):
    poll_interval_seconds = Column(Integer, nullable=True, default=60)  # Polling interval (10-3600 seconds)
    poll_enabled = Column(Boolean, default=True)  # Enable/disable background polling for this device

+    # Live monitor (fan-out DOD feed). Keepalive runs it 24/7 even with no viewer,
+    # which is what makes alerting continuous. On by default; toggleable from the UI.
+    monitor_enabled = Column(Boolean, default=True)
+

 class NL43Status(Base):
    """
@@ -41,6 +45,8 @@ class NL43Status(Base):
    lmax = Column(String, nullable=True)  # Maximum level
    lmin = Column(String, nullable=True)  # Minimum level
    lpeak = Column(String, nullable=True)  # Peak level
+    ln1 = Column(String, nullable=True)  # Percentile slot LN1 (configurable; device default L5, contract L1)
+    ln2 = Column(String, nullable=True)  # Percentile slot LN2 (configurable; device default L10)
    battery_level = Column(String, nullable=True)
    power_source = Column(String, nullable=True)
    sd_remaining_mb = Column(String, nullable=True)
@@ -53,3 +59,93 @@ class NL43Status(Base):
    last_poll_attempt = Column(DateTime, nullable=True)  # Last time background poller attempted to poll
    last_success = Column(DateTime, nullable=True)  # Last successful poll timestamp
    last_error = Column(Text, nullable=True)  # Last error message (truncated to 500 chars)
+
+    # FTP start time sync tracking
+    start_time_sync_attempted = Column(Boolean, default=False)  # True if FTP sync was attempted for current measurement
+
+
+class DeviceLog(Base):
+    """
+    Per-device log entries for debugging and audit trail.
+    Stores events like commands, state changes, errors, and FTP operations.
+    """
+
+    __tablename__ = "device_logs"
+
+    id = Column(Integer, primary_key=True, autoincrement=True)
+    unit_id = Column(String, index=True, nullable=False)
+    timestamp = Column(DateTime, default=func.now(), index=True)
+    level = Column(String, default="INFO")  # DEBUG, INFO, WARNING, ERROR
+    category = Column(String, default="GENERAL")  # TCP, FTP, POLL, COMMAND, STATE, SYNC
+    message = Column(Text, nullable=False)
+
+
+class AlertRule(Base):
+    """A threshold-alert rule evaluated against a unit's live monitor feed.
+
+    Source-agnostic: today it runs over the DOD monitor; the same rule transfers
+    unchanged if a unit's feed is later sourced from FTP intervals.
+    """
+
+    __tablename__ = "alert_rules"
+
+    id = Column(Integer, primary_key=True, autoincrement=True)
+    unit_id = Column(String, index=True, nullable=False)
+    name = Column(String, nullable=False, default="Alert")
+    metric = Column(String, nullable=False, default="lp")  # lp/leq/lmax/lmin/lpeak/ln1/ln2
+    comparison = Column(String, nullable=False, default="above")  # above | below
+    threshold_db = Column(Float, nullable=False)
+    duration_s = Column(Integer, nullable=False, default=0)       # sustained seconds (0 = instant)
+    clear_margin_db = Column(Float, nullable=False, default=2.0)  # hysteresis band
+    cooldown_s = Column(Integer, nullable=False, default=300)     # min seconds between onsets
+    # Optional time-of-day scoping (local time). schedule_start/end as "HH:MM";
+    # null = always active. schedule_days = CSV of 0-6 (Mon=0); null = every day.
+    schedule_start = Column(String, nullable=True)
+    schedule_end = Column(String, nullable=True)
+    schedule_days = Column(String, nullable=True)
+    channels = Column(String, nullable=False, default="log")  # CSV: log,email,sms
+    recipients = Column(Text, nullable=True)                  # CSV of emails/phones
+    enabled = Column(Boolean, default=True)
+    created_at = Column(DateTime, default=func.now())
+
+
+class AlertEvent(Base):
+    """A fired alert (onset → clear), for history / inbox / acknowledgement."""
+
+    __tablename__ = "alert_events"
+
+    id = Column(Integer, primary_key=True, autoincrement=True)
+    rule_id = Column(Integer, index=True, nullable=False)
+    unit_id = Column(String, index=True, nullable=False)
+    rule_name = Column(String, nullable=True)
+    metric = Column(String, nullable=False)
+    threshold_db = Column(Float, nullable=False)
+    onset_at = Column(DateTime, default=func.now(), index=True)
+    onset_value = Column(Float, nullable=True)
+    peak_value = Column(Float, nullable=True)
+    clear_at = Column(DateTime, nullable=True)
+    status = Column(String, default="active")  # active | cleared
+    acknowledged_at = Column(DateTime, nullable=True)
+    acknowledged_by = Column(String, nullable=True)
+    notes = Column(Text, nullable=True)
+
+
+class NL43Reading(Base):
+    """Downsampled time-series of live-monitor readings, for the live-chart
+    backfill (so a viewer sees recent trend on open, not a blank chart).
+
+    Viewing only — NOT the report source. Reports use the device's authoritative
+    FTP .rnd intervals. This is a short, capped trail (one row/minute, pruned to
+    a retention window) fed by the monitor's keepalive poll loop.
+    """
+
+    __tablename__ = "nl43_readings"
+
+    id = Column(Integer, primary_key=True, autoincrement=True)
+    unit_id = Column(String, index=True, nullable=False)
+    timestamp = Column(DateTime, default=func.now(), index=True)
+    lp = Column(String, nullable=True)
+    leq = Column(String, nullable=True)
+    lmax = Column(String, nullable=True)
+    ln1 = Column(String, nullable=True)
+    ln2 = Column(String, nullable=True)
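Review note: `AlertRule` combines a sustained-duration window, a hysteresis band, and a cooldown. The evaluator in `app.alerts` is not part of this excerpt, so the following is only a sketch of how those three fields could interact for an `above` rule. `RuleState` and `update` are hypothetical names, and the exact comparison operators are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RuleState:
    """Hypothetical per-rule state; the real evaluator in app.alerts is not shown."""
    threshold_db: float
    duration_s: int = 0
    clear_margin_db: float = 2.0
    cooldown_s: int = 300
    active: bool = False
    breach_since: Optional[float] = None
    last_onset: Optional[float] = None

    def update(self, value: float, now: float) -> Optional[str]:
        """Feed one sample of an 'above' rule; return 'onset', 'clear', or None."""
        if self.active:
            # Hysteresis: clear only once the value drops below the margin band
            if value < self.threshold_db - self.clear_margin_db:
                self.active = False
                self.breach_since = None
                return "clear"
            return None
        if value <= self.threshold_db:
            self.breach_since = None  # breach interrupted, restart the timer
            return None
        if self.breach_since is None:
            self.breach_since = now
        sustained = now - self.breach_since >= self.duration_s
        cooled = self.last_onset is None or now - self.last_onset >= self.cooldown_s
        if sustained and cooled:
            self.active = True
            self.last_onset = now
            return "onset"
        return None


rule = RuleState(threshold_db=85.0, duration_s=30, cooldown_s=300)
samples = [(0, 86.0), (10, 87.0), (30, 88.0), (40, 84.0), (50, 82.5), (60, 90.0)]
events = [(t, rule.update(v, t)) for t, v in samples]
print([e for e in events if e[1]])
```

In this run the reading at 84.0 dB does not clear the alert because it is still inside the 2 dB band; only the drop to 82.5 dB does.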
@@ -0,0 +1,322 @@
+"""
+Per-device live monitor (fan-out hub).
+
+ONE DOD poll loop per device, broadcast to many subscribers:
+- browser WebSocket clients (live view) — they no longer each open their own
+  device stream, so the NL43's single-connection limit stops causing the
+  "second viewer sees nothing" contention.
+- the alert evaluator (threshold alerts), which can keep a device's feed running
+  even with no browser attached.
+- persistence (each snapshot is written to NL43Status, like the poller does).
+
+The device's one TCP connection is respected: every poll goes through the same
+per-device lock + connection pool in services.py, so the monitor, the background
+poller, and on-demand commands all serialize safely.
+"""
+
+import asyncio
+import logging
+import os
+from datetime import datetime
+from typing import Dict, Optional, Set
+
+from app.database import SessionLocal
+from app.models import NL43Config, NL43Status
+from app.services import NL43Client, persist_snapshot
+from app.alerts import alert_evaluator
+
+logger = logging.getLogger(__name__)
+
+# Extra idle between DOD polls WHEN A BROWSER IS WATCHING. The 1s device rate-limit
+# already paces consecutive DOD? commands, so this just needs to be small — the
+# rate-limit is the real floor (~1.25s/poll effective).
+MONITOR_POLL_INTERVAL = float(os.getenv("MONITOR_POLL_INTERVAL", "0.25"))
+
+# Idle cadence when NO browser is subscribed and the feed is only kept alive for
+# alerting. Same data, ~8x fewer polls -> ~8x less cellular traffic on a metered
+# SIM (~1 GB/device/month at full rate -> ~125 MB). NOTE: this also sets the alert
+# sampling resolution when nobody is watching, so keep it <= the smallest alert
+# duration_s you rely on (default 10s comfortably catches a "sustained 30/60s" rule).
+MONITOR_IDLE_POLL_INTERVAL = float(os.getenv("MONITOR_IDLE_POLL_INTERVAL", "10"))
+
+# Exponential backoff once the device is unreachable, so a powered-off / asleep /
+# out-of-signal device stops churning reconnects every cycle (log spam + a trickle
+# of wasted cellular data on failed SYNs). delay = min(BASE * 2**(fails-1), MAX),
+# reset to full-rate on the first good poll. While a browser is actively watching we
+# cap the backoff lower (WATCHED_MAX) so a recovery surfaces quickly for the viewer.
+MONITOR_BACKOFF_BASE_S = float(os.getenv("MONITOR_BACKOFF_BASE_S", "1"))
+MONITOR_BACKOFF_MAX_S = float(os.getenv("MONITOR_BACKOFF_MAX_S", "60"))
+MONITOR_BACKOFF_WATCHED_MAX_S = float(os.getenv("MONITOR_BACKOFF_WATCHED_MAX_S", "5"))
+
+# How often to refresh the run state (Measure?). It changes rarely, so we cache it
+# and skip that second rate-limited command on most polls — roughly halving the
+# per-update latency (~2.5s -> ~1.3s).
+MONITOR_STATE_REFRESH_S = float(os.getenv("MONITOR_STATE_REFRESH_S", "30"))
+
+# Downsampled trail for the live-chart backfill: store one reading per
+# TRAIL_SAMPLE_S and keep TRAIL_RETENTION_HOURS of it (pruned). Viewing only —
+# reports use the device's FTP .rnd data, not this.
+TRAIL_SAMPLE_S = float(os.getenv("MONITOR_TRAIL_SAMPLE_S", "60"))
+TRAIL_RETENTION_HOURS = float(os.getenv("MONITOR_TRAIL_RETENTION_HOURS", "24"))
+
+# If nothing has been broadcast in this many seconds (e.g. device offline and
+# silent), send a keepalive frame so reverse proxies don't drop the idle WS.
+MONITOR_HEARTBEAT_S = float(os.getenv("MONITOR_HEARTBEAT_S", "25"))
+
+
+def _snapshot_payload(snap, unit_id: str, measurement_start_time) -> dict:
+    """Build the broadcast payload — same shape as the DRD stream, but DOD-sourced
+    so it carries ln1/ln2 (which DRD cannot)."""
+    return {
+        "unit_id": unit_id,
+        "timestamp": datetime.utcnow().isoformat(),
+        "measurement_state": snap.measurement_state,
+        "measurement_start_time": measurement_start_time,
+        "counter": snap.counter,
+        "lp": snap.lp,
+        "leq": snap.leq,
+        "lmax": snap.lmax,
+        "lmin": snap.lmin,
+        "lpeak": snap.lpeak,
+        "ln1": snap.ln1,
+        "ln2": snap.ln2,
+        "raw_payload": snap.raw_payload,
+    }
+
+
+class DeviceMonitor:
+    """Owns a single DOD poll loop for one device and fans each snapshot out to
+    all subscribers. Runs while it has at least one browser subscriber OR the
+    server-side keep-alive (alerting) flag is set."""
+
+    def __init__(self, unit_id: str):
+        self.unit_id = unit_id
+        self._subscribers: Set[asyncio.Queue] = set()
+        self._keepalive = False
+        self._task: Optional[asyncio.Task] = None
+        self._lock = asyncio.Lock()
+        self._last_payload: Optional[dict] = None  # replayed to new subscribers
+        self._consec_fail = 0
+        self._reachable = True  # last broadcast reachability (for transition frames)
+        self._cached_state: Optional[str] = None  # run state, refreshed periodically
+        self._last_state_refresh = 0.0
+        self._last_trail_store = 0.0  # downsample throttle for the backfill trail
+
+    @property
+    def running(self) -> bool:
+        return self._task is not None and not self._task.done()
+
+    def subscriber_count(self) -> int:
+        return len(self._subscribers)
+
+    def _has_demand(self) -> bool:
+        return bool(self._subscribers) or self._keepalive
+
+    def _ensure_task(self) -> None:
+        if self._task is None or self._task.done():
+            self._task = asyncio.create_task(self._run())
+
+    async def subscribe(self) -> asyncio.Queue:
+        q: asyncio.Queue = asyncio.Queue(maxsize=5)
+        async with self._lock:
+            self._subscribers.add(q)
+            # Replay the last frame so a client connecting mid-stream sees data
+            # (or the current 'unreachable' state) immediately, not after a poll.
+            if self._last_payload is not None:
+                try:
+                    q.put_nowait(self._last_payload)
+                except asyncio.QueueFull:
+                    pass
+            self._ensure_task()
+        return q
+
+    async def unsubscribe(self, q: asyncio.Queue) -> None:
+        async with self._lock:
+            self._subscribers.discard(q)
+
+    async def set_keepalive(self, on: bool) -> None:
+        async with self._lock:
+            self._keepalive = on
+            if on:
+                self._ensure_task()
+
+    async def _run(self) -> None:
+        logger.info(f"[MONITOR] {self.unit_id}: feed started")
+        loop = asyncio.get_running_loop()
+        last_send = loop.time()
+        try:
+            while self._has_demand():
+                snap, mst = await self._poll_once()
+                if snap is not None:
+                    if not self._reachable:
+                        # Recovered from an outage — clear the connectivity alert.
+                        try:
+                            await alert_evaluator.device_online(self.unit_id)
+                        except Exception as e:
+                            logger.warning(f"[MONITOR] {self.unit_id}: online alert failed: {e}")
+                    self._consec_fail = 0
+                    self._reachable = True
+                    payload = _snapshot_payload(snap, self.unit_id, mst)
+                    payload["feed_status"] = "ok"
+                    self._broadcast(payload)
+                    last_send = loop.time()
+                    try:
+                        await alert_evaluator.evaluate(self.unit_id, snap)
+                    except Exception as e:
+                        logger.warning(f"[MONITOR] {self.unit_id}: alert eval failed: {e}")
+                else:
+                    # Tell clients the device went offline — once, on transition, after a
+                    # few failures so a momentary blip doesn't flap the UI. Same edge
+                    # raises the device-offline alert.
+                    self._consec_fail += 1
+                    if self._reachable and self._consec_fail >= 3:
+                        self._reachable = False
+                        self._broadcast({
+                            "unit_id": self.unit_id,
+                            "timestamp": datetime.utcnow().isoformat(),
+                            "feed_status": "unreachable",
+                        })
+                        last_send = loop.time()
+                        try:
+                            await alert_evaluator.device_offline(self.unit_id)
+                        except Exception as e:
+                            logger.warning(f"[MONITOR] {self.unit_id}: offline alert failed: {e}")
+
+                # Heartbeat: during quiet/offline stretches, send a keepalive so an
+                # idle WS isn't dropped by a reverse proxy. Not cached (new subscribers
+                # should still get the last real frame, not a heartbeat).
+                if loop.time() - last_send >= MONITOR_HEARTBEAT_S:
+                    self._broadcast({
+                        "unit_id": self.unit_id,
+                        "timestamp": datetime.utcnow().isoformat(),
+                        "feed_status": "ok" if self._reachable else "unreachable",
+                        "heartbeat": True,
+                    }, cache=False)
+                    last_send = loop.time()
+
+                await asyncio.sleep(self._next_delay())
+        finally:
+            logger.info(f"[MONITOR] {self.unit_id}: feed stopped")
+
+    def _next_delay(self) -> float:
+        """Inter-poll delay: exponential backoff while unreachable, full-rate while a
+        browser is watching, relaxed cadence when the feed is keepalive-only."""
+        if self._consec_fail > 0:
+            shift = min(self._consec_fail - 1, 6)  # cap growth at 2**6 = 64x base
+            delay = min(MONITOR_BACKOFF_BASE_S * (2 ** shift), MONITOR_BACKOFF_MAX_S)
+            if self._subscribers:
+                delay = min(delay, MONITOR_BACKOFF_WATCHED_MAX_S)
+            return delay
+        if self._subscribers:
+            return MONITOR_POLL_INTERVAL       # a browser is watching — smooth chart
+        return MONITOR_IDLE_POLL_INTERVAL      # keepalive-only (alerting) — save data
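Review note: the backoff branch of `_next_delay` can be tabulated on its own. The sketch below restates it as a free function with the default environment values passed as parameters; `backoff_delay` is a hypothetical name, not part of the module.

```python
def backoff_delay(fails: int, watched: bool,
                  base: float = 1.0, max_s: float = 60.0,
                  watched_max: float = 5.0) -> float:
    """Standalone restatement of the unreachable branch of _next_delay."""
    shift = min(fails - 1, 6)              # growth capped at 2**6
    delay = min(base * (2 ** shift), max_s)
    if watched:
        delay = min(delay, watched_max)    # recover quickly for a live viewer
    return delay


unwatched = [backoff_delay(n, watched=False) for n in range(1, 9)]
watched = [backoff_delay(n, watched=True) for n in range(1, 5)]
print(unwatched)  # [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
print(watched)    # [1.0, 2.0, 4.0, 5.0]
```

With the defaults, the 60 s ceiling takes effect at the seventh consecutive failure, and a watched feed never waits more than 5 s between attempts.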
+
+    async def _poll_once(self):
+        """One DOD poll: read, persist, return (snapshot, measurement_start_iso)."""
+        db = SessionLocal()
+        try:
+            cfg = db.query(NL43Config).filter_by(unit_id=self.unit_id).first()
+            if not cfg or not cfg.tcp_enabled:
+                return None, None
+            client = NL43Client(
+                cfg.host, cfg.tcp_port,
+                ftp_username=cfg.ftp_username, ftp_password=cfg.ftp_password,
+                ftp_port=cfg.ftp_port or 21,
+            )
+            # Refresh the run state only every MONITOR_STATE_REFRESH_S; reuse the
+            # cached state otherwise so most polls send just DOD? (one rate-limited
+            # command) instead of DOD? + Measure?.
+            now = asyncio.get_running_loop().time()
+            refresh_state = (self._cached_state is None
+                             or now - self._last_state_refresh >= MONITOR_STATE_REFRESH_S)
+            snap = await client.request_dod(
+                measurement_state=None if refresh_state else self._cached_state
+            )
+            if refresh_state:
+                self._cached_state = snap.measurement_state
+                self._last_state_refresh = now
+            snap.unit_id = self.unit_id
+            persist_snapshot(snap, db)
+            db.commit()
+            # Append to the downsampled backfill trail (~one row per TRAIL_SAMPLE_S).
+            if now - self._last_trail_store >= TRAIL_SAMPLE_S:
+                self._last_trail_store = now
+                self._store_trail(snap, db)
+            status = db.query(NL43Status).filter_by(unit_id=self.unit_id).first()
+            mst = (status.measurement_start_time.isoformat()
+                   if status and status.measurement_start_time else None)
+            return snap, mst
+        except Exception as e:
+            logger.warning(f"[MONITOR] {self.unit_id}: poll failed: {e}")
+            return None, None
+        finally:
+            db.close()
+
+    def _store_trail(self, snap, db) -> None:
+        """Append one downsampled reading to the backfill trail and prune old rows."""
+        from datetime import timedelta  # datetime is already imported at module level
+        from app.models import NL43Reading
+        try:
+            db.add(NL43Reading(
+                unit_id=self.unit_id, timestamp=datetime.utcnow(),
+                lp=snap.lp, leq=snap.leq, lmax=snap.lmax, ln1=snap.ln1, ln2=snap.ln2,
+            ))
+            cutoff = datetime.utcnow() - timedelta(hours=TRAIL_RETENTION_HOURS)
+            db.query(NL43Reading).filter(
+                NL43Reading.unit_id == self.unit_id,
+                NL43Reading.timestamp < cutoff,
+            ).delete()
+            db.commit()
+        except Exception as e:
+            logger.warning(f"[MONITOR] {self.unit_id}: trail store failed: {e}")
+
+    def _broadcast(self, payload: dict, cache: bool = True) -> None:
+        if cache:
+            self._last_payload = payload  # replayed to new subscribers
+        for q in list(self._subscribers):
+            try:
+                q.put_nowait(payload)
+            except asyncio.QueueFull:
+                # Slow consumer — drop this frame rather than stall the whole feed.
+                pass
+
+
+class MonitorManager:
+    """Registry of per-device monitors (one per unit_id)."""
+
+    def __init__(self):
+        self._monitors: Dict[str, DeviceMonitor] = {}
+        self._lock = asyncio.Lock()
+
+    async def get(self, unit_id: str) -> DeviceMonitor:
+        async with self._lock:
+            m = self._monitors.get(unit_id)
+            if m is None:
+                m = DeviceMonitor(unit_id)
+                self._monitors[unit_id] = m
+            return m
+
+    def is_active(self, unit_id: str) -> bool:
+        """True if this unit has a running monitor feed (so the background poller
+        can skip it — the monitor already polls it more often)."""
+        m = self._monitors.get(unit_id)
+        return m is not None and m.running
+
+    def status(self) -> dict:
+        return {
+            uid: {
+                "running": m.running,
+                "subscribers": m.subscriber_count(),
+                "keepalive": m._keepalive,
+                "reachable": m._reachable,
+                # what cadence the loop is currently using, for observability
+                "mode": ("backoff" if m._consec_fail > 0
+                         else "watched" if m._subscribers
+                         else "idle"),
+            }
+            for uid, m in self._monitors.items()
+        }
+
+
+# Module-level singleton
+monitor_manager = MonitorManager()
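
Review note: the drop-on-full fan-out used by `_broadcast` above can be sketched in isolation. This is a minimal illustration, not the SLMM `DeviceMonitor`; the `FanOut` class, its `maxsize` default, and the dropped-frame counter are assumptions made for the example.

```python
import asyncio
from typing import Optional


class FanOut:
    """Hypothetical stand-in for the monitor's subscriber fan-out (not the SLMM class)."""

    def __init__(self, maxsize: int = 2):
        self._subscribers = set()                  # one bounded queue per subscriber
        self._last_payload: Optional[dict] = None  # replayed to new subscribers
        self._maxsize = maxsize

    def subscribe(self) -> asyncio.Queue:
        q = asyncio.Queue(maxsize=self._maxsize)
        if self._last_payload is not None:
            q.put_nowait(self._last_payload)  # instant first frame from cache
        self._subscribers.add(q)
        return q

    def broadcast(self, payload: dict) -> int:
        """Deliver payload to every subscriber; return how many frames were dropped."""
        self._last_payload = payload
        dropped = 0
        for q in list(self._subscribers):
            try:
                q.put_nowait(payload)
            except asyncio.QueueFull:
                dropped += 1  # slow consumer: skip this frame, keep the feed moving
        return dropped
```

Because `put_nowait` never awaits, one stalled client cannot delay the poll loop or the other subscribers; the cost is that a slow client silently misses frames.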
@@ -3,6 +3,7 @@ from fastapi.responses import FileResponse
 from sqlalchemy.orm import Session
 from datetime import datetime
 from pydantic import BaseModel, field_validator, Field
+from typing import Optional
 import logging
 import ipaddress
 import json
@@ -10,7 +11,7 @@ import os
 import asyncio

 from app.database import get_db
-from app.models import NL43Config, NL43Status
+from app.models import NL43Config, NL43Status, AlertRule, AlertEvent, NL43Reading
 from app.services import NL43Client, persist_snapshot

 logger = logging.getLogger(__name__)
@@ -81,17 +82,431 @@ class ConfigPayload(BaseModel):
    @field_validator("poll_interval_seconds")
    @classmethod
    def validate_poll_interval(cls, v):
-        if v is not None and not (10 <= v <= 3600):
-            raise ValueError("Poll interval must be between 10 and 3600 seconds")
+        if v is not None and not (30 <= v <= 21600):
+            raise ValueError("Poll interval must be between 30 and 21600 seconds (30s to 6 hours)")
        return v


 class PollingConfigPayload(BaseModel):
    """Payload for updating device polling configuration."""
-    poll_interval_seconds: int | None = Field(None, ge=10, le=3600, description="Polling interval in seconds (10-3600)")
+    poll_interval_seconds: int | None = Field(None, ge=30, le=21600, description="Polling interval in seconds (30s to 6 hours)")
    poll_enabled: bool | None = Field(None, description="Enable or disable background polling for this device")


+# ============================================================================
+# TCP CONNECTION POOL ENDPOINTS (must be before /{unit_id} routes)
+# ============================================================================
+
+@router.get("/_connections/status")
+async def get_connection_pool_status():
+    """Get status of the persistent TCP connection pool.
+
+    Returns information about cached connections, keepalive settings,
+    and per-device connection age/idle times.
+    """
+    from app.services import _connection_pool
+    return {"status": "ok", "pool": _connection_pool.get_stats()}
+
+
+@router.post("/_connections/flush")
+async def flush_connection_pool():
+    """Close all cached TCP connections.
+
+    Useful for debugging or forcing fresh connections to all devices.
+    """
+    from app.services import _connection_pool
+    await _connection_pool.close_all()
+    # Restart cleanup task since close_all cancels it
+    _connection_pool.start_cleanup()
+    return {"status": "ok", "message": "All cached connections closed"}
+
+
+@router.post("/{unit_id}/disconnect")
+async def disconnect_device(unit_id: str, db: Session = Depends(get_db)):
+    """Cleanly close SLMM's persistent TCP connection to a single device.
+
+    Gracefully closes (TCP FIN + wait_closed) the pooled connection for this
+    device and removes it from the pool, freeing the NL43's single connection
+    slot. Idempotent — a no-op if no connection is currently cached.
+
+    Note: this releases the *idle* pooled connection. It does not interrupt an
+    in-progress DRD stream or an in-flight command (those have the socket
+    checked out of the pool) — close the stream WebSocket to end a live stream.
+    """
+    cfg = db.query(NL43Config).filter_by(unit_id=unit_id).first()
+    if not cfg:
+        raise HTTPException(status_code=404, detail="NL43 config not found")
+
+    from app.services import _connection_pool
+
+    device_key = f"{cfg.host}:{cfg.tcp_port}"
+    had_conn = device_key in _connection_pool.get_stats().get("connections", {})
+
+    await _connection_pool.discard(device_key)
+
+    return {
+        "status": "ok",
+        "unit_id": unit_id,
+        "device_key": device_key,
+        "disconnected": had_conn,
+        "message": "Connection closed" if had_conn else "No cached connection to close",
+    }
+
+
+@router.post("/{unit_id}/deactivate")
+async def deactivate_device(unit_id: str, db: Session = Depends(get_db)):
+    """Make a single unit dormant: stop background polling for it AND drop its
+    connection, freeing the device's connection slot. poll_enabled=False is
+    persisted, so the unit stays dormant across restarts until /activate.
+    """
+    cfg = db.query(NL43Config).filter_by(unit_id=unit_id).first()
+    if not cfg:
+        raise HTTPException(status_code=404, detail="NL43 config not found")
+
+    cfg.poll_enabled = False
+    db.commit()
+
+    from app.services import _connection_pool, _get_device_lock
+
+    device_key = f"{cfg.host}:{cfg.tcp_port}"
+
+    # Wait briefly for any in-flight poll/command to finish (so its connection is
+    # back in the pool), then drop it. If a long-lived stream holds the lock we
+    # don't block forever — discard the pooled connection regardless.
+    lock = await _get_device_lock(device_key)
+    acquired = False
+    try:
+        await asyncio.wait_for(lock.acquire(), timeout=10.0)
+        acquired = True
+    except asyncio.TimeoutError:
+        acquired = False
+    try:
+        await _connection_pool.discard(device_key)
+    finally:
+        if acquired:
+            lock.release()
+
+    return {
+        "status": "ok",
+        "unit_id": unit_id,
+        "poll_enabled": False,
+        "message": "Polling disabled and connection closed for this unit",
+    }
+
+
+@router.post("/{unit_id}/activate")
+async def activate_device(unit_id: str, db: Session = Depends(get_db)):
+    """Resume background polling for a unit previously deactivated."""
+    cfg = db.query(NL43Config).filter_by(unit_id=unit_id).first()
+    if not cfg:
+        raise HTTPException(status_code=404, detail="NL43 config not found")
+
+    cfg.poll_enabled = True
+    db.commit()
+
+    return {
+        "status": "ok",
+        "unit_id": unit_id,
+        "poll_enabled": True,
+        "message": "Polling enabled for this unit",
+    }
+
+
+@router.get("/_system/status")
+async def system_status():
+    """Report whether this SLMM instance is actively polling or in standby."""
+    from app.background_poller import poller
+    from app.services import _connection_pool
+    return {
+        "status": "ok",
+        "mode": "active" if poller.is_active() else "standby",
+        "polling_active": poller.is_active(),
+        "active_connections": _connection_pool.get_stats().get("active_connections", 0),
+    }
+
+
+@router.post("/_system/standby")
+async def system_standby():
+    """Put this SLMM instance into standby: stop polling ALL devices and release
+    every connection, so it stops occupying device slots (e.g. so a prod instance
+    can take over). Runtime-only — on restart the instance returns to its
+    SLMM_POLLING_ENABLED default.
+    """
+    from app.background_poller import poller
+    await poller.set_active(False)
+    return {"status": "ok", "mode": "standby",
+            "message": "Polling stopped and all device connections released"}
+
+
+@router.post("/_system/resume")
+async def system_resume():
+    """Resume polling after standby (global)."""
+    from app.background_poller import poller
+    await poller.set_active(True)
+    return {"status": "ok", "mode": "active", "message": "Polling resumed"}
+
+
+# ============================================================================
+# LIVE MONITOR (fan-out) — one DOD feed per device, broadcast to many clients
+# ============================================================================
+
+@router.websocket("/{unit_id}/monitor")
+async def monitor_stream(websocket: WebSocket, unit_id: str):
+    """Subscribe a browser to the device's shared 1 Hz DOD feed.
+
+    Any number of clients can attach without each opening its own device
+    connection (one poll loop per device, fanned out). Same JSON shape as the
+    DRD stream, but DOD-sourced so it includes ln1/ln2 (L1/L10).
+    """
+    await websocket.accept()
+    from app.monitor import monitor_manager
+
+    monitor = await monitor_manager.get(unit_id)
+    queue = await monitor.subscribe()
+    logger.info(f"Monitor subscriber attached for {unit_id} ({monitor.subscriber_count()} total)")
+
+    async def _watch_disconnect():
+        # Completes when the client disconnects, so an idle feed (no data) still
+        # detects the drop and we don't leak a subscription that keeps the device
+        # feed (and its connection) alive.
+        try:
+            while True:
+                msg = await websocket.receive()
+                if msg.get("type") == "websocket.disconnect":
+                    return
+        except Exception:
+            return
+
+    gone = asyncio.create_task(_watch_disconnect())
+    try:
+        while not gone.done():
+            try:
+                payload = await asyncio.wait_for(queue.get(), timeout=1.0)
+            except asyncio.TimeoutError:
+                continue  # re-check gone.done()
+            if gone.done():
+                break  # client disconnected while we waited — don't send into a closing socket
+            await websocket.send_json(payload)
+    except WebSocketDisconnect:
+        logger.info(f"Monitor subscriber disconnected for {unit_id}")
+    except Exception as e:
+        # A frame that races the close (client vanished mid-send) surfaces as
+        # "Unexpected ASGI message 'websocket.send' after ... websocket.close".
+        # That's expected on disconnect (the portal closes the socket on every tab
+        # switch), not an error — log it quietly.
+        msg = str(e)
+        if "after sending" in msg or "websocket.close" in msg or "response already completed" in msg:
+            logger.debug(f"Monitor stream for {unit_id} closed mid-send (client gone)")
+        else:
+            logger.warning(f"Monitor stream error for {unit_id}: {e}")
+    finally:
+        gone.cancel()
+        await monitor.unsubscribe(queue)
+
+
+@router.post("/{unit_id}/monitor/start")
+async def monitor_start(unit_id: str, db: Session = Depends(get_db)):
+    """Enable 24/7 keepalive monitoring: persist monitor_enabled and start the feed
+    now, so alerting evaluates continuously even with no viewer. Survives restarts
+    (auto-started on boot from the persisted flag)."""
+    cfg = db.query(NL43Config).filter_by(unit_id=unit_id).first()
+    if cfg:
+        cfg.monitor_enabled = True
+        db.commit()
+    from app.monitor import monitor_manager
+    monitor = await monitor_manager.get(unit_id)
+    await monitor.set_keepalive(True)
+    return {"status": "ok", "unit_id": unit_id, "monitor_enabled": True, "running": monitor.running}
+
+
+@router.post("/{unit_id}/monitor/stop")
+async def monitor_stop(unit_id: str, db: Session = Depends(get_db)):
+    """Disable keepalive monitoring: persist monitor_enabled=False and drop the
+    keepalive (the feed stops once no browser subscribers remain)."""
+    cfg = db.query(NL43Config).filter_by(unit_id=unit_id).first()
+    if cfg:
+        cfg.monitor_enabled = False
+        db.commit()
+    from app.monitor import monitor_manager
+    monitor = await monitor_manager.get(unit_id)
+    await monitor.set_keepalive(False)
+    return {"status": "ok", "unit_id": unit_id, "monitor_enabled": False}
+
+
+@router.get("/_monitor/status")
+async def monitor_status():
+    """Status of every device monitor (running, subscribers, keepalive, reachable, mode)."""
+    from app.monitor import monitor_manager
+    return {"status": "ok", "monitors": monitor_manager.status()}
+
+
+@router.get("/{unit_id}/history")
+def get_monitor_history(unit_id: str, hours: float = 2.0, db: Session = Depends(get_db)):
+    """Recent downsampled monitor readings (the DOD trail) for the live-chart
+    backfill. Viewing only — NOT the FTP report data."""
+    from datetime import timedelta
+    hours = max(0.1, min(hours, 48.0))
+    cutoff = datetime.utcnow() - timedelta(hours=hours)
+    rows = (db.query(NL43Reading)
+            .filter(NL43Reading.unit_id == unit_id, NL43Reading.timestamp >= cutoff)
+            .order_by(NL43Reading.timestamp.asc()).all())
+    return {
+        "status": "ok",
+        "unit_id": unit_id,
+        "hours": hours,
+        "count": len(rows),
+        "readings": [
+            {
+                "timestamp": r.timestamp.isoformat() if r.timestamp else None,
+                "lp": r.lp, "leq": r.leq, "lmax": r.lmax, "ln1": r.ln1, "ln2": r.ln2,
+            }
+            for r in rows
+        ],
+    }
+
+
+# ============================================================================
+# ALERTS — threshold rules + fired events
+# ============================================================================
+
+class AlertRulePayload(BaseModel):
+    name: str = "Alert"
+    metric: str = "lp"            # lp/leq/lmax/lmin/lpeak/ln1/ln2
+    comparison: str = "above"     # above | below
+    threshold_db: float
+    duration_s: int = 0           # sustained seconds before firing (0 = instant)
+    clear_margin_db: float = 2.0  # hysteresis band
+    cooldown_s: int = 300
+    schedule_start: str | None = None  # "HH:MM" local; null = always
+    schedule_end: str | None = None
+    schedule_days: str | None = None   # CSV of 0-6 (Mon=0); null = every day
+    channels: str = "log"
+    recipients: str | None = None
+    enabled: bool = True
+
+
+def _rule_dict(r: AlertRule) -> dict:
+    return {
+        "id": r.id, "unit_id": r.unit_id, "name": r.name, "metric": r.metric,
+        "comparison": r.comparison, "threshold_db": r.threshold_db,
+        "duration_s": r.duration_s, "clear_margin_db": r.clear_margin_db,
+        "cooldown_s": r.cooldown_s, "schedule_start": r.schedule_start,
+        "schedule_end": r.schedule_end, "schedule_days": r.schedule_days,
+        "channels": r.channels, "recipients": r.recipients, "enabled": r.enabled,
+    }
+
+
+def _event_dict(e: AlertEvent) -> dict:
+    return {
+        "id": e.id, "rule_id": e.rule_id, "unit_id": e.unit_id,
+        "rule_name": e.rule_name, "metric": e.metric, "threshold_db": e.threshold_db,
+        "onset_at": e.onset_at.isoformat() if e.onset_at else None,
+        "onset_value": e.onset_value, "peak_value": e.peak_value,
+        "clear_at": e.clear_at.isoformat() if e.clear_at else None,
+        "status": e.status,
+        "acknowledged_at": e.acknowledged_at.isoformat() if e.acknowledged_at else None,
+        "acknowledged_by": e.acknowledged_by,
+    }
+
+
+async def _sync_keepalive_to_rules(unit_id: str, db: Session):
+    """Keep a unit's monitor running while it has enabled alert rules, so the
+    evaluator runs 24/7 even with no browser watching. Turns keepalive ON (and
+    persists monitor_enabled so it survives a restart via the boot auto-start)
+    when enabled rules exist; never turns it OFF — a device may be kept alive for
+    other reasons, so operators control that on /admin/slmm."""
+    has_enabled = (db.query(AlertRule)
+                   .filter_by(unit_id=unit_id, enabled=True).first() is not None)
+    if not has_enabled:
+        return
+    cfg = db.query(NL43Config).filter_by(unit_id=unit_id).first()
+    if cfg and not cfg.monitor_enabled:
+        cfg.monitor_enabled = True
+        db.commit()
+    from app.monitor import monitor_manager
+    m = await monitor_manager.get(unit_id)
+    await m.set_keepalive(True)
+
+
+def _reset_rule_runtime(unit_id: str, rule_id: int, db: Session):
+    """After a rule edit/delete: drop its evaluator state machine and close any open
+    event, so a stale 'active' phase doesn't mis-evaluate against the new config and
+    the client portal doesn't stay 'in alarm' on a rule that changed or is gone."""
+    from app.alerts import alert_evaluator
+    alert_evaluator.forget_rule(unit_id, rule_id)
+    now = datetime.utcnow()
+    for evt in db.query(AlertEvent).filter_by(unit_id=unit_id, rule_id=rule_id, status="active").all():
+        evt.clear_at = now
+        evt.status = "cleared"
+    db.commit()
+
+
+@router.post("/{unit_id}/alerts/rules")
+async def create_alert_rule(unit_id: str, payload: AlertRulePayload, db: Session = Depends(get_db)):
+    rule = AlertRule(unit_id=unit_id, **payload.model_dump())
+    db.add(rule)
+    db.commit()
+    db.refresh(rule)
+    from app.alerts import alert_evaluator
+    alert_evaluator.invalidate(unit_id)
+    await _sync_keepalive_to_rules(unit_id, db)
+    return {"status": "ok", "rule": _rule_dict(rule)}
+
+
+@router.get("/{unit_id}/alerts/rules")
+def list_alert_rules(unit_id: str, db: Session = Depends(get_db)):
+    rules = db.query(AlertRule).filter_by(unit_id=unit_id).all()
+    return {"status": "ok", "rules": [_rule_dict(r) for r in rules]}
+
+
+@router.put("/{unit_id}/alerts/rules/{rule_id}")
+async def update_alert_rule(unit_id: str, rule_id: int, payload: AlertRulePayload, db: Session = Depends(get_db)):
+    rule = db.query(AlertRule).filter_by(id=rule_id, unit_id=unit_id).first()
+    if not rule:
+        raise HTTPException(status_code=404, detail="Alert rule not found")
+    for field, value in payload.model_dump().items():
+        setattr(rule, field, value)
+    db.commit()
+    db.refresh(rule)
+    from app.alerts import alert_evaluator
+    alert_evaluator.invalidate(unit_id)
+    _reset_rule_runtime(unit_id, rule_id, db)
+    await _sync_keepalive_to_rules(unit_id, db)
+    return {"status": "ok", "rule": _rule_dict(rule)}
+
+
+@router.delete("/{unit_id}/alerts/rules/{rule_id}")
+async def delete_alert_rule(unit_id: str, rule_id: int, db: Session = Depends(get_db)):
+    rule = db.query(AlertRule).filter_by(id=rule_id, unit_id=unit_id).first()
+    if not rule:
+        raise HTTPException(status_code=404, detail="Alert rule not found")
+    db.delete(rule)
+    db.commit()
+    from app.alerts import alert_evaluator
+    alert_evaluator.invalidate(unit_id)
+    _reset_rule_runtime(unit_id, rule_id, db)   # close its open event so the portal doesn't stay red
+    await _sync_keepalive_to_rules(unit_id, db)  # no-op if no enabled rules remain
+    return {"status": "ok", "deleted": rule_id}
+
+
+@router.get("/{unit_id}/alerts/events")
+def list_alert_events(unit_id: str, limit: int = 50, db: Session = Depends(get_db)):
+    events = (db.query(AlertEvent).filter_by(unit_id=unit_id)
+              .order_by(AlertEvent.onset_at.desc()).limit(limit).all())
+    return {"status": "ok", "events": [_event_dict(e) for e in events]}
+
+
+@router.post("/{unit_id}/alerts/events/{event_id}/ack")
+def ack_alert_event(unit_id: str, event_id: int, by: str | None = None, db: Session = Depends(get_db)):
+    evt = db.query(AlertEvent).filter_by(id=event_id, unit_id=unit_id).first()
+    if not evt:
+        raise HTTPException(status_code=404, detail="Alert event not found")
+    evt.acknowledged_at = datetime.utcnow()
+    evt.acknowledged_by = by
+    db.commit()
+    return {"status": "ok", "acknowledged": event_id}
+
+
 # ============================================================================
 # GLOBAL POLLING STATUS ENDPOINT (must be before /{unit_id} routes)
 # ============================================================================
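
Review note: the `deactivate` handler in this hunk waits up to 10 s for the per-device lock and discards the pooled connection whether or not it got the lock. The sketch below isolates that pattern; `run_with_optional_lock` and its `action` callback are hypothetical names for illustration, not SLMM helpers.

```python
import asyncio


async def run_with_optional_lock(lock: asyncio.Lock, action, timeout: float = 10.0) -> bool:
    """Run `action()` under `lock` if it can be acquired within `timeout`;
    otherwise run it anyway. Returns whether the lock was held."""
    acquired = False
    try:
        await asyncio.wait_for(lock.acquire(), timeout=timeout)
        acquired = True
    except asyncio.TimeoutError:
        pass  # a long-lived holder (e.g. a stream) keeps the lock; proceed without it
    try:
        await action()
    finally:
        if acquired:
            lock.release()  # only release a lock this call actually took
    return acquired
```

Tracking `acquired` matters because releasing an `asyncio.Lock` that this call never took would either raise `RuntimeError` or free the lock out from under its real holder.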
@@ -168,6 +583,7 @@ def get_roster(db: Session = Depends(get_db)):
            "web_enabled": cfg.web_enabled,
            "poll_enabled": cfg.poll_enabled,
            "poll_interval_seconds": cfg.poll_interval_seconds,
+            "monitor_enabled": cfg.monitor_enabled,
            "status": None
        }

@@ -233,8 +649,8 @@ class RosterCreatePayload(BaseModel):
    @field_validator("poll_interval_seconds")
    @classmethod
    def validate_poll_interval(cls, v):
-        if v is not None and not (10 <= v <= 3600):
-            raise ValueError("Poll interval must be between 10 and 3600 seconds")
+        if v is not None and not (30 <= v <= 21600):
+            raise ValueError("Poll interval must be between 30 and 21600 seconds (30s to 6 hours)")
        return v
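
Review note: the 30 to 21600 second bounds now appear in both validators and in the `Field(ge=..., le=...)` declaration. One possible way to keep them in sync is a shared pair of constants, sketched below in plain Python. The constant names and the `check_poll_interval` helper are assumptions for illustration and do not exist in the diff.

```python
from typing import Optional

# Hypothetical shared bounds; the diff hard-codes 30 and 21600 in each validator.
MIN_POLL_INTERVAL_S = 30
MAX_POLL_INTERVAL_S = 21600  # 6 hours


def check_poll_interval(v: Optional[int]) -> Optional[int]:
    """Return v unchanged if it is None or within bounds; raise ValueError otherwise."""
    if v is not None and not (MIN_POLL_INTERVAL_S <= v <= MAX_POLL_INTERVAL_S):
        raise ValueError(
            f"Poll interval must be between {MIN_POLL_INTERVAL_S} and "
            f"{MAX_POLL_INTERVAL_S} seconds (30s to 6 hours)"
        )
    return v
```

Note that values between 10 and 29 seconds, accepted before this change, are now rejected, so any stored configs in that range would fail validation on their next update.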


@@ -416,11 +832,14 @@ def get_status(unit_id: str, db: Session = Depends(get_db)):
            "unit_id": unit_id,
            "last_seen": status.last_seen.isoformat() if status.last_seen else None,
            "measurement_state": status.measurement_state,
+            "measurement_start_time": status.measurement_start_time.isoformat() if status.measurement_start_time else None,
            "lp": status.lp,
            "leq": status.leq,
            "lmax": status.lmax,
            "lmin": status.lmin,
            "lpeak": status.lpeak,
+            "ln1": status.ln1,
+            "ln2": status.ln2,
            "battery_level": status.battery_level,
            "power_source": status.power_source,
            "sd_remaining_mb": status.sd_remaining_mb,
@@ -443,6 +862,8 @@ class StatusPayload(BaseModel):
    lmax: str | None = None
    lmin: str | None = None
    lpeak: str | None = None
+    ln1: str | None = None
+    ln2: str | None = None
    battery_level: str | None = None
    power_source: str | None = None
    sd_remaining_mb: str | None = None
@@ -470,11 +891,14 @@ def upsert_status(unit_id: str, payload: StatusPayload, db: Session = Depends(ge
            "unit_id": unit_id,
            "last_seen": status.last_seen.isoformat(),
            "measurement_state": status.measurement_state,
+            "measurement_start_time": status.measurement_start_time.isoformat() if status.measurement_start_time else None,
            "lp": status.lp,
            "leq": status.leq,
            "lmax": status.lmax,
            "lmin": status.lmin,
            "lpeak": status.lpeak,
+            "ln1": status.ln1,
+            "ln2": status.ln2,
            "battery_level": status.battery_level,
            "power_source": status.power_source,
            "sd_remaining_mb": status.sd_remaining_mb,
@@ -515,7 +939,7 @@ async def start_measurement(unit_id: str, db: Session = Depends(get_db)):
            db.expire_all()
            status = db.query(NL43Status).filter_by(unit_id=unit_id).first()
            logger.info(f"State check: measurement_state={status.measurement_state if status else 'None'}, start_time={status.measurement_start_time if status else 'None'}")
-            if status and status.measurement_state == "Measure" and status.measurement_start_time:
+            if status and status.measurement_state in ("Start", "Measure") and status.measurement_start_time:
                logger.info(f"✓ Measurement state confirmed for {unit_id} with start time {status.measurement_start_time}")
                break

@@ -544,12 +968,6 @@ async def stop_measurement(unit_id: str, db: Session = Depends(get_db)):
    try:
        await client.stop()
        logger.info(f"Stopped measurement on unit {unit_id}")
-
-        # Query device status to update database with "Stop" state
-        snap = await client.request_dod()
-        snap.unit_id = unit_id
-        persist_snapshot(snap, db)
-
    except ConnectionError as e:
        logger.error(f"Failed to stop measurement on {unit_id}: {e}")
        raise HTTPException(status_code=502, detail="Failed to communicate with device")
@@ -559,6 +977,15 @@ async def stop_measurement(unit_id: str, db: Session = Depends(get_db)):
    except Exception as e:
        logger.error(f"Unexpected error stopping measurement on {unit_id}: {e}")
        raise HTTPException(status_code=500, detail="Internal server error")
+
+    # Query device status to update database — non-fatal if this fails
+    try:
+        snap = await client.request_dod()
+        snap.unit_id = unit_id
+        persist_snapshot(snap, db)
+    except Exception as e:
+        logger.warning(f"Stop succeeded but failed to update status for {unit_id}: {e}")
+
    return {"status": "ok", "message": "Measurement stopped"}
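
Review note: the change above separates the stop command (whose failures still map to HTTP errors) from the follow-up status refresh (now best-effort). A minimal sketch of that split follows, assuming hypothetical `stop` and `refresh` coroutines; the `status_refreshed` field is illustrative only and is not returned by the real endpoint.

```python
import asyncio
import logging

logger = logging.getLogger("sketch")


async def stop_then_refresh(stop, refresh) -> dict:
    """Run the primary action, then a best-effort follow-up."""
    await stop()  # failures here propagate to the caller
    refreshed = True
    try:
        await refresh()
    except Exception as e:
        # The stop already succeeded, so a failed refresh only means stale cached status.
        refreshed = False
        logger.warning("Stop succeeded but status refresh failed: %s", e)
    return {"status": "ok", "status_refreshed": refreshed}
```

With this ordering, a flaky cellular link during the refresh no longer turns a successful stop into a reported failure.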


@@ -656,8 +1083,9 @@ async def stop_cycle(unit_id: str, payload: StopCyclePayload = None, db: Session
        return {"status": "ok", "unit_id": unit_id, **result}

    except Exception as e:
-        logger.error(f"Stop cycle failed for {unit_id}: {e}")
-        raise HTTPException(status_code=502, detail=str(e))
+        error_msg = str(e) or f"{type(e).__name__}: No details available"
+        logger.error(f"Stop cycle failed for {unit_id}: {error_msg}")
+        raise HTTPException(status_code=502, detail=error_msg)


@router.post("/{unit_id}/store")
@@ -1172,6 +1600,8 @@ async def stream_live(websocket: WebSocket, unit_id: str):
                    "lmax": snap.lmax,  # Maximum level
                    "lmin": snap.lmin,  # Minimum level
                    "lpeak": snap.lpeak,  # Peak level
+                    "ln1": snap.ln1,    # LN1 percentile (L1/L10 contract); null on DRD stream
+                    "ln2": snap.ln2,    # LN2 percentile; null on DRD stream
                    "raw_payload": snap.raw_payload,
                })
            except Exception as e:
@@ -1722,74 +2152,38 @@ async def run_diagnostics(unit_id: str, db: Session = Depends(get_db)):
        "message": "TCP communication enabled"
    }

-    # Test 3: Modem/Router reachable (check port 443 HTTPS)
+    # Test 3: TCP connection reachable (device port) — uses connection pool
+    # This avoids extra TCP handshakes over cellular. If a cached connection
+    # exists and is alive, we skip the handshake entirely.
+    from app.services import _connection_pool
+    device_key = f"{cfg.host}:{cfg.tcp_port}"
    try:
-        reader, writer = await asyncio.wait_for(
-            asyncio.open_connection(cfg.host, 443), timeout=3.0
-        )
-        writer.close()
-        await writer.wait_closed()
-        diagnostics["tests"]["modem_reachable"] = {
-            "status": "pass",
-            "message": f"Modem/router reachable at {cfg.host}"
-        }
-    except asyncio.TimeoutError:
-        diagnostics["tests"]["modem_reachable"] = {
-            "status": "fail",
-            "message": f"Modem/router timeout at {cfg.host} (network issue)"
-        }
-        diagnostics["overall_status"] = "fail"
-        return diagnostics
-    except ConnectionRefusedError:
-        # Connection refused means host is up but port 443 closed - that's ok
-        diagnostics["tests"]["modem_reachable"] = {
-            "status": "pass",
-            "message": f"Modem/router reachable at {cfg.host} (HTTPS closed)"
-        }
-    except Exception as e:
-        diagnostics["tests"]["modem_reachable"] = {
-            "status": "fail",
-            "message": f"Cannot reach modem/router at {cfg.host}: {str(e)}"
-        }
-        diagnostics["overall_status"] = "fail"
-        return diagnostics
-
-    # Test 4: TCP connection reachable (device port)
-    try:
-        reader, writer = await asyncio.wait_for(
-            asyncio.open_connection(cfg.host, cfg.tcp_port), timeout=3.0
-        )
-        writer.close()
-        await writer.wait_closed()
-        diagnostics["tests"]["tcp_connection"] = {
-            "status": "pass",
-            "message": f"TCP connection successful to {cfg.host}:{cfg.tcp_port}"
-        }
-    except asyncio.TimeoutError:
-        diagnostics["tests"]["tcp_connection"] = {
-            "status": "fail",
-            "message": f"Connection timeout to {cfg.host}:{cfg.tcp_port}"
-        }
-        diagnostics["overall_status"] = "fail"
-        return diagnostics
-    except ConnectionRefusedError:
-        diagnostics["tests"]["tcp_connection"] = {
-            "status": "fail",
-            "message": f"Connection refused by {cfg.host}:{cfg.tcp_port}"
-        }
-        diagnostics["overall_status"] = "fail"
-        return diagnostics
+        pool_stats = _connection_pool.get_stats()
+        conn_info = pool_stats["connections"].get(device_key)
+        if conn_info and conn_info["alive"]:
+            # Pool already has a live connection — device is reachable
+            diagnostics["tests"]["tcp_connection"] = {
+                "status": "pass",
+                "message": f"TCP connection alive in pool for {cfg.host}:{cfg.tcp_port}"
+            }
+        else:
+            # Acquire through the pool (opens new if needed, keeps it cached)
+            reader, writer, from_cache = await _connection_pool.acquire(
+                device_key, cfg.host, cfg.tcp_port, timeout=3.0
+            )
+            await _connection_pool.release(device_key, reader, writer, cfg.host, cfg.tcp_port)
+            diagnostics["tests"]["tcp_connection"] = {
+                "status": "pass",
+                "message": f"TCP connection successful to {cfg.host}:{cfg.tcp_port}"
+            }
    except Exception as e:
        diagnostics["tests"]["tcp_connection"] = {
            "status": "fail",
-            "message": f"Connection error: {str(e)}"
+            "message": f"Connection error to {cfg.host}:{cfg.tcp_port}: {str(e)}"
        }
        diagnostics["overall_status"] = "fail"
        return diagnostics

-    # Wait a bit after connection test to let device settle
-    await asyncio.sleep(1.5)
-
    # Test 5: Device responds to commands
    # Use longer timeout to account for rate limiting (device requires ≥1s between commands)
    client = NL43Client(cfg.host, cfg.tcp_port, timeout=10.0, ftp_username=cfg.ftp_username, ftp_password=cfg.ftp_password)
@@ -1842,9 +2236,136 @@ async def run_diagnostics(unit_id: str, db: Session = Depends(get_db)):

    # All tests passed
    diagnostics["overall_status"] = "pass"
+
+    # Add database dump: config and status cache
+    diagnostics["database_dump"] = {
+        "config": {
+            "unit_id": cfg.unit_id,
+            "host": cfg.host,
+            "tcp_port": cfg.tcp_port,
+            "tcp_enabled": cfg.tcp_enabled,
+            "ftp_enabled": cfg.ftp_enabled,
+            "ftp_port": cfg.ftp_port,
+            "ftp_username": cfg.ftp_username,
+            "ftp_password": "***" if cfg.ftp_password else None,  # Mask password
+            "web_enabled": cfg.web_enabled,
+            "poll_interval_seconds": cfg.poll_interval_seconds,
+            "poll_enabled": cfg.poll_enabled
+        },
+        "status_cache": None
+    }
+
+    # Get cached status if available
+    status = db.query(NL43Status).filter_by(unit_id=unit_id).first()
+    if status:
+        # Helper to format datetime as ISO with Z suffix to indicate UTC
+        def to_utc_iso(dt):
+            return dt.isoformat() + 'Z' if dt else None
+
+        diagnostics["database_dump"]["status_cache"] = {
+            "unit_id": status.unit_id,
+            "last_seen": to_utc_iso(status.last_seen),
+            "measurement_state": status.measurement_state,
+            "measurement_start_time": to_utc_iso(status.measurement_start_time),
+            "counter": status.counter,
+            "lp": status.lp,
+            "leq": status.leq,
+            "lmax": status.lmax,
+            "lmin": status.lmin,
+            "lpeak": status.lpeak,
+            "ln1": status.ln1,
+            "ln2": status.ln2,
+            "battery_level": status.battery_level,
+            "power_source": status.power_source,
+            "sd_remaining_mb": status.sd_remaining_mb,
+            "sd_free_ratio": status.sd_free_ratio,
+            "is_reachable": status.is_reachable,
+            "consecutive_failures": status.consecutive_failures,
+            "last_poll_attempt": to_utc_iso(status.last_poll_attempt),
+            "last_success": to_utc_iso(status.last_success),
+            "last_error": status.last_error,
+            "raw_payload": status.raw_payload
+        }
+
    return diagnostics


+# ============================================================================
+# DEVICE LOGS ENDPOINTS
+# ============================================================================
+
+@router.get("/{unit_id}/logs")
+def get_device_logs(
+    unit_id: str,
+    limit: int = 100,
+    offset: int = 0,
+    level: Optional[str] = None,
+    category: Optional[str] = None,
+    db: Session = Depends(get_db)
+):
+    """
+    Get log entries for a specific device.
+
+    Query parameters:
+    - limit: Max entries to return (default: 100, max: 1000)
+    - offset: Number of entries to skip (for pagination)
+    - level: Filter by level (DEBUG, INFO, WARNING, ERROR)
+    - category: Filter by category (TCP, FTP, POLL, COMMAND, STATE, SYNC)
+
+    Returns newest entries first.
+    """
+    from app.device_logger import get_device_logs as fetch_logs, get_log_stats
+
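+    # Clamp pagination inputs: limit to 0..1000, offset to >= 0 (SQLite treats a negative LIMIT as "no limit")
+    limit, offset = max(0, min(limit, 1000)), max(0, offset)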
+
+    logs = fetch_logs(
+        unit_id=unit_id,
+        limit=limit,
+        offset=offset,
+        level=level,
+        category=category,
+        db=db
+    )
+
+    stats = get_log_stats(unit_id, db)
+
+    return {
+        "status": "ok",
+        "unit_id": unit_id,
+        "logs": logs,
+        "count": len(logs),
+        "stats": stats,
+        "filters": {
+            "level": level,
+            "category": category
+        },
+        "pagination": {
+            "limit": limit,
+            "offset": offset
+        }
+    }
+
+
+@router.delete("/{unit_id}/logs")
+def clear_device_logs(unit_id: str, db: Session = Depends(get_db)):
+    """
+    Clear all log entries for a specific device.
+    """
+    from app.models import DeviceLog
+
+    deleted = db.query(DeviceLog).filter(DeviceLog.unit_id == unit_id).delete()
+    db.commit()
+
+    logger.info(f"Cleared {deleted} log entries for device {unit_id}")
+
+    return {
+        "status": "ok",
+        "message": f"Cleared {deleted} log entries for {unit_id}",
+        "deleted_count": deleted
+    }
+
+
 # ============================================================================
 # BACKGROUND POLLING CONFIGURATION ENDPOINTS
 # ============================================================================
@@ -1880,7 +2401,7 @@ def update_polling_config(
    """
    Update background polling configuration for a device.

-    Allows configuring the polling interval (10-3600 seconds) and
+    Allows configuring the polling interval (30-21600 seconds, i.e. 30s to 6 hours) and
    enabling/disabling automatic background polling per device.

    Changes take effect on the next polling cycle.
@@ -1891,10 +2412,15 @@ def update_polling_config(

    # Update interval if provided
    if payload.poll_interval_seconds is not None:
-        if payload.poll_interval_seconds < 10:
+        if payload.poll_interval_seconds < 30:
            raise HTTPException(
                status_code=400,
-                detail="Polling interval must be at least 10 seconds"
+                detail="Polling interval must be at least 30 seconds"
+            )
+        if payload.poll_interval_seconds > 21600:
+            raise HTTPException(
+                status_code=400,
+                detail="Polling interval must be at most 21600 seconds (6 hours)"
            )
        cfg.poll_interval_seconds = payload.poll_interval_seconds

@@ -0,0 +1,73 @@
+#!/usr/bin/env python3
+"""
+Database migration: Add device_logs table.
+
+This table stores per-device log entries for debugging and audit trail.
+
+Run this once to add the new table.
+"""
+
+import sqlite3
+import os
+
+# Path to the SLMM database
+DB_PATH = os.path.join(os.path.dirname(__file__), "data", "slmm.db")
+
+
+def migrate():
+    print(f"Adding device_logs table to: {DB_PATH}")
+
+    if not os.path.exists(DB_PATH):
+        print("Database does not exist yet. Table will be created automatically on first run.")
+        return
+
+    conn = sqlite3.connect(DB_PATH)
+    cursor = conn.cursor()
+
+    try:
+        # Check if table already exists
+        cursor.execute("""
+            SELECT name FROM sqlite_master
+            WHERE type='table' AND name='device_logs'
+        """)
+        if cursor.fetchone():
+            print("✓ device_logs table already exists, no migration needed")
+            return
+
+        # Create the table
+        print("Creating device_logs table...")
+        cursor.execute("""
+            CREATE TABLE device_logs (
+                id INTEGER PRIMARY KEY AUTOINCREMENT,
+                unit_id VARCHAR NOT NULL,
+                timestamp DATETIME DEFAULT CURRENT_TIMESTAMP,
+                level VARCHAR DEFAULT 'INFO',
+                category VARCHAR DEFAULT 'GENERAL',
+                message TEXT NOT NULL
+            )
+        """)
+
+        # Create indexes for efficient querying
+        print("Creating indexes...")
+        cursor.execute("CREATE INDEX ix_device_logs_unit_id ON device_logs (unit_id)")
+        cursor.execute("CREATE INDEX ix_device_logs_timestamp ON device_logs (timestamp)")
+
+        conn.commit()
+        print("✓ Created device_logs table with indexes")
+
+        # Verify
+        cursor.execute("""
+            SELECT name FROM sqlite_master
+            WHERE type='table' AND name='device_logs'
+        """)
+        if not cursor.fetchone():
+            raise RuntimeError("device_logs table was not created successfully")
+
+        print("✓ Migration completed successfully")
+
+    finally:
+        conn.close()
+
+
+if __name__ == "__main__":
+    migrate()
@@ -0,0 +1,58 @@
+#!/usr/bin/env python3
+"""
+Migration script to add ln1 and ln2 percentile columns to the nl43_status table.
+
+The NL-43 DOD response carries percentile slots LN1-LN5; the live SLM display
+(Terra-View) shows two of them (default L1/L10). This adds storage for the two
+surfaced slots. Run once per database to update existing schema.
+"""
+
+import sqlite3
+import sys
+from pathlib import Path
+
+DB_PATH = Path(__file__).parent / "data" / "slmm.db"
+
+
+def migrate():
+    """Add ln1 and ln2 columns to the nl43_status table."""
+
+    if not DB_PATH.exists():
+        print(f"Database not found at {DB_PATH}")
+        print("No migration needed - database will be created with new schema")
+        return
+
+    conn = sqlite3.connect(DB_PATH)
+    cursor = conn.cursor()
+
+    try:
+        cursor.execute("PRAGMA table_info(nl43_status)")
+        columns = [row[1] for row in cursor.fetchall()]
+
+        if "ln1" in columns and "ln2" in columns:
+            print("✓ ln1/ln2 columns already exist, no migration needed")
+            return
+
+        if "ln1" not in columns:
+            print("Adding ln1 column...")
+            cursor.execute("ALTER TABLE nl43_status ADD COLUMN ln1 TEXT")
+            print("✓ Added ln1 column")
+
+        if "ln2" not in columns:
+            print("Adding ln2 column...")
+            cursor.execute("ALTER TABLE nl43_status ADD COLUMN ln2 TEXT")
+            print("✓ Added ln2 column")
+
+        conn.commit()
+        print("\n✓ Migration completed successfully!")
+
+    except Exception as e:
+        conn.rollback()
+        print(f"✗ Migration failed: {e}", file=sys.stderr)
+        sys.exit(1)
+    finally:
+        conn.close()
+
+
+if __name__ == "__main__":
+    migrate()
@@ -0,0 +1,48 @@
+#!/usr/bin/env python3
+"""
+Migration: add monitor_enabled column to nl43_config.
+
+Controls whether the live fan-out DOD monitor is kept alive 24/7 for a unit
+(which is what makes alerting continuous). Defaults to enabled. Run once per DB.
+"""
+
+import sqlite3
+import sys
+from pathlib import Path
+
+DB_PATH = Path(__file__).parent / "data" / "slmm.db"
+
+
+def migrate():
+    if not DB_PATH.exists():
+        print(f"Database not found at {DB_PATH}")
+        print("No migration needed - database will be created with new schema")
+        return
+
+    conn = sqlite3.connect(DB_PATH)
+    cursor = conn.cursor()
+    try:
+        cursor.execute("PRAGMA table_info(nl43_config)")
+        columns = [row[1] for row in cursor.fetchall()]
+
+        if "monitor_enabled" in columns:
+            print("✓ monitor_enabled column already exists, no migration needed")
+            return
+
+        print("Adding monitor_enabled column (default enabled)...")
+        # SQLite stores booleans as 0/1; default 1 = enabled.
+        cursor.execute("ALTER TABLE nl43_config ADD COLUMN monitor_enabled BOOLEAN DEFAULT 1")
+        conn.commit()
+        print("✓ Added monitor_enabled column")
+        print("\n✓ Migration completed successfully!")
+
+    except Exception as e:
+        conn.rollback()
+        print(f"✗ Migration failed: {e}", file=sys.stderr)
+        sys.exit(1)
+    finally:
+        conn.close()
+
+
+if __name__ == "__main__":
+    migrate()
@@ -0,0 +1,60 @@
+#!/usr/bin/env python3
+"""
+Database migration: Add start_time_sync_attempted field to nl43_status table.
+
+This field tracks whether FTP sync has been attempted for the current measurement,
+preventing repeated sync attempts when FTP fails.
+
+Run this once to add the new column.
+"""
+
+import sqlite3
+import os
+
+# Path to the SLMM database
+DB_PATH = os.path.join(os.path.dirname(__file__), "data", "slmm.db")
+
+
+def migrate():
+    print(f"Adding start_time_sync_attempted field to: {DB_PATH}")
+
+    if not os.path.exists(DB_PATH):
+        print("Database does not exist yet. Column will be created automatically.")
+        return
+
+    conn = sqlite3.connect(DB_PATH)
+    cursor = conn.cursor()
+
+    try:
+        # Check if column already exists
+        cursor.execute("PRAGMA table_info(nl43_status)")
+        columns = [col[1] for col in cursor.fetchall()]
+
+        if 'start_time_sync_attempted' in columns:
+            print("✓ start_time_sync_attempted column already exists, no migration needed")
+            return
+
+        # Add the column
+        print("Adding start_time_sync_attempted column...")
+        cursor.execute("""
+            ALTER TABLE nl43_status
+            ADD COLUMN start_time_sync_attempted BOOLEAN DEFAULT 0
+        """)
+        conn.commit()
+        print("✓ Added start_time_sync_attempted column")
+
+        # Verify
+        cursor.execute("PRAGMA table_info(nl43_status)")
+        columns = [col[1] for col in cursor.fetchall()]
+
+        if 'start_time_sync_attempted' not in columns:
+            raise RuntimeError("start_time_sync_attempted column was not added successfully")
+
+        print("✓ Migration completed successfully")
+
+    finally:
+        conn.close()
+
+
+if __name__ == "__main__":
+    migrate()
@@ -333,6 +333,134 @@

        html += `<p style="margin-top: 12px; font-size: 0.9em; color: #666;">Last run: ${new Date(data.timestamp).toLocaleString()}</p>`;

+        // Add database dump section if available
+        if (data.database_dump) {
+          html += `<div style="margin-top: 16px; border-top: 1px solid #d0d7de; padding-top: 12px;">`;
+          html += `<h4 style="margin: 0 0 12px 0;">📦 Database Dump</h4>`;
+
+          // Config section
+          if (data.database_dump.config) {
+            const cfg = data.database_dump.config;
+            html += `<div style="background: #f0f4f8; padding: 12px; border-radius: 4px; margin-bottom: 12px;">`;
+            html += `<strong>Configuration (nl43_config)</strong>`;
+            html += `<table style="width: 100%; margin-top: 8px; font-size: 0.9em;">`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Host</td><td>${cfg.host}:${cfg.tcp_port}</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">TCP Enabled</td><td>${cfg.tcp_enabled ? '✓' : '✗'}</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">FTP Enabled</td><td>${cfg.ftp_enabled ? '✓' : '✗'}${cfg.ftp_enabled ? ` (port ${cfg.ftp_port}, user: ${cfg.ftp_username || 'none'})` : ''}</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Background Polling</td><td>${cfg.poll_enabled ? `✓ every ${cfg.poll_interval_seconds}s` : '✗ disabled'}</td></tr>`;
+            html += `</table></div>`;
+          }
+
+          // Status cache section
+          if (data.database_dump.status_cache) {
+            const cache = data.database_dump.status_cache;
+            html += `<div style="background: #f0f8f4; padding: 12px; border-radius: 4px; margin-bottom: 12px;">`;
+            html += `<strong>Status Cache (nl43_status)</strong>`;
+            html += `<table style="width: 100%; margin-top: 8px; font-size: 0.9em;">`;
+
+            // Measurement state and timing
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Measurement State</td><td><strong>${cache.measurement_state || 'unknown'}</strong></td></tr>`;
+            if (cache.measurement_start_time) {
+              const startTime = new Date(cache.measurement_start_time);
+              const elapsed = Math.floor((Date.now() - startTime) / 1000);
+              const elapsedStr = elapsed >= 3600 ? `${Math.floor(elapsed/3600)}h ${Math.floor((elapsed%3600)/60)}m` : elapsed >= 60 ? `${Math.floor(elapsed/60)}m ${elapsed%60}s` : `${elapsed}s`;
+              html += `<tr><td style="padding: 2px 8px; color: #666;">Measurement Started</td><td>${startTime.toLocaleString()} (${elapsedStr} ago)</td></tr>`;
+            }
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Counter (d0)</td><td>${cache.counter || 'N/A'}</td></tr>`;
+
+            // Sound levels
+            html += `<tr><td colspan="2" style="padding: 8px 8px 2px 8px; font-weight: 600; border-top: 1px solid #d0d7de;">Sound Levels (dB)</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Lp (Instantaneous)</td><td>${cache.lp || 'N/A'}</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Leq (Equivalent)</td><td>${cache.leq || 'N/A'}</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Lmax / Lmin</td><td>${cache.lmax || 'N/A'} / ${cache.lmin || 'N/A'}</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Lpeak</td><td>${cache.lpeak || 'N/A'}</td></tr>`;
+
+            // Device status
+            html += `<tr><td colspan="2" style="padding: 8px 8px 2px 8px; font-weight: 600; border-top: 1px solid #d0d7de;">Device Status</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Battery</td><td>${cache.battery_level || 'N/A'}${cache.power_source ? ` (${cache.power_source})` : ''}</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">SD Card</td><td>${cache.sd_remaining_mb ? `${cache.sd_remaining_mb} MB` : 'N/A'}${cache.sd_free_ratio ? ` (${cache.sd_free_ratio} free)` : ''}</td></tr>`;
+
+            // Polling status
+            html += `<tr><td colspan="2" style="padding: 8px 8px 2px 8px; font-weight: 600; border-top: 1px solid #d0d7de;">Polling Status</td></tr>`;
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Reachable</td><td>${cache.is_reachable ? '🟢 Yes' : '🔴 No'}</td></tr>`;
+            if (cache.last_seen) {
+              html += `<tr><td style="padding: 2px 8px; color: #666;">Last Seen</td><td>${new Date(cache.last_seen).toLocaleString()}</td></tr>`;
+            }
+            if (cache.last_success) {
+              html += `<tr><td style="padding: 2px 8px; color: #666;">Last Success</td><td>${new Date(cache.last_success).toLocaleString()}</td></tr>`;
+            }
+            if (cache.last_poll_attempt) {
+              html += `<tr><td style="padding: 2px 8px; color: #666;">Last Poll Attempt</td><td>${new Date(cache.last_poll_attempt).toLocaleString()}</td></tr>`;
+            }
+            html += `<tr><td style="padding: 2px 8px; color: #666;">Consecutive Failures</td><td>${cache.consecutive_failures || 0}</td></tr>`;
+            // Escape device-supplied text before it is written into innerHTML
+            const esc = (v) => String(v).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
+            if (cache.last_error) {
+              html += `<tr><td style="padding: 2px 8px; color: #666;">Last Error</td><td style="color: #d00; font-size: 0.85em;">${esc(cache.last_error)}</td></tr>`;
+            }
+
+            html += `</table></div>`;
+
+            // Raw payload (collapsible)
+            if (cache.raw_payload) {
+              html += `<details style="margin-top: 8px;"><summary style="cursor: pointer; color: #666; font-size: 0.9em;">📄 Raw Payload</summary>`;
+              html += `<pre style="background: #f6f8fa; padding: 8px; border-radius: 4px; font-size: 0.8em; overflow-x: auto; margin-top: 8px;">${esc(cache.raw_payload)}</pre></details>`;
+            }
+          } else {
+            html += `<p style="color: #888; font-style: italic;">No cached status available for this unit.</p>`;
+          }
+
+          html += `</div>`;
+        }
+
+        // Fetch and display device logs
+        try {
+          const logsRes = await fetch(`/api/nl43/${unitId}/logs?limit=50`);
+          if (logsRes.ok) {
+            const logsData = await logsRes.json();
+            if (logsData.logs && logsData.logs.length > 0) {
+              html += `<div style="margin-top: 16px; border-top: 1px solid #d0d7de; padding-top: 12px;">`;
+              html += `<h4 style="margin: 0 0 12px 0;">📋 Device Logs (${logsData.stats.total} total)</h4>`;
+
+              // Stats summary
+              if (logsData.stats.by_level) {
+                html += `<div style="margin-bottom: 8px; font-size: 0.85em; color: #666;">`;
+                const levels = logsData.stats.by_level;
+                const parts = [];
+                if (levels.ERROR) parts.push(`<span style="color: #d00;">${levels.ERROR} errors</span>`);
+                if (levels.WARNING) parts.push(`<span style="color: #fa0;">${levels.WARNING} warnings</span>`);
+                if (levels.INFO) parts.push(`${levels.INFO} info`);
+                html += parts.join(' · ');
+                html += `</div>`;
+              }
+
+              // Log entries (collapsible)
+              html += `<details open><summary style="cursor: pointer; font-size: 0.9em; margin-bottom: 8px;">Recent entries (${logsData.logs.length})</summary>`;
+              html += `<div style="max-height: 300px; overflow-y: auto; background: #f6f8fa; border: 1px solid #d0d7de; border-radius: 4px; padding: 8px; font-size: 0.8em; font-family: monospace;">`;
+
+              logsData.logs.forEach(entry => {
+                const levelColor = {
+                  'ERROR': '#d00',
+                  'WARNING': '#b86e00',
+                  'INFO': '#0969da',
+                  'DEBUG': '#888'
+                }[entry.level] || '#666';
+
+                const time = new Date(entry.timestamp).toLocaleString();
+                // Escape the message so any markup in log text is shown literally
+                const message = String(entry.message).replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
+                html += `<div style="margin-bottom: 4px; border-bottom: 1px solid #eee; padding-bottom: 4px;">`;
+                html += `<span style="color: #888;">${time}</span> `;
+                html += `<span style="color: ${levelColor}; font-weight: 600;">[${entry.level}]</span> `;
+                html += `<span style="color: #666;">[${entry.category}]</span> `;
+                html += message;
+                html += `</div>`;
+              });
+
+              html += `</div></details>`;
+              html += `</div>`;
+            }
+          }
+        } catch (logErr) {
+          console.log('Could not fetch device logs:', logErr);
+        }
+
        resultsEl.innerHTML = html;
        log(`Diagnostics complete: ${data.overall_status}`);

@@ -3,7 +3,7 @@
 <head>
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-  <title>SLMM Roster - Sound Level Meter Configuration</title>
+  <title>SLMM - Device Roster &amp; Connections</title>
  <style>
    * { box-sizing: border-box; }
    body {
@@ -227,41 +227,165 @@
    }
    .toast-success { background: #2da44e; }
    .toast-error { background: #cf222e; }
+
+    /* Tabs */
+    .tabs {
+      display: flex;
+      gap: 0;
+      margin-bottom: 0;
+      border-bottom: 2px solid #d0d7de;
+    }
+    .tab-btn {
+      padding: 10px 20px;
+      border: none;
+      background: none;
+      cursor: pointer;
+      font-size: 14px;
+      font-weight: 600;
+      color: #57606a;
+      border-bottom: 2px solid transparent;
+      margin-bottom: -2px;
+      transition: color 0.2s, border-color 0.2s;
+    }
+    .tab-btn:hover { color: #24292f; }
+    .tab-btn.active {
+      color: #24292f;
+      border-bottom-color: #fd8c73;
+    }
+    .tab-panel { display: none; }
+    .tab-panel.active { display: block; }
+
+    /* Connection pool panel */
+    .pool-config {
+      display: grid;
+      grid-template-columns: repeat(auto-fill, minmax(180px, 1fr));
+      gap: 12px;
+      margin-bottom: 20px;
+    }
+    .pool-config-card {
+      background: #f6f8fa;
+      border: 1px solid #d0d7de;
+      border-radius: 6px;
+      padding: 12px;
+    }
+    .pool-config-card .label {
+      font-size: 11px;
+      color: #57606a;
+      text-transform: uppercase;
+      font-weight: 600;
+      margin-bottom: 4px;
+    }
+    .pool-config-card .value {
+      font-size: 18px;
+      font-weight: 600;
+      color: #24292f;
+    }
+    .conn-card {
+      background: white;
+      border: 1px solid #d0d7de;
+      border-radius: 6px;
+      padding: 16px;
+      margin-bottom: 12px;
+    }
+    .conn-card-header {
+      display: flex;
+      justify-content: space-between;
+      align-items: center;
+      margin-bottom: 12px;
+    }
+    .conn-card-header strong { font-size: 15px; }
+    .conn-card-grid {
+      display: grid;
+      grid-template-columns: repeat(auto-fill, minmax(140px, 1fr));
+      gap: 8px;
+    }
+    .conn-stat .label {
+      font-size: 11px;
+      color: #57606a;
+      text-transform: uppercase;
+      font-weight: 600;
+    }
+    .conn-stat .value {
+      font-size: 14px;
+      font-weight: 600;
+      color: #24292f;
+    }
+    .conn-empty {
+      text-align: center;
+      padding: 32px;
+      color: #57606a;
+    }
+    .pool-actions {
+      display: flex;
+      gap: 8px;
+      margin-bottom: 16px;
+    }
  </style>
 </head>
 <body>
  <div class="container">
    <div class="header">
-      <h1>📊 Sound Level Meter Roster</h1>
+      <h1>SLMM - Roster &amp; Connections</h1>
      <div class="nav">
-        <a href="/" class="btn">← Back to Control Panel</a>
+        <a href="/" class="btn">&larr; Back to Control Panel</a>
        <button class="btn btn-primary" onclick="openAddModal()">+ Add Device</button>
      </div>
    </div>

-    <div class="table-container">
-      <table id="rosterTable">
-        <thead>
-          <tr>
-            <th>Unit ID</th>
-            <th>Host / IP</th>
-            <th>TCP Port</th>
-            <th>FTP Port</th>
-            <th class="checkbox-cell">TCP</th>
-            <th class="checkbox-cell">FTP</th>
-            <th class="checkbox-cell">Polling</th>
-            <th>Status</th>
-            <th class="actions-cell">Actions</th>
-          </tr>
-        </thead>
-        <tbody id="rosterBody">
-          <tr>
-            <td colspan="9" style="text-align: center; padding: 24px;">
-              Loading...
-            </td>
-          </tr>
-        </tbody>
-      </table>
+    <div class="tabs">
+      <button class="tab-btn active" onclick="switchTab('roster')">Device Roster</button>
+      <button class="tab-btn" onclick="switchTab('connections')">Connections</button>
+    </div>
+
+    <!-- Roster Tab -->
+    <div id="tab-roster" class="tab-panel active">
+      <div class="table-container" style="border-top-left-radius: 0; border-top-right-radius: 0;">
+        <table id="rosterTable">
+          <thead>
+            <tr>
+              <th>Unit ID</th>
+              <th>Host / IP</th>
+              <th>TCP Port</th>
+              <th>FTP Port</th>
+              <th class="checkbox-cell">TCP</th>
+              <th class="checkbox-cell">FTP</th>
+              <th class="checkbox-cell">Polling</th>
+              <th>Status</th>
+              <th class="actions-cell">Actions</th>
+            </tr>
+          </thead>
+          <tbody id="rosterBody">
+            <tr>
+              <td colspan="9" style="text-align: center; padding: 24px;">
+                Loading...
+              </td>
+            </tr>
+          </tbody>
+        </table>
+      </div>
+    </div>
+
+    <!-- Connections Tab -->
+    <div id="tab-connections" class="tab-panel">
+      <div class="table-container" style="padding: 20px; border-top-left-radius: 0; border-top-right-radius: 0;">
+        <div class="pool-actions">
+          <button class="btn" onclick="loadConnections()">Refresh</button>
+          <button class="btn btn-danger" onclick="flushConnections()">Flush All Connections</button>
+        </div>
+
+        <h3 style="margin: 0 0 12px 0; font-size: 16px;">Pool Configuration</h3>
+        <div id="poolConfig" class="pool-config">
+          <div class="pool-config-card">
+            <div class="label">Status</div>
+            <div class="value" id="poolEnabled">--</div>
+          </div>
+        </div>
+
+        <h3 style="margin: 20px 0 12px 0; font-size: 16px;">Active Connections</h3>
+        <div id="connectionsList">
+          <div class="conn-empty">Loading...</div>
+        </div>
+      </div>
    </div>
  </div>

@@ -619,6 +743,159 @@
        closeModal();
      }
    });
+
+    // ========== Tab Switching ==========
+
+    function switchTab(tabName) {
+      document.querySelectorAll('.tab-btn').forEach(btn => btn.classList.remove('active'));
+      document.querySelectorAll('.tab-panel').forEach(panel => panel.classList.remove('active'));
+
+      document.querySelector(`.tab-btn[onclick="switchTab('${tabName}')"]`).classList.add('active');
+      document.getElementById(`tab-${tabName}`).classList.add('active');
+
+      if (tabName === 'connections') {
+        loadConnections();
+      }
+    }
+
+    // ========== Connection Pool ==========
+
+    let connectionsRefreshTimer = null;
+
+    async function loadConnections() {
+      try {
+        const res = await fetch('/api/nl43/_connections/status');
+        if (!res.ok) {
+          showToast('Failed to load connection pool status', 'error');
+          return;
+        }
+
+        const data = await res.json();
+
+        const pool = data.pool;
+        renderPoolConfig(pool);
+        renderConnections(pool.connections);
+
+        // Auto-refresh while tab is active
+        clearTimeout(connectionsRefreshTimer);
+        if (document.getElementById('tab-connections').classList.contains('active')) {
+          connectionsRefreshTimer = setTimeout(loadConnections, 5000);
+        }
+      } catch (err) {
+        showToast('Error loading connections: ' + err.message, 'error');
+        console.error('Load connections error:', err);
+      }
+    }
+
+    function renderPoolConfig(pool) {
+      document.getElementById('poolConfig').innerHTML = `
+        <div class="pool-config-card">
+          <div class="label">Persistent</div>
+          <div class="value" style="color: ${pool.enabled ? '#1a7f37' : '#cf222e'}">${pool.enabled ? 'Enabled' : 'Disabled'}</div>
+        </div>
+        <div class="pool-config-card">
+          <div class="label">Active</div>
+          <div class="value">${pool.active_connections}</div>
+        </div>
+        <div class="pool-config-card">
+          <div class="label">Idle TTL</div>
+          <div class="value">${pool.idle_ttl}s</div>
+        </div>
+        <div class="pool-config-card">
+          <div class="label">Max Age</div>
+          <div class="value">${pool.max_age}s</div>
+        </div>
+        <div class="pool-config-card">
+          <div class="label">KA Idle</div>
+          <div class="value">${pool.keepalive_idle}s</div>
+        </div>
+        <div class="pool-config-card">
+          <div class="label">KA Interval</div>
+          <div class="value">${pool.keepalive_interval}s</div>
+        </div>
+        <div class="pool-config-card">
+          <div class="label">KA Probes</div>
+          <div class="value">${pool.keepalive_count}</div>
+        </div>
+      `;
+    }
+
+    function renderConnections(connections) {
+      const container = document.getElementById('connectionsList');
+      const keys = Object.keys(connections);
+
+      if (keys.length === 0) {
+        container.innerHTML = `
+          <div class="conn-empty">
+            <div style="font-size: 32px; margin-bottom: 8px;">~</div>
+            <div><strong>No active connections</strong></div>
+            <div style="margin-top: 4px; font-size: 13px;">
+              Connections appear here when devices are actively being polled and the connection is cached between commands.
+            </div>
+          </div>
+        `;
+        return;
+      }
+
+      container.innerHTML = keys.map(key => {
+        const conn = connections[key];
+        const aliveColor = conn.alive ? '#1a7f37' : '#cf222e';
+        const aliveText = conn.alive ? 'Alive' : 'Stale';
+        return `
+          <div class="conn-card">
+            <div class="conn-card-header">
+              <strong>${escapeHtml(key)}</strong>
+              <span class="status-badge ${conn.alive ? 'status-ok' : 'status-error'}">${aliveText}</span>
+            </div>
+            <div class="conn-card-grid">
+              <div class="conn-stat">
+                <div class="label">Host</div>
+                <div class="value">${escapeHtml(conn.host)}</div>
+              </div>
+              <div class="conn-stat">
+                <div class="label">Port</div>
+                <div class="value">${conn.port}</div>
+              </div>
+              <div class="conn-stat">
+                <div class="label">Age</div>
+                <div class="value">${formatSeconds(conn.age_seconds)}</div>
+              </div>
+              <div class="conn-stat">
+                <div class="label">Idle</div>
+                <div class="value">${formatSeconds(conn.idle_seconds)}</div>
+              </div>
+            </div>
+          </div>
+        `;
+      }).join('');
+    }
+
+    function formatSeconds(s) {
+      if (s < 60) return Math.floor(s) + 's';
+      if (s < 3600) return Math.floor(s / 60) + 'm ' + Math.floor(s % 60) + 's';
+      return Math.floor(s / 3600) + 'h ' + Math.floor((s % 3600) / 60) + 'm';
+    }
+
+    async function flushConnections() {
+      if (!confirm('Close all cached TCP connections?\n\nDevices will reconnect on the next poll cycle.')) {
+        return;
+      }
+
+      try {
+        const res = await fetch('/api/nl43/_connections/flush', { method: 'POST' });
+        const data = await res.json().catch(() => ({}));  // tolerate empty or non-JSON error bodies
+
+        if (!res.ok) {
+          showToast(data.detail || 'Failed to flush connections', 'error');
+          return;
+        }
+
+        showToast('All connections flushed', 'success');
+        await loadConnections();
+      } catch (err) {
+        showToast('Error flushing connections: ' + err.message, 'error');
+      }
+    }
  </script>
 </body>
 </html>
@@ -0,0 +1,68 @@
+"""
+Synthetic unit test for the alert state machine — no DB, no device.
+
+Drives `_evaluate_step` with a fake clock + a level series and checks that
+onset/clear fire with the right debounce + hysteresis. Run:
+
+    docker compose exec -T slmm python3 test_alert_evaluator.py
+    # or, if app.alerts imports cleanly standalone:  python3 test_alert_evaluator.py
+"""
+
+from types import SimpleNamespace
+from app.alerts import RuleState, _evaluate_step
+
+
+def rule(**kw):
+    base = dict(threshold_db=85.0, duration_s=3, clear_margin_db=2.0, comparison="above")
+    base.update(kw)
+    return SimpleNamespace(**base)
+
+
+def run(series, r):
+    st = RuleState()
+    events = [(now, a) for value, now in series
+              if (a := _evaluate_step(st, value, now, r))]
+    return events, st
+
+
+def main():
+    failures = 0
+
+    def check(label, cond, detail=""):
+        nonlocal failures
+        print(("PASS" if cond else "FAIL"), label, detail)
+        if not cond:
+            failures += 1
+
+    # 1) sustained exceedance -> onset after duration; recovery -> clear after duration
+    r = rule(threshold_db=85, duration_s=3, clear_margin_db=2)
+    ev, _ = run([(80, 0), (86, 1), (87, 2), (88, 3), (88, 4),
+                 (88, 5), (82, 6), (82, 7), (82, 8), (82, 9)], r)
+    onsets = [t for t, a in ev if a == "onset"]
+    clears = [t for t, a in ev if a == "clear"]
+    check("1 sustained onset@4 / clear@9", onsets == [4] and clears == [9], str(ev))
+
+    # 2) brief spike under duration -> no onset (debounce)
+    ev, _ = run([(80, 0), (90, 1), (90, 2), (80, 3), (80, 4)], rule(duration_s=3))
+    check("2 brief spike debounced", ev == [], str(ev))
+
+    # 3) hysteresis: a dip into the margin (below threshold, above threshold-margin)
+    #    does NOT clear
+    r = rule(threshold_db=85, duration_s=0, clear_margin_db=3)
+    ev, st = run([(86, 0), (84, 1), (84, 2), (84, 3)], r)
+    check("3 hysteresis holds ACTIVE", ev == [(0, "onset")] and st.phase == "active",
+          f"{ev} phase={st.phase}")
+
+    # 4) 'below' comparison (device too quiet) -> onset when value < threshold
+    ev, _ = run([(30, 0), (15, 1)], rule(threshold_db=20, duration_s=0,
+                                         clear_margin_db=2, comparison="below"))
+    check("4 below-comparison onset@1", ev == [(1, "onset")], str(ev))
+
+    print()
+    print("ALL PASS" if failures == 0 else f"{failures} FAILURE(S)")
+    return failures
+
+
+if __name__ == "__main__":
+    import sys
+    sys.exit(1 if main() else 0)
@@ -1,128 +0,0 @@
-#!/usr/bin/env python3
-"""
-Test script to verify that sleep mode is automatically disabled when:
-1. Device configuration is created/updated with TCP enabled
-2. Measurements are started
-
-This script tests the API endpoints, not the actual device communication.
-"""
-
-import requests
-import json
-
-BASE_URL = "http://localhost:8100/api/nl43"
-UNIT_ID = "test-nl43-001"
-
-def test_config_update():
-    """Test that config update works (actual sleep mode disable requires real device)"""
-    print("\n=== Testing Config Update ===")
-
-    # Create/update a device config
-    config_data = {
-        "host": "192.168.1.100",
-        "tcp_port": 2255,
-        "tcp_enabled": True,
-        "ftp_enabled": False,
-        "ftp_username": "admin",
-        "ftp_password": "password"
-    }
-
-    print(f"Updating config for {UNIT_ID}...")
-    response = requests.put(f"{BASE_URL}/{UNIT_ID}/config", json=config_data)
-
-    if response.status_code == 200:
-        print("✓ Config updated successfully")
-        print(f"Response: {json.dumps(response.json(), indent=2)}")
-        print("\nNote: Sleep mode disable was attempted (will succeed if device is reachable)")
-        return True
-    else:
-        print(f"✗ Config update failed: {response.status_code}")
-        print(f"Error: {response.text}")
-        return False
-
-def test_get_config():
-    """Test retrieving the config"""
-    print("\n=== Testing Get Config ===")
-
-    response = requests.get(f"{BASE_URL}/{UNIT_ID}/config")
-
-    if response.status_code == 200:
-        print("✓ Config retrieved successfully")
-        print(f"Response: {json.dumps(response.json(), indent=2)}")
-        return True
-    elif response.status_code == 404:
-        print("✗ Config not found (create one first)")
-        return False
-    else:
-        print(f"✗ Request failed: {response.status_code}")
-        print(f"Error: {response.text}")
-        return False
-
-def test_start_measurement():
-    """Test that start measurement attempts to disable sleep mode"""
-    print("\n=== Testing Start Measurement ===")
-
-    print(f"Attempting to start measurement on {UNIT_ID}...")
-    response = requests.post(f"{BASE_URL}/{UNIT_ID}/start")
-
-    if response.status_code == 200:
-        print("✓ Start command accepted")
-        print(f"Response: {json.dumps(response.json(), indent=2)}")
-        print("\nNote: Sleep mode was disabled before starting measurement")
-        return True
-    elif response.status_code == 404:
-        print("✗ Device config not found (create config first)")
-        return False
-    elif response.status_code == 502:
-        print("✗ Device not reachable (expected if no physical device)")
-        print(f"Response: {response.text}")
-        print("\nNote: This is expected behavior when testing without a physical device")
-        return True  # This is actually success - the endpoint tried to communicate
-    else:
-        print(f"✗ Request failed: {response.status_code}")
-        print(f"Error: {response.text}")
-        return False
-
-def main():
-    print("=" * 60)
-    print("Sleep Mode Auto-Disable Test")
-    print("=" * 60)
-    print("\nThis test verifies that sleep mode is automatically disabled")
-    print("when device configs are updated or measurements are started.")
-    print("\nNote: Without a physical device, some operations will fail at")
-    print("the device communication level, but the API logic will execute.")
-
-    # Run tests
-    results = []
-
-    # Test 1: Update config (should attempt to disable sleep mode)
-    results.append(("Config Update", test_config_update()))
-
-    # Test 2: Get config
-    results.append(("Get Config", test_get_config()))
-
-    # Test 3: Start measurement (should attempt to disable sleep mode)
-    results.append(("Start Measurement", test_start_measurement()))
-
-    # Summary
-    print("\n" + "=" * 60)
-    print("Test Summary")
-    print("=" * 60)
-
-    for test_name, result in results:
-        status = "✓ PASS" if result else "✗ FAIL"
-        print(f"{status}: {test_name}")
-
-    print("\n" + "=" * 60)
-    print("Implementation Details:")
-    print("=" * 60)
-    print("1. Config endpoint is now async and calls ensure_sleep_mode_disabled()")
-    print("   when TCP is enabled")
-    print("2. Start measurement endpoint calls ensure_sleep_mode_disabled()")
-    print("   before starting the measurement")
-    print("3. Sleep mode check is non-blocking - config/start will succeed")
-    print("   even if the device is unreachable")
-    print("=" * 60)
-
-if __name__ == "__main__":
-    main()