Compare commits

17 Commits

- `450509d210` stop tracking dev runtime data (2026-03-12 22:46:37 +00:00)
- `fefa9eace8` chore: gitignore clean up (2026-03-12 21:34:14 +00:00)
- `98a8d357e5` chore: data-dev folder added to gitignore (2026-03-12 21:33:43 +00:00)
- `0a7422eceb` Merge branch 'dev-persistent' of ssh://10.0.0.2:2222/serversdown/slmm into dev-persistent (serversdwn, 2026-03-12 20:26:56 +00:00)
- `996b993cb9` chore: gitignore dev data (serversdwn, 2026-03-12 20:26:53 +00:00)
- `01337696b3` feat: add connection pool status logging every 15 minutes (serversdwn, 2026-02-19 15:09:50 +00:00)
- `a302fd15d4` fix: change debug logs to info level for connection pool events (serversdwn, 2026-02-19 06:04:34 +00:00)
- `af5ecc1a92` fix: improve connection pool idle and max age checks to allow disabling (serversdwn, 2026-02-19 01:25:01 +00:00)
- `b62e84f8b3` v0.3.0, persistent polling update. (serversdwn, 2026-02-17 02:56:11 +00:00)
- `a5f8d1b2c7` Persistent polling interval increased. Healthcheck now uses poll instead of separate handshakes. (serversdwn, 2026-02-17 02:41:09 +00:00)
- `a1a80bbb4d` add: new persisent connection approach, env variables for tcp keepalive and persist, added connection pool class. (serversdwn, 2026-02-16 04:25:51 +00:00)
- `005e0091fe` fix: delay added to ensure tcp commands dont talk over eachother (serversdwn, 2026-02-16 02:42:41 +00:00)
- `e6ac80df6c` chore: add pcap files to gitignore (serversdwn, 2026-02-10 21:12:19 +00:00)
- `7070b948a8` add: stress test script for diagnosing TCP connection issues. chore: clean up .gitignore (serversdwn, 2026-02-10 07:07:34 +00:00)
- `3b6e9ad3f0` fix: time added to FTP enable step to prevent commands getting messed up (serversdwn, 2026-02-06 17:37:10 +00:00)
- `eb0cbcc077` fix: 24hr restart schedule enchanced. (serversdwn, 2026-01-31 05:15:00 +00:00)
  - Step 0: Pause polling
  - Step 1: Stop measurement → wait 10s
  - Step 2: Disable FTP → wait 10s
  - Step 3: Enable FTP → wait 10s
  - Step 4: Download data
  - Step 5: Wait 30s for device to settle
  - Step 6: Start new measurement
  - Step 7: Re-enable polling
- `cc0a5bdf84` chore cleanup (serversdwn, 2026-01-29 22:44:20 +00:00)
11 changed files with 2848 additions and 320 deletions

.gitignore (vendored)

```diff
@@ -1,5 +1,8 @@
 /manuals/
 /data/
+/data-dev/
+/SLM-stress-test/stress_test_logs/
+/SLM-stress-test/tcpdump-runs/
 # Python cache
 __pycache__/
@@ -12,3 +15,5 @@ __pycache__/
 *.egg-info/
 dist/
 build/
+*.pcap
```

CHANGELOG.md

```diff
@@ -5,6 +5,59 @@ All notable changes to SLMM (Sound Level Meter Manager) will be documented in th
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
+
+## [0.3.0] - 2026-02-17
+
+### Added
+
+#### Persistent TCP Connection Pool
+- **Connection reuse** - TCP connections are cached per device and reused across commands, eliminating repeated TCP handshakes over cellular modems
+- **OS-level TCP keepalive** - Configurable keepalive probes keep cellular NAT tables alive and detect dead connections early (default: probe after 15s idle, every 10s, 3 failures = dead)
+- **Transparent retry** - If a cached connection goes stale, the system automatically retries with a fresh connection so failures are never visible to the caller
+- **Stale connection detection** - Multi-layer detection via idle TTL, max age, transport state, and reader EOF checks
+- **Background cleanup** - Periodic task (every 30s) evicts expired connections from the pool
+- **Master switch** - Set `TCP_PERSISTENT_ENABLED=false` to revert to per-request connection behavior
+
+#### Connection Pool Diagnostics
+- `GET /api/nl43/_connections/status` - View pool configuration, active connections, age/idle times, and keepalive settings
+- `POST /api/nl43/_connections/flush` - Force-close all cached connections (useful for debugging)
+- **Connections tab on roster page** - Live UI showing pool config, active connections with age/idle/alive status, auto-refreshes every 5s, and flush button
+
+#### Environment Variables
+- `TCP_PERSISTENT_ENABLED` (default: `true`) - Master switch for persistent connections
+- `TCP_IDLE_TTL` (default: `300`) - Close idle connections after N seconds
+- `TCP_MAX_AGE` (default: `1800`) - Force reconnect after N seconds
+- `TCP_KEEPALIVE_IDLE` (default: `15`) - Seconds idle before keepalive probes start
+- `TCP_KEEPALIVE_INTERVAL` (default: `10`) - Seconds between keepalive probes
+- `TCP_KEEPALIVE_COUNT` (default: `3`) - Failed probes before declaring connection dead
+
+### Changed
+- **Health check endpoint** (`/health/devices`) - Now uses connection pool instead of opening throwaway TCP connections; checks for existing live connections first (zero-cost), only opens new connection through pool if needed
+- **Diagnostics endpoint** - Removed separate port 443 modem check (extra handshake waste); TCP reachability test now uses connection pool
+- **DRD streaming** - Streaming connections now get TCP keepalive options set; cached connections are evicted before opening dedicated streaming socket
+- **Default timeouts tuned for cellular** - Idle TTL raised to 300s (5 min), max age raised to 1800s (30 min) to survive typical polling intervals over cellular links
+
+### Technical Details
+
+#### Architecture
+- `ConnectionPool` class in `services.py` manages a single cached connection per device key (NL-43 only supports one TCP connection at a time)
+- Uses existing per-device asyncio locks and rate limiting — no changes to concurrency model
+- Pool is a module-level singleton initialized from environment variables at import time
+- Lifecycle managed via FastAPI lifespan: cleanup task starts on startup, all connections closed on shutdown
+- `_send_command_unlocked()` refactored to use acquire/release/discard pattern with single-retry fallback
+- Command parsing extracted to `_execute_command()` method for reuse between primary and retry paths
+
+#### Cellular Modem Optimizations
+- Keepalive probes at 15s prevent cellular NAT tables from expiring (typically 30-60s timeout)
+- 300s idle TTL ensures connections survive between polling cycles (default 60s interval)
+- 1800s max age allows a single socket to serve ~30 minutes of polling before forced reconnect
+- Health checks and diagnostics produce zero additional TCP handshakes when a pooled connection exists
+- Stale `$` prompt bytes drained from idle connections before command reuse
+
+### Breaking Changes
+None. This release is fully backward-compatible with v0.2.x. Set `TCP_PERSISTENT_ENABLED=false` for identical behavior to previous versions.
+
+---
+
 ## [0.2.1] - 2026-01-23
 ### Added
@@ -146,6 +199,7 @@ None. This release is fully backward-compatible with v0.1.x. All existing endpoi
 ## Version History Summary
+- **v0.3.0** (2026-02-17) - Persistent TCP connections with keepalive for cellular modem reliability
 - **v0.2.1** (2026-01-23) - Roster management, scheduler hooks, FTP logging, doc cleanup
 - **v0.2.0** (2026-01-15) - Background Polling System
 - **v0.1.0** (2025-12-XX) - Initial Release
```
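The "transparent retry" and acquire/release/discard pattern described in the changelog can be sketched as follows. This is a simplified illustration: the `acquire`/`discard`/`execute` callables are stand-ins for the pool's real methods, not SLMM's actual API.

```python
import asyncio

async def send_with_retry(acquire, discard, execute, command):
    """Sketch of the single-retry fallback: run the command on a (possibly
    cached) connection; if it turns out to be stale, discard it and retry
    exactly once on a fresh connection. A failure on the retry propagates
    to the caller."""
    conn = await acquire()
    try:
        return await execute(conn, command)
    except (ConnectionError, asyncio.IncompleteReadError):
        await discard(conn)        # drop the stale cached connection
        conn = await acquire()     # pool opens a fresh socket this time
        return await execute(conn, command)
```

The point of the pattern is that callers never see a stale-connection failure: the first error is absorbed, and only a second consecutive failure surfaces.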

README.md

````diff
@@ -1,6 +1,6 @@
 # SLMM - Sound Level Meter Manager
-**Version 0.2.1**
+**Version 0.3.0**
 Backend API service for controlling and monitoring Rion NL-43/NL-53 Sound Level Meters via TCP and FTP protocols.
@@ -12,8 +12,9 @@ SLMM is a standalone backend module that provides REST API routing and command t
 ## Features
-- **Background Polling** ⭐ NEW: Continuous automatic polling of devices with configurable intervals
-- **Offline Detection** ⭐ NEW: Automatic device reachability tracking with failure counters
+- **Persistent TCP Connections**: Cached per-device connections with OS-level keepalive, tuned for cellular modem reliability
+- **Background Polling**: Continuous automatic polling of devices with configurable intervals
+- **Offline Detection**: Automatic device reachability tracking with failure counters
 - **Device Management**: Configure and manage multiple NL43/NL53 devices
 - **Real-time Monitoring**: Stream live measurement data via WebSocket
 - **Measurement Control**: Start, stop, pause, resume, and reset measurements
@@ -22,6 +23,7 @@ SLMM is a standalone backend module that provides REST API routing and command t
 - **Device Configuration**: Manage frequency/time weighting, clock sync, and more
 - **Rate Limiting**: Automatic 1-second delay enforcement between device commands
 - **Persistent Storage**: SQLite database for device configs and measurement cache
+- **Connection Diagnostics**: Live UI and API endpoints for monitoring TCP connection pool status
 ## Architecture
@@ -29,29 +31,39 @@ SLMM is a standalone backend module that provides REST API routing and command t
 ```
 ┌─────────────────┐          ┌──────────────────────────────┐          ┌─────────────────┐
 │                 │◄───────►│  SLMM API                    │◄───────►│  NL43/NL53      │
 │   (Frontend)    │   HTTP   │  • REST Endpoints            │   TCP    │  Sound Meters   │
 └─────────────────┘          │  • WebSocket Streaming       │  (kept   │  (via cellular  │
                              │  • Background Poller         │  alive)  │   modem)        │
+                             │  • Connection Pool (v0.3)    │          └─────────────────┘
                              └──────────────────────────────┘
                                      │
                                      ▼
                              ┌──────────────┐
                              │  SQLite DB   │
                              │  • Config    │
                              │  • Status    │
                              └──────────────┘
 ```
+### Persistent TCP Connection Pool (v0.3.0)
+SLMM maintains persistent TCP connections to devices with OS-level keepalive, designed for reliable operation over cellular modems:
+- **Connection Reuse**: One cached TCP socket per device, reused across all commands (no repeated handshakes)
+- **TCP Keepalive**: Probes keep cellular NAT tables alive and detect dead connections early
+- **Transparent Retry**: Stale cached connections automatically retry with a fresh socket
+- **Configurable**: Idle TTL (300s), max age (1800s), and keepalive timing via environment variables
+- **Diagnostics**: Live UI on the roster page and API endpoints for monitoring pool status
 ### Background Polling (v0.2.0)
-SLMM now includes a background polling service that continuously queries devices and updates the status cache:
+Background polling service continuously queries devices and updates the status cache:
 - **Automatic Updates**: Devices are polled at configurable intervals (10-3600 seconds)
 - **Offline Detection**: Devices marked unreachable after 3 consecutive failures
 - **Per-Device Configuration**: Each device can have a custom polling interval
 - **Resource Efficient**: Dynamic sleep intervals and smart scheduling
+- **Graceful Shutdown**: Background task stops cleanly on service shutdown
-This makes Terra-View significantly more responsive - status requests return cached data instantly (<100ms) instead of waiting for device queries (1-2 seconds).
+Status requests return cached data instantly (<100ms) instead of waiting for device queries (1-2 seconds).
 ## Quick Start
@@ -96,9 +108,18 @@ Once running, visit:
 ### Environment Variables
+**Server:**
 - `PORT`: Server port (default: 8100)
 - `CORS_ORIGINS`: Comma-separated list of allowed origins (default: "*")
+**TCP Connection Pool:**
+- `TCP_PERSISTENT_ENABLED`: Enable persistent connections (default: "true")
+- `TCP_IDLE_TTL`: Close idle connections after N seconds (default: 300)
+- `TCP_MAX_AGE`: Force reconnect after N seconds (default: 1800)
+- `TCP_KEEPALIVE_IDLE`: Seconds idle before keepalive probes (default: 15)
+- `TCP_KEEPALIVE_INTERVAL`: Seconds between keepalive probes (default: 10)
+- `TCP_KEEPALIVE_COUNT`: Failed probes before declaring dead (default: 3)
 ### Database
 The SQLite database is automatically created at [data/slmm.db](data/slmm.db) on first run.
@@ -126,7 +147,7 @@ Logs are written to:
 | GET | `/api/nl43/{unit_id}/live` | Request fresh DOD data from device (bypasses cache) |
 | WS | `/api/nl43/{unit_id}/stream` | WebSocket stream for real-time DRD data |
-### Background Polling Configuration ⭐ NEW
+### Background Polling
 | Method | Endpoint | Description |
 |--------|----------|-------------|
@@ -134,6 +155,13 @@ Logs are written to:
 | PUT | `/api/nl43/{unit_id}/polling/config` | Update polling interval and enable/disable polling |
 | GET | `/api/nl43/_polling/status` | Get global polling status for all devices |
+### Connection Pool
+| Method | Endpoint | Description |
+|--------|----------|-------------|
+| GET | `/api/nl43/_connections/status` | Get pool config, active connections, age/idle times |
+| POST | `/api/nl43/_connections/flush` | Force-close all cached TCP connections |
 ### Measurement Control
 | Method | Endpoint | Description |
@@ -255,6 +283,9 @@ Caches latest measurement snapshot:
 ### TCP Communication
 - Uses ASCII command protocol over TCP
+- Persistent connections with OS-level keepalive (tuned for cellular modems)
+- Connections cached per device and reused across commands
+- Transparent retry on stale connections
 - Enforces ≥1 second delay between commands to same device
 - Two-line response format:
   - Line 1: Result code (R+0000 for success)
@@ -320,6 +351,16 @@ curl http://localhost:8100/api/nl43/meter-001/polling/config
 curl http://localhost:8100/api/nl43/_polling/status
 ```
+### Check Connection Pool Status
+```bash
+curl http://localhost:8100/api/nl43/_connections/status | jq '.'
+```
+### Flush All Cached Connections
+```bash
+curl -X POST http://localhost:8100/api/nl43/_connections/flush
+```
 ### Verify Device Settings
 ```bash
 curl http://localhost:8100/api/nl43/meter-001/settings
@@ -388,11 +429,19 @@ See [API.md](API.md) for detailed integration examples.
 ## Troubleshooting
 ### Connection Issues
+- Check connection pool status: `curl http://localhost:8100/api/nl43/_connections/status`
+- Flush stale connections: `curl -X POST http://localhost:8100/api/nl43/_connections/flush`
 - Verify device IP address and port in configuration
 - Ensure device is on the same network
 - Check firewall rules allow TCP/FTP connections
 - Verify RX55 network adapter is properly configured on device
+### Cellular Modem Issues
+- If modem wedges from too many handshakes, ensure `TCP_PERSISTENT_ENABLED=true` (default)
+- Increase `TCP_IDLE_TTL` if connections expire between poll cycles
+- Keepalive probes (default: every 15s) keep NAT tables alive — adjust `TCP_KEEPALIVE_IDLE` if needed
+- Set `TCP_PERSISTENT_ENABLED=false` to disable pooling for debugging
 ### Rate Limiting
 - API automatically enforces 1-second delay between commands
 - If experiencing delays, this is normal device behavior
````
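For scripts or sidecar tooling that need to mirror the pool settings, the documented environment variables and defaults can be loaded with a small helper. This is a sketch: `PoolConfig` is a hypothetical class for illustration, not part of SLMM.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class PoolConfig:
    """Hypothetical mirror of SLMM's documented TCP_* env vars and defaults."""
    enabled: bool = True          # TCP_PERSISTENT_ENABLED
    idle_ttl: int = 300           # TCP_IDLE_TTL
    max_age: int = 1800           # TCP_MAX_AGE
    keepalive_idle: int = 15      # TCP_KEEPALIVE_IDLE
    keepalive_interval: int = 10  # TCP_KEEPALIVE_INTERVAL
    keepalive_count: int = 3      # TCP_KEEPALIVE_COUNT

    @classmethod
    def from_env(cls, env=os.environ) -> "PoolConfig":
        # Any mapping works here, which makes the loader easy to unit-test.
        return cls(
            enabled=env.get("TCP_PERSISTENT_ENABLED", "true").lower() != "false",
            idle_ttl=int(env.get("TCP_IDLE_TTL", 300)),
            max_age=int(env.get("TCP_MAX_AGE", 1800)),
            keepalive_idle=int(env.get("TCP_KEEPALIVE_IDLE", 15)),
            keepalive_interval=int(env.get("TCP_KEEPALIVE_INTERVAL", 10)),
            keepalive_count=int(env.get("TCP_KEEPALIVE_COUNT", 3)),
        )
```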

New file (wedge investigation notes):

@@ -0,0 +1,403 @@
# NL-43 + RX55 TCP “Wedge” Investigation (2255 Refusal) — Full Log & Next Steps
**Last updated:** 2026-02-18
**Owner:** Brian / serversdown
**Context:** Terra-View / SLMM / field-deployed Rion NL-43 behind Sierra Wireless RX55
---
## 0) What this document is
This is a **comprehensive, chronological** record of the debugging we did to isolate a failure where the **NL-43's TCP control port (2255) eventually stops accepting connections** (“wedges”), while other services (notably FTP/21) remain reachable.
This is written to be fed back into future troubleshooting, so it intentionally includes the **full reasoning chain, experiments, commands, packet evidence, and conclusions**.
---
## 1) Architecture (as tested)
### Network path
- **Server (SLMM host):** `10.0.0.40`
- **RX55 WAN IP:** `63.45.161.30`
- **RX55 LAN subnet:** `192.168.1.0/24`
- **RX55 LAN gateway:** `192.168.1.1`
- **NL-43 LAN IP:** `192.168.1.10` (confirmed via ARP OUI + ping; see LAN validation)
### RX55 details
- **Sierra Wireless RX55**
- **OS:** 5.2
- **Firmware:** `01.14.24.00`
- **Carrier:** Verizon LTE (Band 66)
### Port forwarding rules (RX55)
- **WAN:2255 → NL-43:2255** (NL-43 TCP control)
- **WAN:21 → NL-43:21** (NL-43 FTP control)
You also experimented with additional forwards:
- **WAN:2253 → NL-43:2255** (test)
- **WAN:2253 → NL-43:2253** (test)
- **WAN:4450 → NL-43:4450** (test)
**Important:** Rule “Input zone / interface” was set to **WAN-NAT**, and Source IP left as **Any IPv4**. This is correct for inbound port-forward behavior on Sierra OS 5.x.
---
## 2) Original problem statement (the “wedge”)
After running for hours, the NL-43 becomes unreachable over TCP control.
### Symptom signature (WAN-side)
- Client attempts to connect to `63.45.161.30:2255`
- Instead of timing out, the client gets **connection refused** quickly.
- Packet-level: SYN from client → **RST,ACK** back (meaning active refusal vs silent drop)
### Critical operational behavior
- **Power cycling the NL-43 fixes it.**
- **Power cycling the RX55 does NOT fix it.**
- FTP sometimes remains available even while TCP control (2255) is dead.
This combination is what forced us to determine whether:
- The RX55 is rejecting connections, OR
- The NL-43 is no longer listening on 2255, OR
- Something about the RX55 path triggers the NL-43's control listener to die.
---
## 3) Event timeline evidence (SLMM logs)
A concrete wedge window was observed on **2026-02-18**:
- 10:55:46 AM — Poll success (Start)
- 11:00:28 AM — Measurement STOPPED (scheduled stop/download cycle succeeded)
- 11:55:50 AM — Poll success (Stop)
- 12:55:55 PM — Poll success (Stop)
- **1:55:58 PM — Poll failed (attempt 1/3): Errno 111 (connection refused)**
- 2:56:02 PM — Poll failed (attempt 2/3): Errno 111 (connection refused)
Key interpretation:
- The wedge occurred sometime between **12:55 and 1:55**.
- The failure type is **refused**, not timeout.
---
## 4) Early hypotheses (before proof)
We considered two main buckets:
### A) NL-43-side failure (most suspicious)
- NL-43 TCP control service crashes / exits / unbinds from 2255
- socket leak / accept backlog exhaustion
- “single control session allowed” and it gets stuck thinking a session is active
- mode/service manager bug (service restart fails after other activities)
- firmware bug in TCP daemon
### B) RX55-side failure (possible trigger / less likely once FTP works)
- NAT/forwarding table corruption
- firewall behavior
- helper/ALG interference
- MSS/MTU weirdness causing edge-case behavior
- session churn behavior causing downstream issues
---
## 5) Key experiments and what they proved
### 5.1) LAN-only stability test (No RX55 path)
**Test:** NL-43 tested directly on LAN (no modem path involved).
- Ran **24+ hours**
- Scheduler start/stop cycles worked
- Stress test: **500 commands @ 1/sec** → no failure
- Response time trend decreased (not degrading)
**Result:** The NL-43 appears stable in a “pure LAN” environment.
**Interpretation:** The trigger is likely related to the RX55/WAN environment, connection patterns, or service switching patterns—not just simple uptime.
---
### 5.2) Port-forward behavior: timeout vs refused (RX55 behavior characterization)
You observed:
- **If a WAN port is NOT forwarded (no rule):** connecting to that port **times out** (silent drop)
- **If a WAN port IS forwarded to NL-43 but nothing listens:** it **actively refuses** (RST)
Concrete example:
- Port **4450** with no rule → timeout
- Port **4450 → NL-43:4450** rule created → connection refused
**Interpretation:** This confirms the RX55 is actually forwarding packets to the NL-43 when a rule exists. “Refused” is consistent with the NL-43 (or RX55 relay behavior) responding quickly because the packet reached the target.
Important nuance:
- A “refused” on forwarded ports does **not** automatically prove the NL-43 is the one generating RST, because NAT hides the inside host and the RX55 could reject on behalf of an unreachable target. We needed a LAN-side proof test to close the loop.
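The refused-vs-timeout distinction above is easy to reproduce from any client with a few lines of Python; a minimal sketch using only the standard `socket` module:

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify a TCP port the way section 5.2 does:
    'open'    -> full handshake completed (listener present)
    'refused' -> SYN answered with RST (packet reached a host, no listener)
    'timeout' -> SYN silently dropped (no forwarding rule / filtered)"""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "open"
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "timeout"
    finally:
        s.close()
```

Run against `63.45.161.30` this would distinguish the three WAN behaviors observed (forwarded-and-listening, forwarded-but-dead, and no rule) in one pass.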
---
### 5.3) UDP test confusion (and resolution)
You ran:
```bash
nc -vzu 63.45.161.30 2255
nc -vz 63.45.161.30 2255
```
Observed:
- UDP: “succeeded”
- TCP: “connection refused”
Resolution:
- UDP has **no handshake**. netcat prints “succeeded” if it doesn't immediately receive an ICMP unreachable. It does **not** mean a UDP service exists.
- TCP refused is meaningful: a RST implies “no listener” or “actively rejected.”
**Net effect:** UDP test did not change the diagnosis.
---
### 5.4) Packet capture proof (WAN-side)
You captured a Wireshark/tcpdump summary with these key patterns:
#### Port 2255 (TCP control)
Example:
- `10.0.0.40 → 63.45.161.30:2255` SYN
- `63.45.161.30 → 10.0.0.40` **RST, ACK** within ~50ms
This happened repeatedly.
#### Port 2253 (test port)
Multiple SYN attempts to 2253 showed **retransmissions and no response**, i.e., **silent drop** (consistent with no rule or not forwarded at that moment).
#### Port 21 (FTP)
Clean 3-way handshake:
- SYN → SYN/ACK → ACK
Then:
- FTP server banner: `220 Connection Ready`
Then:
- `530 Not logged in` (because SLMM was sending non-FTP “requests” as an experiment)
Session closes cleanly.
**Key takeaway from capture:**
- TCP transport to NL-43 via RX55 is definitely working (port 21 proves it).
- Port 2255 is being actively refused.
This strongly suggested “2255 listener is gone,” but still didn't fully prove whether the refusal was generated internally by the NL-43 or by the RX55 on behalf of the NL-43.
---
## 6) The decisive experiment: LAN-side test while wedged (final proof)
Because the RX55 does not offer SSH, the plan was to test from **inside the LAN behind the RX55**.
### 6.1) Physical LAN tap setup
Constraint:
- NL-43 has only one Ethernet port.
Solution:
- Insert an unmanaged switch:
- RX55 LAN → switch
- NL-43 → switch
- Windows 10 laptop → switch
This creates a shared L2 segment where the laptop can test NL-43 directly.
### 6.2) Windows LAN validation
On the Windows laptop:
- `ipconfig` showed:
- IP: `192.168.1.100`
- Gateway: `192.168.1.1` (RX55)
- Initial `arp -a` only showed RX55, not NL-43.
You then:
- pinged likely host addresses and discovered NL-43 responds on **192.168.1.10**
- `arp -a` then showed:
- `192.168.1.10 → 00-10-50-14-0a-d8`
- OUI `00-10-50` recognized as **Rion** (matches NL-43)
So LAN identities were confirmed:
- RX55: `192.168.1.1`
- NL-43: `192.168.1.10`
### 6.3) The LAN port tests (the smoking gun)
From Windows:
```powershell
Test-NetConnection -ComputerName 192.168.1.10 -Port 2255
Test-NetConnection -ComputerName 192.168.1.10 -Port 21
```
Results (while the unit was “wedged” from the WAN perspective):
- **2255:** `TcpTestSucceeded : False`
- **21:** `TcpTestSucceeded : True`
**Conclusion (PROVEN):**
- The NL-43 is reachable on the LAN
- FTP port 21 is alive
- **The NL-43 is NOT listening on TCP port 2255**
- Therefore the RX55 is not the root cause of the refusal. The WAN refusal is consistent with the NL-43 having no listener on 2255.
This is now settled.
---
## 7) What we learned (final conclusions)
### 7.1) RX55 innocence (for this failure mode)
The RX55 is not “randomly rejecting” or “breaking TCP” in the way originally feared.
It successfully forwards and supports TCP to the NL-43 on port 21, and the LAN-side test proves the 2255 failure exists *even without NAT/WAN involvement*.
### 7.2) NL-43 control listener failure
The NL-43's TCP control service (port 2255) stops listening while:
- the device remains alive
- the LAN stack remains alive (ping)
- FTP remains alive (port 21)
This looks like one of:
- control daemon crash/exit
- service unbind
- stuck service state (e.g., “busy” / “session active forever”)
- resource leak (sockets/file descriptors) specific to the control service
- firmware service manager bug (start/stop of services fails after certain sequences)
---
## 8) Additional constraint discovered: “Web App mode” conflicts
You noted an important operational constraint:
> Turning on the web app disables other interfaces like TCP and FTP.
Meaning the NL-43 appears to have mutually exclusive service/mode behavior (or at least serious conflicts). That matters because:
- If any workflow toggles modes (explicitly or implicitly), it could destabilize the service lifecycle.
- It reduces the possibility of using “web UI toggle” as an easy remote recovery mechanism **if** it disables the services needed.
We have not yet run a controlled long test to determine whether:
- mode switching contributes directly to the 2255 listener dying, OR
- it happens even in a pure TCP-only mode with no switching.
---
## 9) Immediate operational decision (field tomorrow)
Because the device is needed in the field immediately, you chose:
- **Old-school manual deployment**
- **Manual SD card downloads**
- Avoid reliance on 2255/TCP control and remote workflows for now.
**Important operational note:**
The 2255 listener dying does not necessarily stop the NL-43 from measuring; it primarily breaks remote control/polling. Manual SD workflow sidesteps the entire remote control dependency.
---
## 10) What's next (future work — when the unit is back)
Because long tests can't be run before tomorrow, the plan is to resume in a few weeks with controlled experiments designed to isolate the trigger and develop an operational mitigation.
### 10.1) Controlled experiment matrix (recommended)
Run each test for 24-72 hours, or until a wedge occurs, and record:
- number of TCP connects
- whether connections are persistent
- whether FTP is used
- whether any mode toggling is performed
- time-to-wedge
#### Test A — TCP-only (ideal baseline)
- TCP control only (2255)
- **True persistent connection** (open once, keep forever)
- No FTP
- No web mode toggling
Outcome interpretation:
- If stable: connection churn and/or FTP/mode switching is the trigger.
- If wedges anyway: pure 2255 daemon leak/bug.
#### Test B — TCP with connection churn
- Same as A but intentionally reconnect on a schedule (current SLMM behavior)
- No FTP
Outcome:
- If this wedges but A doesn't: churn is the trigger.
#### Test C — FTP activity + TCP
- Introduce scheduled FTP sessions (downloads) while using TCP control
- Observe whether wedge correlates with FTP use or with post-download periods.
Outcome:
- If wedge correlates with FTP, suspect internal service lifecycle conflict.
#### Test D — Web mode interaction (only if safe/possible)
- Evaluate what toggling web mode does to TCP/FTP services.
- Determine if any remote-safe “soft reset” exists.
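Test B's connection churn can be harnessed with a short script; a sketch (the `DOD?` command string and timings are placeholders for illustration, not a claim about NL-43 protocol framing):

```python
import socket
import time

def churn_test(host: str, port: int, cycles: int, delay: float = 0.0):
    """Open a fresh TCP connection per cycle (the churn pattern under test),
    send one small payload, and close. Returns (completed_cycles, error):
    error is None if every cycle succeeded, otherwise the OSError from the
    first failed cycle -- i.e. the moment the port wedged."""
    for i in range(cycles):
        try:
            with socket.create_connection((host, port), timeout=5) as s:
                s.sendall(b"DOD?\r\n")  # placeholder payload
        except OSError as e:
            return i, e
        time.sleep(delay)
    return cycles, None
```

Logging `completed_cycles` alongside wall-clock time-to-wedge gives exactly the "number of TCP connects" and "time-to-wedge" data points the experiment matrix calls for.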
---
## 11) Mitigation options (ranked)
### Option 1 — Make SLMM truly persistent (highest probability of success)
If the NL-43 wedges due to session churn or leaked socket states, the best mitigation is:
- Open one TCP socket per device
- Keep it open indefinitely
- Use OS keepalive
- Do **not** rotate connections on timers
- Reconnect only when the socket actually dies
This reduces:
- connect/close cycles
- NAT edge-case exposure
- resource churn inside NL-43
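The OS keepalive piece of Option 1 looks roughly like this (a sketch using Linux socket option names; the 15/10/3 values mirror SLMM's v0.3 defaults, and other platforms spell these options differently):

```python
import socket

def enable_keepalive(sock: socket.socket, idle: int = 15,
                     interval: int = 10, count: int = 3) -> None:
    """Turn on OS-level TCP keepalive on an existing socket (Linux names).
    With these values the kernel probes after `idle` seconds of silence,
    every `interval` seconds, and declares the peer dead after `count`
    unanswered probes -- at which point the app reconnects."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

Because the probes are empty ACK segments generated by the kernel, they keep the cellular NAT entry warm without sending any application-level commands to the NL-43.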
### Option 2 — Service “soft reset” (if possible without disabling required services)
If there exists any way to restart the 2255 service without power cycling:
- LAN TCP toggle (if it doesn't require web mode)
- any “restart comms” command (unknown)
- any maintenance menu sequence
then SLMM could:
- detect wedge
- trigger soft reset
- recover automatically
Current constraint: web app mode appears to disable other services, so this may not be viable.
### Option 3 — Hardware watchdog power cycle (industrial but reliable)
If this is a firmware bug with no clean workaround:
- Add a remotely controlled relay/power switch
- On wedge detection, power-cycle NL-43 automatically
- Optionally schedule a nightly power cycle to prevent leak accumulation
This is “field reality” and often the only long-term move with embedded devices.
### Option 4 — Vendor escalation (Rion)
You now have excellent evidence:
- LAN-side proof: 2255 dead while 21 alive
- WAN packet evidence
- clear isolation of RX55 innocence
This is strong enough to send to Rion support as a firmware defect report.
---
## 12) Repro “wedge bundle” checklist (for future captures)
When the wedge happens again, capture these before power cycling:
1) From server:
- `nc -vz 63.45.161.30 2255` (expect refused)
- `nc -vz 63.45.161.30 21` (expect success if FTP alive)
2) From LAN side (via switch/laptop):
- `Test-NetConnection 192.168.1.10 -Port 2255`
- `Test-NetConnection 192.168.1.10 -Port 21`
3) Optional: packet capture around the refused attempt.
4) Record:
- last successful poll timestamp
- last FTP session timestamp
- any scheduled start/stop/download cycles near wedge time
- SLMM connection reuse/rotation settings in effect
---
## 13) Final, current-state summary (as of 2026-02-18)
- The issue is **NOT** the RX55 rejecting inbound connections.
- The NL-43 is **alive**, reachable on LAN, and FTP works.
- The NL-43's **TCP control listener on 2255 stops listening** while the device remains otherwise healthy.
- The wedge can occur hours after successful operations.
- The unit is needed in the field immediately, so investigation pauses.
- Next phase: controlled tests to isolate trigger + implement mitigation (persistent socket or watchdog reset).
---
## 14) Notes / misc observations
- The Wireshark trace showed repeated FTP sessions were opened and closed cleanly, but SLMM's “FTP requests” were not valid FTP (causing `530 Not logged in`). That was part of experimentation, not a normal workflow.
- UDP “success” via netcat is not meaningful because UDP has no handshake; it simply indicates no ICMP unreachable was returned.
---
**End of document.**


@@ -38,6 +38,7 @@ class BackgroundPoller:
         self._running = False
         self._logger = logger
         self._last_cleanup = None  # Track last log cleanup time
+        self._last_pool_log = None  # Track last connection pool heartbeat log

     async def start(self):
         """Start the background polling task."""
@@ -89,6 +90,24 @@ class BackgroundPoller:
             except Exception as e:
                 self._logger.warning(f"Log cleanup failed: {e}")

+            # Log connection pool status every 15 minutes
+            try:
+                now = datetime.utcnow()
+                if self._last_pool_log is None or (now - self._last_pool_log).total_seconds() > 900:
+                    from app.services import _connection_pool
+                    stats = _connection_pool.get_stats()
+                    conns = stats.get("connections", {})
+                    if conns:
+                        for key, c in conns.items():
+                            self._logger.info(
+                                f"[POOL] {key} — age={c['age_seconds']}s idle={c['idle_seconds']}s alive={c['alive']}"
+                            )
+                    else:
+                        self._logger.info("[POOL] No active connections in pool")
+                    self._last_pool_log = now
+            except Exception as e:
+                self._logger.warning(f"Pool status log failed: {e}")
+
             # Calculate dynamic sleep interval
             sleep_time = self._calculate_sleep_interval()
             self._logger.debug(f"Sleeping for {sleep_time} seconds until next poll cycle")


@@ -29,7 +29,11 @@ logger.info("Database tables initialized")
 @asynccontextmanager
 async def lifespan(app: FastAPI):
     """Manage application lifecycle - startup and shutdown events."""
+    from app.services import _connection_pool
+
     # Startup
+    logger.info("Starting TCP connection pool cleanup task...")
+    _connection_pool.start_cleanup()
     logger.info("Starting background poller...")
     await poller.start()
     logger.info("Background poller started")
@@ -40,12 +44,15 @@ async def lifespan(app: FastAPI):
     logger.info("Stopping background poller...")
     await poller.stop()
     logger.info("Background poller stopped")
+    logger.info("Closing TCP connection pool...")
+    await _connection_pool.close_all()
+    logger.info("TCP connection pool closed")

 app = FastAPI(
     title="SLMM NL43 Addon",
     description="Standalone module for NL43 configuration and status APIs with background polling",
-    version="0.2.0",
+    version="0.3.0",
     lifespan=lifespan,
 )
@@ -85,10 +92,14 @@ async def health():
 @app.get("/health/devices")
 async def health_devices():
-    """Enhanced health check that tests device connectivity."""
+    """Enhanced health check that tests device connectivity.
+
+    Uses the connection pool to avoid unnecessary TCP handshakes — if a
+    cached connection exists and is alive, the device is reachable.
+    """
     from sqlalchemy.orm import Session
     from app.database import SessionLocal
-    from app.services import NL43Client
+    from app.services import _connection_pool
     from app.models import NL43Config

     db: Session = SessionLocal()
@@ -98,7 +109,7 @@ async def health_devices():
         configs = db.query(NL43Config).filter_by(tcp_enabled=True).all()
         for cfg in configs:
-            client = NL43Client(cfg.host, cfg.tcp_port, timeout=2.0, ftp_username=cfg.ftp_username, ftp_password=cfg.ftp_password)
+            device_key = f"{cfg.host}:{cfg.tcp_port}"
             status = {
                 "unit_id": cfg.unit_id,
                 "host": cfg.host,
@@ -108,14 +119,22 @@ async def health_devices():
             }
             try:
-                # Try to connect (don't send command to avoid rate limiting issues)
-                import asyncio
-                reader, writer = await asyncio.wait_for(
-                    asyncio.open_connection(cfg.host, cfg.tcp_port), timeout=2.0
-                )
-                writer.close()
-                await writer.wait_closed()
-                status["reachable"] = True
+                # Check if pool already has a live connection (zero-cost check)
+                pool_stats = _connection_pool.get_stats()
+                conn_info = pool_stats["connections"].get(device_key)
+                if conn_info and conn_info["alive"]:
+                    status["reachable"] = True
+                    status["source"] = "pool"
+                else:
+                    # No cached connection — do a lightweight acquire/release
+                    # This opens a connection if needed but keeps it in the pool
+                    import asyncio
+                    reader, writer, from_cache = await _connection_pool.acquire(
+                        device_key, cfg.host, cfg.tcp_port, timeout=2.0
+                    )
+                    await _connection_pool.release(device_key, reader, writer, cfg.host, cfg.tcp_port)
+                    status["reachable"] = True
+                    status["source"] = "cached" if from_cache else "new"
             except Exception as e:
                 status["error"] = str(type(e).__name__)
                 logger.warning(f"Device {cfg.unit_id} health check failed: {e}")


@@ -93,6 +93,34 @@ class PollingConfigPayload(BaseModel):
     poll_enabled: bool | None = Field(None, description="Enable or disable background polling for this device")

+# ============================================================================
+# TCP CONNECTION POOL ENDPOINTS (must be before /{unit_id} routes)
+# ============================================================================
+
+@router.get("/_connections/status")
+async def get_connection_pool_status():
+    """Get status of the persistent TCP connection pool.
+
+    Returns information about cached connections, keepalive settings,
+    and per-device connection age/idle times.
+    """
+    from app.services import _connection_pool
+    return {"status": "ok", "pool": _connection_pool.get_stats()}
+
+
+@router.post("/_connections/flush")
+async def flush_connection_pool():
+    """Close all cached TCP connections.
+
+    Useful for debugging or forcing fresh connections to all devices.
+    """
+    from app.services import _connection_pool
+    await _connection_pool.close_all()
+    # Restart cleanup task since close_all cancels it
+    _connection_pool.start_cleanup()
+    return {"status": "ok", "message": "All cached connections closed"}
+
 # ============================================================================
 # GLOBAL POLLING STATUS ENDPOINT (must be before /{unit_id} routes)
 # ============================================================================
@@ -545,12 +573,6 @@ async def stop_measurement(unit_id: str, db: Session = Depends(get_db)):
     try:
         await client.stop()
         logger.info(f"Stopped measurement on unit {unit_id}")
-
-        # Query device status to update database with "Stop" state
-        snap = await client.request_dod()
-        snap.unit_id = unit_id
-        persist_snapshot(snap, db)
     except ConnectionError as e:
         logger.error(f"Failed to stop measurement on {unit_id}: {e}")
         raise HTTPException(status_code=502, detail="Failed to communicate with device")
@@ -560,6 +582,15 @@ async def stop_measurement(unit_id: str, db: Session = Depends(get_db)):
     except Exception as e:
         logger.error(f"Unexpected error stopping measurement on {unit_id}: {e}")
         raise HTTPException(status_code=500, detail="Internal server error")

+    # Query device status to update database — non-fatal if this fails
+    try:
+        snap = await client.request_dod()
+        snap.unit_id = unit_id
+        persist_snapshot(snap, db)
+    except Exception as e:
+        logger.warning(f"Stop succeeded but failed to update status for {unit_id}: {e}")
+
     return {"status": "ok", "message": "Measurement stopped"}
@@ -657,8 +688,9 @@ async def stop_cycle(unit_id: str, payload: StopCyclePayload = None, db: Session
         return {"status": "ok", "unit_id": unit_id, **result}
     except Exception as e:
-        logger.error(f"Stop cycle failed for {unit_id}: {e}")
-        raise HTTPException(status_code=502, detail=str(e))
+        error_msg = str(e) if str(e) else f"{type(e).__name__}: No details available"
+        logger.error(f"Stop cycle failed for {unit_id}: {error_msg}")
+        raise HTTPException(status_code=502, detail=error_msg)

 @router.post("/{unit_id}/store")
@@ -1723,74 +1755,38 @@ async def run_diagnostics(unit_id: str, db: Session = Depends(get_db)):
             "message": "TCP communication enabled"
         }

-        # Test 3: Modem/Router reachable (check port 443 HTTPS)
+        # Test 3: TCP connection reachable (device port) — uses connection pool
+        # This avoids extra TCP handshakes over cellular. If a cached connection
+        # exists and is alive, we skip the handshake entirely.
+        from app.services import _connection_pool
+        device_key = f"{cfg.host}:{cfg.tcp_port}"
         try:
-            reader, writer = await asyncio.wait_for(
-                asyncio.open_connection(cfg.host, 443), timeout=3.0
-            )
-            writer.close()
-            await writer.wait_closed()
-            diagnostics["tests"]["modem_reachable"] = {
-                "status": "pass",
-                "message": f"Modem/router reachable at {cfg.host}"
-            }
-        except asyncio.TimeoutError:
-            diagnostics["tests"]["modem_reachable"] = {
-                "status": "fail",
-                "message": f"Modem/router timeout at {cfg.host} (network issue)"
-            }
-            diagnostics["overall_status"] = "fail"
-            return diagnostics
-        except ConnectionRefusedError:
-            # Connection refused means host is up but port 443 closed - that's ok
-            diagnostics["tests"]["modem_reachable"] = {
-                "status": "pass",
-                "message": f"Modem/router reachable at {cfg.host} (HTTPS closed)"
-            }
-        except Exception as e:
-            diagnostics["tests"]["modem_reachable"] = {
-                "status": "fail",
-                "message": f"Cannot reach modem/router at {cfg.host}: {str(e)}"
-            }
-            diagnostics["overall_status"] = "fail"
-            return diagnostics
-
-        # Test 4: TCP connection reachable (device port)
-        try:
-            reader, writer = await asyncio.wait_for(
-                asyncio.open_connection(cfg.host, cfg.tcp_port), timeout=3.0
-            )
-            writer.close()
-            await writer.wait_closed()
-            diagnostics["tests"]["tcp_connection"] = {
-                "status": "pass",
-                "message": f"TCP connection successful to {cfg.host}:{cfg.tcp_port}"
-            }
-        except asyncio.TimeoutError:
-            diagnostics["tests"]["tcp_connection"] = {
-                "status": "fail",
-                "message": f"Connection timeout to {cfg.host}:{cfg.tcp_port}"
-            }
-            diagnostics["overall_status"] = "fail"
-            return diagnostics
-        except ConnectionRefusedError:
-            diagnostics["tests"]["tcp_connection"] = {
-                "status": "fail",
-                "message": f"Connection refused by {cfg.host}:{cfg.tcp_port}"
-            }
-            diagnostics["overall_status"] = "fail"
-            return diagnostics
+            pool_stats = _connection_pool.get_stats()
+            conn_info = pool_stats["connections"].get(device_key)
+            if conn_info and conn_info["alive"]:
+                # Pool already has a live connection — device is reachable
+                diagnostics["tests"]["tcp_connection"] = {
+                    "status": "pass",
+                    "message": f"TCP connection alive in pool for {cfg.host}:{cfg.tcp_port}"
+                }
+            else:
+                # Acquire through the pool (opens new if needed, keeps it cached)
+                reader, writer, from_cache = await _connection_pool.acquire(
+                    device_key, cfg.host, cfg.tcp_port, timeout=3.0
+                )
+                await _connection_pool.release(device_key, reader, writer, cfg.host, cfg.tcp_port)
+                diagnostics["tests"]["tcp_connection"] = {
+                    "status": "pass",
+                    "message": f"TCP connection successful to {cfg.host}:{cfg.tcp_port}"
+                }
        except Exception as e:
             diagnostics["tests"]["tcp_connection"] = {
                 "status": "fail",
-                "message": f"Connection error: {str(e)}"
+                "message": f"Connection error to {cfg.host}:{cfg.tcp_port}: {str(e)}"
             }
             diagnostics["overall_status"] = "fail"
             return diagnostics

-        # Wait a bit after connection test to let device settle
-        await asyncio.sleep(1.5)

         # Test 5: Device responds to commands
         # Use longer timeout to account for rate limiting (device requires ≥1s between commands)
         client = NL43Client(cfg.host, cfg.tcp_port, timeout=10.0, ftp_username=cfg.ftp_username, ftp_password=cfg.ftp_password)


@@ -1,20 +1,22 @@
 """
 NL43 TCP connector and snapshot persistence.

-Implements simple per-request TCP calls to avoid long-lived socket complexity.
-Extend to pooled connections/DRD streaming later.
+Implements persistent per-device TCP connections with OS-level keepalive
+to reduce handshake overhead and survive cellular modem NAT timeouts.
+Falls back to per-request connections on error with transparent retry.
 """

 import asyncio
 import contextlib
 import logging
+import socket
 import time
 import os
 import zipfile
 import tempfile
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 from datetime import datetime, timezone, timedelta
-from typing import Optional, List, Dict
+from typing import Optional, List, Dict, Tuple
 from sqlalchemy.orm import Session
 from ftplib import FTP
 from pathlib import Path
@@ -234,6 +236,293 @@ async def _get_device_lock(device_key: str) -> asyncio.Lock:
     return _device_locks[device_key]

+
+# ---------------------------------------------------------------------------
+# Persistent TCP connection pool with OS-level keepalive
+# ---------------------------------------------------------------------------
+
+# Configuration via environment variables
+TCP_PERSISTENT_ENABLED = os.getenv("TCP_PERSISTENT_ENABLED", "true").lower() == "true"
+TCP_IDLE_TTL = float(os.getenv("TCP_IDLE_TTL", "300"))  # Close idle connections after N seconds
+TCP_MAX_AGE = float(os.getenv("TCP_MAX_AGE", "1800"))  # Force reconnect after N seconds
+TCP_KEEPALIVE_IDLE = int(os.getenv("TCP_KEEPALIVE_IDLE", "15"))  # Seconds idle before probes
+TCP_KEEPALIVE_INTERVAL = int(os.getenv("TCP_KEEPALIVE_INTERVAL", "10"))  # Seconds between probes
+TCP_KEEPALIVE_COUNT = int(os.getenv("TCP_KEEPALIVE_COUNT", "3"))  # Failed probes before dead
+
+logger.info(
+    f"TCP connection pool: persistent={TCP_PERSISTENT_ENABLED}, "
+    f"idle_ttl={TCP_IDLE_TTL}s, max_age={TCP_MAX_AGE}s, "
+    f"keepalive_idle={TCP_KEEPALIVE_IDLE}s, keepalive_interval={TCP_KEEPALIVE_INTERVAL}s, "
+    f"keepalive_count={TCP_KEEPALIVE_COUNT}"
+)
+
+
+@dataclass
+class DeviceConnection:
+    """Tracks a cached TCP connection and its metadata."""
+    reader: asyncio.StreamReader
+    writer: asyncio.StreamWriter
+    device_key: str
+    host: str
+    port: int
+    created_at: float = field(default_factory=time.time)
+    last_used_at: float = field(default_factory=time.time)
+
+
+class ConnectionPool:
+    """Per-device persistent TCP connection cache with OS-level keepalive.
+
+    Each NL-43 device supports only one TCP connection at a time. This pool
+    caches that single connection per device key and reuses it across commands,
+    avoiding repeated TCP handshakes over high-latency cellular links.
+    Keepalive probes keep cellular NAT tables alive and detect dead connections
+    before the next command attempt.
+    """
+
+    def __init__(
+        self,
+        enable_persistent: bool = True,
+        idle_ttl: float = 120.0,
+        max_age: float = 300.0,
+        keepalive_idle: int = 15,
+        keepalive_interval: int = 10,
+        keepalive_count: int = 3,
+    ):
+        self._connections: Dict[str, DeviceConnection] = {}
+        self._lock = asyncio.Lock()
+        self._enable_persistent = enable_persistent
+        self._idle_ttl = idle_ttl
+        self._max_age = max_age
+        self._keepalive_idle = keepalive_idle
+        self._keepalive_interval = keepalive_interval
+        self._keepalive_count = keepalive_count
+        self._cleanup_task: Optional[asyncio.Task] = None
+
+    # -- lifecycle ----------------------------------------------------------
+
+    def start_cleanup(self):
+        """Start background task that evicts stale connections."""
+        if self._enable_persistent and self._cleanup_task is None:
+            self._cleanup_task = asyncio.create_task(self._cleanup_loop())
+            logger.info("Connection pool cleanup task started")
+
+    async def close_all(self):
+        """Close all cached connections (called at shutdown)."""
+        if self._cleanup_task is not None:
+            self._cleanup_task.cancel()
+            with contextlib.suppress(asyncio.CancelledError):
+                await self._cleanup_task
+            self._cleanup_task = None
+        async with self._lock:
+            for key, conn in list(self._connections.items()):
+                await self._close_connection(conn, reason="shutdown")
+            self._connections.clear()
+        logger.info("Connection pool: all connections closed")
+
+    # -- public API ---------------------------------------------------------
+
+    async def acquire(
+        self, device_key: str, host: str, port: int, timeout: float
+    ) -> Tuple[asyncio.StreamReader, asyncio.StreamWriter, bool]:
+        """Get a connection for a device (cached or fresh).
+
+        Returns:
+            (reader, writer, from_cache) — from_cache is True if reused.
+        """
+        if self._enable_persistent:
+            async with self._lock:
+                conn = self._connections.pop(device_key, None)
+                if conn is not None:
+                    if self._is_alive(conn):
+                        self._drain_buffer(conn.reader)
+                        conn.last_used_at = time.time()
+                        logger.info(f"Pool hit for {device_key} (age={time.time() - conn.created_at:.0f}s)")
+                        return conn.reader, conn.writer, True
+                    else:
+                        await self._close_connection(conn, reason="stale")
+
+        # Open fresh connection
+        reader, writer = await self._open_connection(host, port, timeout)
+        logger.info(f"New connection opened for {device_key}")
+        return reader, writer, False
+
+    async def release(self, device_key: str, reader: asyncio.StreamReader, writer: asyncio.StreamWriter, host: str, port: int):
+        """Return a connection to the pool for reuse."""
+        if not self._enable_persistent:
+            self._close_writer(writer)
+            return
+
+        # Check transport is still healthy before caching
+        if writer.transport.is_closing() or reader.at_eof():
+            self._close_writer(writer)
+            return
+
+        conn = DeviceConnection(
+            reader=reader,
+            writer=writer,
+            device_key=device_key,
+            host=host,
+            port=port,
+        )
+        async with self._lock:
+            # Evict any existing connection for this device (shouldn't happen
+            # under normal locking, but be safe)
+            old = self._connections.pop(device_key, None)
+            if old is not None:
+                await self._close_connection(old, reason="replaced")
+            self._connections[device_key] = conn
+
+    async def discard(self, device_key: str):
+        """Close and remove a connection from the pool (called on errors)."""
+        async with self._lock:
+            conn = self._connections.pop(device_key, None)
+            if conn is not None:
+                await self._close_connection(conn, reason="discarded")
+                logger.debug(f"Pool discard for {device_key}")
+
+    def get_stats(self) -> dict:
+        """Return pool status for diagnostics."""
+        now = time.time()
+        connections = {}
+        for key, conn in self._connections.items():
+            connections[key] = {
+                "host": conn.host,
+                "port": conn.port,
+                "age_seconds": round(now - conn.created_at, 1),
+                "idle_seconds": round(now - conn.last_used_at, 1),
+                "alive": self._is_alive(conn),
+            }
+        return {
+            "enabled": self._enable_persistent,
+            "active_connections": len(self._connections),
+            "idle_ttl": self._idle_ttl,
+            "max_age": self._max_age,
+            "keepalive_idle": self._keepalive_idle,
+            "keepalive_interval": self._keepalive_interval,
+            "keepalive_count": self._keepalive_count,
+            "connections": connections,
+        }
+
+    # -- internals ----------------------------------------------------------
+
+    async def _open_connection(
+        self, host: str, port: int, timeout: float
+    ) -> Tuple[asyncio.StreamReader, asyncio.StreamWriter]:
+        """Open a new TCP connection with keepalive options set."""
+        try:
+            reader, writer = await asyncio.wait_for(
+                asyncio.open_connection(host, port), timeout=timeout
+            )
+        except asyncio.TimeoutError:
+            raise ConnectionError(f"Failed to connect to device at {host}:{port}")
+        except Exception as e:
+            raise ConnectionError(f"Failed to connect to device: {e}")
+
+        # Set TCP keepalive on the underlying socket
+        self._set_keepalive(writer)
+        return reader, writer
+
+    def _set_keepalive(self, writer: asyncio.StreamWriter):
+        """Configure OS-level TCP keepalive on the connection socket."""
+        try:
+            sock = writer.transport.get_extra_info("socket")
+            if sock is None:
+                logger.warning("Could not access underlying socket for keepalive")
+                return
+            sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
+            # Linux-specific keepalive tuning
+            if hasattr(socket, "TCP_KEEPIDLE"):
+                sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, self._keepalive_idle)
+            if hasattr(socket, "TCP_KEEPINTVL"):
+                sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, self._keepalive_interval)
+            if hasattr(socket, "TCP_KEEPCNT"):
+                sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, self._keepalive_count)
+            logger.debug(
+                f"TCP keepalive set: idle={self._keepalive_idle}s, "
+                f"interval={self._keepalive_interval}s, count={self._keepalive_count}"
+            )
+        except OSError as e:
+            logger.warning(f"Failed to set TCP keepalive: {e}")
+
+    def _is_alive(self, conn: DeviceConnection) -> bool:
+        """Check whether a cached connection is still usable."""
+        now = time.time()
+        # Age / idle checks (value of -1 disables the check)
+        if self._idle_ttl >= 0 and now - conn.last_used_at > self._idle_ttl:
+            logger.debug(f"Connection {conn.device_key} idle too long ({now - conn.last_used_at:.0f}s > {self._idle_ttl}s)")
+            return False
+        if self._max_age >= 0 and now - conn.created_at > self._max_age:
+            logger.debug(f"Connection {conn.device_key} too old ({now - conn.created_at:.0f}s > {self._max_age}s)")
+            return False
+        # Transport-level checks
+        transport = conn.writer.transport
+        if transport.is_closing():
+            logger.debug(f"Connection {conn.device_key} transport is closing")
+            return False
+        if conn.reader.at_eof():
+            logger.debug(f"Connection {conn.device_key} reader at EOF")
+            return False
+        return True
+
+    @staticmethod
+    def _drain_buffer(reader: asyncio.StreamReader):
+        """Drain any pending bytes (e.g. '$' prompt) from an idle connection."""
+        buf = reader._buffer  # noqa: SLF001 — internal but stable across CPython
+        if buf:
+            pending = bytes(buf)
+            buf.clear()
+            logger.debug(f"Drained {len(pending)} bytes from cached connection: {pending!r}")
+
+    @staticmethod
+    def _close_writer(writer: asyncio.StreamWriter):
+        """Close a writer, suppressing errors."""
+        try:
+            writer.close()
+        except Exception:
+            pass
+
+    async def _close_connection(self, conn: DeviceConnection, reason: str = ""):
+        """Fully close a cached connection."""
+        logger.debug(f"Closing connection {conn.device_key} ({reason})")
+        conn.writer.close()
+        with contextlib.suppress(Exception):
+            await conn.writer.wait_closed()
+
+    async def _cleanup_loop(self):
+        """Periodically evict idle/expired connections."""
+        try:
+            while True:
+                await asyncio.sleep(30)
+                async with self._lock:
+                    for key in list(self._connections):
+                        conn = self._connections[key]
+                        if not self._is_alive(conn):
+                            del self._connections[key]
+                            await self._close_connection(conn, reason="cleanup")
+        except asyncio.CancelledError:
+            pass
+
+
+# Module-level pool singleton
+_connection_pool = ConnectionPool(
+    enable_persistent=TCP_PERSISTENT_ENABLED,
+    idle_ttl=TCP_IDLE_TTL,
+    max_age=TCP_MAX_AGE,
+    keepalive_idle=TCP_KEEPALIVE_IDLE,
+    keepalive_interval=TCP_KEEPALIVE_INTERVAL,
+    keepalive_count=TCP_KEEPALIVE_COUNT,
+)
 class NL43Client:
     def __init__(self, host: str, port: int, timeout: float = 5.0, ftp_username: str = None, ftp_password: str = None, ftp_port: int = 21):
         self.host = host
@@ -245,7 +534,12 @@ class NL43Client:
         self.device_key = f"{host}:{port}"

     async def _enforce_rate_limit(self):
-        """Ensure ≥1 second between commands to the same device."""
+        """Ensure ≥1 second between commands to the same device.
+
+        NL43 protocol requires ≥1s after the device responds before sending
+        the next command. The timestamp is recorded after each command completes
+        (connection closed), so we measure from completion, not from send time.
+        """
         async with _rate_limit_lock:
             last_time = _last_command_time.get(self.device_key, 0)
             elapsed = time.time() - last_time
@@ -253,7 +547,6 @@ class NL43Client:
                 wait_time = 1.0 - elapsed
                 logger.debug(f"Rate limiting: waiting {wait_time:.2f}s for {self.device_key}")
                 await asyncio.sleep(wait_time)
-            _last_command_time[self.device_key] = time.time()

     async def _send_command(self, cmd: str) -> str:
         """Send ASCII command to NL43 device via TCP.
@@ -271,23 +564,62 @@ class NL43Client:
             return await self._send_command_unlocked(cmd)

     async def _send_command_unlocked(self, cmd: str) -> str:
-        """Internal: send command without acquiring device lock (lock must be held by caller)."""
+        """Internal: send command without acquiring device lock (lock must be held by caller).
+
+        Uses the connection pool to reuse cached TCP connections when possible.
+        If a cached connection fails, retries once with a fresh connection.
+        """
         await self._enforce_rate_limit()

         logger.info(f"Sending command to {self.device_key}: {cmd.strip()}")
         try:
-            reader, writer = await asyncio.wait_for(
-                asyncio.open_connection(self.host, self.port), timeout=self.timeout
+            reader, writer, from_cache = await _connection_pool.acquire(
+                self.device_key, self.host, self.port, self.timeout
             )
-        except asyncio.TimeoutError:
-            logger.error(f"Connection timeout to {self.device_key}")
-            raise ConnectionError(f"Failed to connect to device at {self.host}:{self.port}")
-        except Exception as e:
-            logger.error(f"Connection failed to {self.device_key}: {e}")
-            raise ConnectionError(f"Failed to connect to device: {str(e)}")
+        except ConnectionError:
+            logger.error(f"Connection failed to {self.device_key}")
+            raise

         try:
+            response = await self._execute_command(reader, writer, cmd)
+            # Success — return connection to pool for reuse
+            await _connection_pool.release(self.device_key, reader, writer, self.host, self.port)
+            _last_command_time[self.device_key] = time.time()
+            return response
+        except Exception as e:
+            # Discard the bad connection
+            await _connection_pool.discard(self.device_key)
+            ConnectionPool._close_writer(writer)
+
+            if from_cache:
+                # Retry once with a fresh connection — the cached one may have gone stale
+                logger.warning(f"Cached connection failed for {self.device_key}, retrying fresh: {e}")
+                await self._enforce_rate_limit()
+                try:
+                    reader, writer, _ = await _connection_pool.acquire(
+                        self.device_key, self.host, self.port, self.timeout
+                    )
+                except ConnectionError:
+                    logger.error(f"Retry connection also failed to {self.device_key}")
+                    raise
+                try:
+                    response = await self._execute_command(reader, writer, cmd)
+                    await _connection_pool.release(self.device_key, reader, writer, self.host, self.port)
+                    _last_command_time[self.device_key] = time.time()
+                    return response
+                except Exception:
+                    await _connection_pool.discard(self.device_key)
+                    ConnectionPool._close_writer(writer)
+                    raise
+            else:
+                raise
+
+    async def _execute_command(self, reader: asyncio.StreamReader, writer: asyncio.StreamWriter, cmd: str) -> str:
+        """Send a command over an existing connection and parse the NL43 response."""
         writer.write(cmd.encode("ascii"))
         await writer.drain()
@@ -303,7 +635,7 @@ class NL43Client:
         # Check result code
         if result_code == "R+0000":
-            # Success - for query commands, read the second line with actual data
+            # Success for query commands, read the second line with actual data
             is_query = cmd.strip().endswith("?")
             if is_query:
                 data_line = await asyncio.wait_for(reader.readuntil(b"\n"), timeout=self.timeout)
@@ -311,7 +643,7 @@ class NL43Client:
                 logger.debug(f"Data line from {self.device_key}: {response}")
                 return response
             else:
-                # Setting command - return success code
+                # Setting command return success code
                 return result_code
         elif result_code == "R+0001":
             raise ValueError("Command error - device did not recognize command")
@@ -324,17 +656,6 @@ class NL43Client:
         else:
             raise ValueError(f"Unknown result code: {result_code}")
-        except asyncio.TimeoutError:
-            logger.error(f"Response timeout from {self.device_key}")
-            raise TimeoutError(f"Device did not respond within {self.timeout}s")
-        except Exception as e:
-            logger.error(f"Communication error with {self.device_key}: {e}")
-            raise
-        finally:
-            writer.close()
-            with contextlib.suppress(Exception):
-                await writer.wait_closed()
async def request_dod(self) -> NL43Snapshot: async def request_dod(self) -> NL43Snapshot:
"""Request DOD (Data Output Display) snapshot from device. """Request DOD (Data Output Display) snapshot from device.
@@ -575,20 +896,19 @@ class NL43Client:
# Acquire per-device lock - held for entire streaming session # Acquire per-device lock - held for entire streaming session
device_lock = await _get_device_lock(self.device_key) device_lock = await _get_device_lock(self.device_key)
async with device_lock: async with device_lock:
# Evict any cached connection — streaming needs its own dedicated socket
await _connection_pool.discard(self.device_key)
await self._enforce_rate_limit() await self._enforce_rate_limit()
logger.info(f"Starting DRD stream for {self.device_key}") logger.info(f"Starting DRD stream for {self.device_key}")
try: try:
reader, writer = await asyncio.wait_for( reader, writer = await _connection_pool._open_connection(
asyncio.open_connection(self.host, self.port), timeout=self.timeout self.host, self.port, self.timeout
) )
except asyncio.TimeoutError: except ConnectionError:
logger.error(f"DRD stream connection timeout to {self.device_key}") logger.error(f"DRD stream connection failed to {self.device_key}")
raise ConnectionError(f"Failed to connect to device at {self.host}:{self.port}") raise
except Exception as e:
logger.error(f"DRD stream connection failed to {self.device_key}: {e}")
raise ConnectionError(f"Failed to connect to device: {str(e)}")
try: try:
# Start DRD streaming # Start DRD streaming
@@ -1381,11 +1701,42 @@ class NL43Client:
            result["stopped"] = True
            logger.info(f"[STOP-CYCLE] Measurement stopped")

-        # Step 2: Enable FTP
-        logger.info(f"[STOP-CYCLE] Step 2: Enabling FTP")
+        # Step 2: Reset FTP (disable then enable) to clear any stale state
+        logger.info(f"[STOP-CYCLE] Step 2: Resetting FTP (disable then enable)")
+        try:
+            await self.disable_ftp()
+            logger.info(f"[STOP-CYCLE] FTP disabled")
+        except Exception as e:
+            logger.warning(f"[STOP-CYCLE] FTP disable failed (may already be off): {e}")
         await self.enable_ftp()
-        result["ftp_enabled"] = True
-        logger.info(f"[STOP-CYCLE] FTP enabled")
+        logger.info(f"[STOP-CYCLE] FTP enable command sent")
+
+        # Step 2b: Wait and verify FTP is ready (NL-43 needs time to start FTP server)
+        ftp_ready_timeout = 30  # seconds
+        ftp_check_interval = 2  # seconds
+        ftp_ready = False
+        elapsed = 0
+        logger.info(f"[STOP-CYCLE] Step 2b: Waiting up to {ftp_ready_timeout}s for FTP server to be ready")
+        while elapsed < ftp_ready_timeout:
+            await asyncio.sleep(ftp_check_interval)
+            elapsed += ftp_check_interval
+            try:
+                ftp_status = await self.get_ftp_status()
+                logger.info(f"[STOP-CYCLE] FTP status check at {elapsed}s: {ftp_status}")
+                if ftp_status.lower() == "on":
+                    ftp_ready = True
+                    logger.info(f"[STOP-CYCLE] FTP server confirmed ready after {elapsed}s")
+                    break
+            except Exception as e:
+                logger.warning(f"[STOP-CYCLE] FTP status check failed at {elapsed}s: {e}")
+
+        if ftp_ready:
+            result["ftp_enabled"] = True
+            logger.info(f"[STOP-CYCLE] FTP enabled and verified")
+        else:
+            logger.warning(f"[STOP-CYCLE] FTP not confirmed ready after {ftp_ready_timeout}s, proceeding anyway")
+            result["ftp_enabled"] = True  # Command was sent, just not verified

        if not download:
            logger.info(f"[STOP-CYCLE] === Cycle complete (download=False) ===")
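The Step 2b loop in the hunk above is a poll-until-ready pattern: send the command, then poll status at a fixed interval up to a deadline, treating transient errors as "not ready yet". The same pattern as a reusable sketch (a hypothetical `wait_until` helper, not code from the repo):

```python
import asyncio


async def wait_until(check, timeout: float = 30.0, interval: float = 2.0) -> bool:
    """Poll `check()` (an async callable returning bool) every `interval` seconds.

    Returns True as soon as a check succeeds, False once `timeout` elapses.
    Exceptions raised by `check` count as "not ready yet", mirroring how the
    FTP loop logs and retries failed status checks.
    """
    elapsed = 0.0
    while elapsed < timeout:
        await asyncio.sleep(interval)
        elapsed += interval
        try:
            if await check():
                return True
        except Exception:
            pass  # transient failure: keep polling until the deadline
    return False
```

Proceeding even when `wait_until` returns False, as the diff does, is a deliberate trade-off: the download step gets attempted either way, and its own error handling catches a genuinely dead FTP server.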


@@ -3,7 +3,7 @@
 <head>
   <meta charset="UTF-8" />
   <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-  <title>SLMM Roster - Sound Level Meter Configuration</title>
+  <title>SLMM - Device Roster &amp; Connections</title>
   <style>
     * { box-sizing: border-box; }
     body {
@@ -227,19 +227,119 @@
     }
     .toast-success { background: #2da44e; }
     .toast-error { background: #cf222e; }
/* Tabs */
.tabs {
display: flex;
gap: 0;
margin-bottom: 0;
border-bottom: 2px solid #d0d7de;
}
.tab-btn {
padding: 10px 20px;
border: none;
background: none;
cursor: pointer;
font-size: 14px;
font-weight: 600;
color: #57606a;
border-bottom: 2px solid transparent;
margin-bottom: -2px;
transition: color 0.2s, border-color 0.2s;
}
.tab-btn:hover { color: #24292f; }
.tab-btn.active {
color: #24292f;
border-bottom-color: #fd8c73;
}
.tab-panel { display: none; }
.tab-panel.active { display: block; }
/* Connection pool panel */
.pool-config {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(180px, 1fr));
gap: 12px;
margin-bottom: 20px;
}
.pool-config-card {
background: #f6f8fa;
border: 1px solid #d0d7de;
border-radius: 6px;
padding: 12px;
}
.pool-config-card .label {
font-size: 11px;
color: #57606a;
text-transform: uppercase;
font-weight: 600;
margin-bottom: 4px;
}
.pool-config-card .value {
font-size: 18px;
font-weight: 600;
color: #24292f;
}
.conn-card {
background: white;
border: 1px solid #d0d7de;
border-radius: 6px;
padding: 16px;
margin-bottom: 12px;
}
.conn-card-header {
display: flex;
justify-content: space-between;
align-items: center;
margin-bottom: 12px;
}
.conn-card-header strong { font-size: 15px; }
.conn-card-grid {
display: grid;
grid-template-columns: repeat(auto-fill, minmax(140px, 1fr));
gap: 8px;
}
.conn-stat .label {
font-size: 11px;
color: #57606a;
text-transform: uppercase;
font-weight: 600;
}
.conn-stat .value {
font-size: 14px;
font-weight: 600;
color: #24292f;
}
.conn-empty {
text-align: center;
padding: 32px;
color: #57606a;
}
.pool-actions {
display: flex;
gap: 8px;
margin-bottom: 16px;
}
  </style>
</head>
<body>
  <div class="container">
    <div class="header">
-      <h1>📊 Sound Level Meter Roster</h1>
+      <h1>SLMM - Roster &amp; Connections</h1>
      <div class="nav">
-        <a href="/" class="btn"> Back to Control Panel</a>
+        <a href="/" class="btn">&larr; Back to Control Panel</a>
        <button class="btn btn-primary" onclick="openAddModal()">+ Add Device</button>
      </div>
    </div>

-    <div class="table-container">
+    <div class="tabs">
+      <button class="tab-btn active" onclick="switchTab('roster')">Device Roster</button>
+      <button class="tab-btn" onclick="switchTab('connections')">Connections</button>
+    </div>
+
+    <!-- Roster Tab -->
+    <div id="tab-roster" class="tab-panel active">
+      <div class="table-container" style="border-top-left-radius: 0; border-top-right-radius: 0;">
      <table id="rosterTable">
        <thead>
          <tr>
@@ -265,6 +365,30 @@
      </div>
    </div>
<!-- Connections Tab -->
<div id="tab-connections" class="tab-panel">
<div class="table-container" style="padding: 20px; border-top-left-radius: 0; border-top-right-radius: 0;">
<div class="pool-actions">
<button class="btn" onclick="loadConnections()">Refresh</button>
<button class="btn btn-danger" onclick="flushConnections()">Flush All Connections</button>
</div>
<h3 style="margin: 0 0 12px 0; font-size: 16px;">Pool Configuration</h3>
<div id="poolConfig" class="pool-config">
<div class="pool-config-card">
<div class="label">Status</div>
<div class="value" id="poolEnabled">--</div>
</div>
</div>
<h3 style="margin: 20px 0 12px 0; font-size: 16px;">Active Connections</h3>
<div id="connectionsList">
<div class="conn-empty">Loading...</div>
</div>
</div>
</div>
</div>
  <!-- Add/Edit Modal -->
  <div id="deviceModal" class="modal">
    <div class="modal-content">
@@ -619,6 +743,159 @@
        closeModal();
      }
    });
// ========== Tab Switching ==========
function switchTab(tabName) {
document.querySelectorAll('.tab-btn').forEach(btn => btn.classList.remove('active'));
document.querySelectorAll('.tab-panel').forEach(panel => panel.classList.remove('active'));
document.querySelector(`.tab-btn[onclick="switchTab('${tabName}')"]`).classList.add('active');
document.getElementById(`tab-${tabName}`).classList.add('active');
if (tabName === 'connections') {
loadConnections();
}
}
// ========== Connection Pool ==========
let connectionsRefreshTimer = null;
async function loadConnections() {
try {
const res = await fetch('/api/nl43/_connections/status');
const data = await res.json();
if (!res.ok) {
showToast('Failed to load connection pool status', 'error');
return;
}
const pool = data.pool;
renderPoolConfig(pool);
renderConnections(pool.connections);
// Auto-refresh while tab is active
clearTimeout(connectionsRefreshTimer);
if (document.getElementById('tab-connections').classList.contains('active')) {
connectionsRefreshTimer = setTimeout(loadConnections, 5000);
}
} catch (err) {
showToast('Error loading connections: ' + err.message, 'error');
console.error('Load connections error:', err);
}
}
function renderPoolConfig(pool) {
document.getElementById('poolConfig').innerHTML = `
<div class="pool-config-card">
<div class="label">Persistent</div>
<div class="value" style="color: ${pool.enabled ? '#1a7f37' : '#cf222e'}">${pool.enabled ? 'Enabled' : 'Disabled'}</div>
</div>
<div class="pool-config-card">
<div class="label">Active</div>
<div class="value">${pool.active_connections}</div>
</div>
<div class="pool-config-card">
<div class="label">Idle TTL</div>
<div class="value">${pool.idle_ttl}s</div>
</div>
<div class="pool-config-card">
<div class="label">Max Age</div>
<div class="value">${pool.max_age}s</div>
</div>
<div class="pool-config-card">
<div class="label">KA Idle</div>
<div class="value">${pool.keepalive_idle}s</div>
</div>
<div class="pool-config-card">
<div class="label">KA Interval</div>
<div class="value">${pool.keepalive_interval}s</div>
</div>
<div class="pool-config-card">
<div class="label">KA Probes</div>
<div class="value">${pool.keepalive_count}</div>
</div>
`;
}
function renderConnections(connections) {
const container = document.getElementById('connectionsList');
const keys = Object.keys(connections);
if (keys.length === 0) {
container.innerHTML = `
<div class="conn-empty">
<div style="font-size: 32px; margin-bottom: 8px;">~</div>
<div><strong>No active connections</strong></div>
<div style="margin-top: 4px; font-size: 13px;">
Connections appear here when devices are actively being polled and the connection is cached between commands.
</div>
</div>
`;
return;
}
container.innerHTML = keys.map(key => {
const conn = connections[key];
const aliveColor = conn.alive ? '#1a7f37' : '#cf222e';
const aliveText = conn.alive ? 'Alive' : 'Stale';
return `
<div class="conn-card">
<div class="conn-card-header">
<strong>${escapeHtml(key)}</strong>
<span class="status-badge ${conn.alive ? 'status-ok' : 'status-error'}">${aliveText}</span>
</div>
<div class="conn-card-grid">
<div class="conn-stat">
<div class="label">Host</div>
<div class="value">${escapeHtml(conn.host)}</div>
</div>
<div class="conn-stat">
<div class="label">Port</div>
<div class="value">${conn.port}</div>
</div>
<div class="conn-stat">
<div class="label">Age</div>
<div class="value">${formatSeconds(conn.age_seconds)}</div>
</div>
<div class="conn-stat">
<div class="label">Idle</div>
<div class="value">${formatSeconds(conn.idle_seconds)}</div>
</div>
</div>
</div>
`;
}).join('');
}
function formatSeconds(s) {
if (s < 60) return Math.round(s) + 's';
if (s < 3600) return Math.floor(s / 60) + 'm ' + Math.round(s % 60) + 's';
return Math.floor(s / 3600) + 'h ' + Math.floor((s % 3600) / 60) + 'm';
}
async function flushConnections() {
if (!confirm('Close all cached TCP connections?\n\nDevices will reconnect on the next poll cycle.')) {
return;
}
try {
const res = await fetch('/api/nl43/_connections/flush', { method: 'POST' });
const data = await res.json();
if (!res.ok) {
showToast(data.detail || 'Failed to flush connections', 'error');
return;
}
showToast('All connections flushed', 'success');
await loadConnections();
} catch (err) {
showToast('Error flushing connections: ' + err.message, 'error');
}
}
  </script>
</body>
</html>


@@ -1,128 +0,0 @@
#!/usr/bin/env python3
"""
Test script to verify that sleep mode is automatically disabled when:
1. Device configuration is created/updated with TCP enabled
2. Measurements are started

This script tests the API endpoints, not the actual device communication.
"""

import requests
import json

BASE_URL = "http://localhost:8100/api/nl43"
UNIT_ID = "test-nl43-001"


def test_config_update():
    """Test that config update works (actual sleep mode disable requires real device)"""
    print("\n=== Testing Config Update ===")

    # Create/update a device config
    config_data = {
        "host": "192.168.1.100",
        "tcp_port": 2255,
        "tcp_enabled": True,
        "ftp_enabled": False,
        "ftp_username": "admin",
        "ftp_password": "password"
    }

    print(f"Updating config for {UNIT_ID}...")
    response = requests.put(f"{BASE_URL}/{UNIT_ID}/config", json=config_data)

    if response.status_code == 200:
        print("✓ Config updated successfully")
        print(f"Response: {json.dumps(response.json(), indent=2)}")
        print("\nNote: Sleep mode disable was attempted (will succeed if device is reachable)")
        return True
    else:
        print(f"✗ Config update failed: {response.status_code}")
        print(f"Error: {response.text}")
        return False


def test_get_config():
    """Test retrieving the config"""
    print("\n=== Testing Get Config ===")

    response = requests.get(f"{BASE_URL}/{UNIT_ID}/config")

    if response.status_code == 200:
        print("✓ Config retrieved successfully")
        print(f"Response: {json.dumps(response.json(), indent=2)}")
        return True
    elif response.status_code == 404:
        print("✗ Config not found (create one first)")
        return False
    else:
        print(f"✗ Request failed: {response.status_code}")
        print(f"Error: {response.text}")
        return False


def test_start_measurement():
    """Test that start measurement attempts to disable sleep mode"""
    print("\n=== Testing Start Measurement ===")

    print(f"Attempting to start measurement on {UNIT_ID}...")
    response = requests.post(f"{BASE_URL}/{UNIT_ID}/start")

    if response.status_code == 200:
        print("✓ Start command accepted")
        print(f"Response: {json.dumps(response.json(), indent=2)}")
        print("\nNote: Sleep mode was disabled before starting measurement")
        return True
    elif response.status_code == 404:
        print("✗ Device config not found (create config first)")
        return False
    elif response.status_code == 502:
        print("✗ Device not reachable (expected if no physical device)")
        print(f"Response: {response.text}")
        print("\nNote: This is expected behavior when testing without a physical device")
        return True  # This is actually success - the endpoint tried to communicate
    else:
        print(f"✗ Request failed: {response.status_code}")
        print(f"Error: {response.text}")
        return False


def main():
    print("=" * 60)
    print("Sleep Mode Auto-Disable Test")
    print("=" * 60)
    print("\nThis test verifies that sleep mode is automatically disabled")
    print("when device configs are updated or measurements are started.")
    print("\nNote: Without a physical device, some operations will fail at")
    print("the device communication level, but the API logic will execute.")

    # Run tests
    results = []

    # Test 1: Update config (should attempt to disable sleep mode)
    results.append(("Config Update", test_config_update()))

    # Test 2: Get config
    results.append(("Get Config", test_get_config()))

    # Test 3: Start measurement (should attempt to disable sleep mode)
    results.append(("Start Measurement", test_start_measurement()))

    # Summary
    print("\n" + "=" * 60)
    print("Test Summary")
    print("=" * 60)
    for test_name, result in results:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"{status}: {test_name}")

    print("\n" + "=" * 60)
    print("Implementation Details:")
    print("=" * 60)
    print("1. Config endpoint is now async and calls ensure_sleep_mode_disabled()")
    print("   when TCP is enabled")
    print("2. Start measurement endpoint calls ensure_sleep_mode_disabled()")
    print("   before starting the measurement")
    print("3. Sleep mode check is non-blocking - config/start will succeed")
    print("   even if the device is unreachable")
    print("=" * 60)


if __name__ == "__main__":
    main()