Merge pull request 'v0.3.0, persistent polling update. Persistent TCP connection pool with all features Connection pool diagnostics (API + UI) All 6 new environment variables Changes to health check, diagnostics, and DRD streaming Technical architecture details and cellular' (#2 ) from dev-persistent into main

Reviewed-on: #2
2026-02-16 21:57:37 -05:00
6 changed files with 5 additions and 437 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,6 +1,5 @@
 /manuals/
 /data/
 /data-dev/
 /SLM-stress-test/stress_test_logs/
 /SLM-stress-test/tcpdump-runs/
--- a/SLM-stress-test/NL43_RX55_TCP_Wedge_Investigation_2026-02-18.md
+++ b/SLM-stress-test/NL43_RX55_TCP_Wedge_Investigation_2026-02-18.md
@@ -1,403 +0,0 @@
 # NL-43 + RX55 TCP “Wedge” Investigation (2255 Refusal) — Full Log & Next Steps
 **Last updated:** 2026-02-18  
 **Owner:** Brian / serversdown  
 **Context:** Terra-View / SLMM / field-deployed Rion NL-43 behind Sierra Wireless RX55
 ---
 ## 0) What this document is
 This is a **comprehensive, chronological** record of the debugging we did to isolate a failure where the **NL-43’s TCP control port (2255) eventually stops accepting connections** (“wedges”), while other services (notably FTP/21) remain reachable.
 This is written to be fed back into future troubleshooting, so it intentionally includes the **full reasoning chain, experiments, commands, packet evidence, and conclusions**.
 ---
 ## 1) Architecture (as tested)
 ### Network path
 - **Server (SLMM host):** `10.0.0.40`
 - **RX55 WAN IP:** `63.45.161.30`
 - **RX55 LAN subnet:** `192.168.1.0/24`
 - **RX55 LAN gateway:** `192.168.1.1`
 - **NL-43 LAN IP:** `192.168.1.10` (confirmed via ARP OUI + ping; see LAN validation)
 ### RX55 details
 - **Sierra Wireless RX55**
 - **OS:** 5.2
 - **Firmware:** `01.14.24.00`
 - **Carrier:** Verizon LTE (Band 66)
 ### Port forwarding rules (RX55)
 - **WAN:2255 → NL-43:2255**  (NL-43 TCP control)
 - **WAN:21   → NL-43:21**    (NL-43 FTP control)
 You also experimented with additional forwards:
 - **WAN:2253 → NL-43:2255** (test)
 - **WAN:2253 → NL-43:2253** (test)
 - **WAN:4450 → NL-43:4450** (test)
 **Important:** Rule “Input zone / interface” was set to **WAN-NAT**, and Source IP left as **Any IPv4**. This is correct for inbound port-forward behavior on Sierra OS 5.x.
 ---
 ## 2) Original problem statement (the “wedge”)
 After running for hours, the NL-43 becomes unreachable over TCP control.
 ### Symptom signature (WAN-side)
 - Client attempts to connect to `63.45.161.30:2255`
 - Instead of timing out, the client gets **connection refused** quickly.
 - Packet-level: SYN from client → **RST,ACK** back (meaning active refusal vs silent drop)
 ### Critical operational behavior
 - **Power cycling the NL-43 fixes it.**
 - **Power cycling the RX55 does NOT fix it.**
 - FTP sometimes remains available even while TCP control (2255) is dead.
 This combination is what forced us to determine whether:
 - The RX55 is rejecting connections, OR
 - The NL-43 is no longer listening on 2255, OR
 - Something about the RX55 path triggers the NL-43’s control listener to die.
 ---
 ## 3) Event timeline evidence (SLMM logs)
 A concrete wedge window was observed on **2026-02-18**:
 - 10:55:46 AM — Poll success (Start)
 - 11:00:28 AM — Measurement STOPPED (scheduled stop/download cycle succeeded)
 - 11:55:50 AM — Poll success (Stop)
 - 12:55:55 PM — Poll success (Stop)
 - **1:55:58 PM — Poll failed (attempt 1/3): Errno 111 (connection refused)**
 - 2:56:02 PM — Poll failed (attempt 2/3): Errno 111 (connection refused)
 Key interpretation:
 - The wedge occurred sometime between **12:55 and 1:55**.
 - The failure type is **refused**, not timeout.
 ---
 ## 4) Early hypotheses (before proof)
 We considered two main buckets:
 ### A) NL-43-side failure (most suspicious)
 - NL-43 TCP control service crashes / exits / unbinds from 2255
 - socket leak / accept backlog exhaustion
 - “single control session allowed” and it gets stuck thinking a session is active
 - mode/service manager bug (service restart fails after other activities)
 - firmware bug in TCP daemon
 ### B) RX55-side failure (possible trigger / less likely once FTP works)
 - NAT/forwarding table corruption
 - firewall behavior
 - helper/ALG interference
 - MSS/MTU weirdness causing edge-case behavior
 - session churn behavior causing downstream issues
 ---
 ## 5) Key experiments and what they proved
 ### 5.1) LAN-only stability test (No RX55 path)
 **Test:** NL-43 tested directly on LAN (no modem path involved).
 - Ran **24+ hours**
 - Scheduler start/stop cycles worked
 - Stress test: **500 commands @ 1/sec** → no failure
 - Response time trend decreased (not degrading)
 **Result:** The NL-43 appears stable in a “pure LAN” environment.
 **Interpretation:** The trigger is likely related to the RX55/WAN environment, connection patterns, or service switching patterns—not just simple uptime.
 ---
 ### 5.2) Port-forward behavior: timeout vs refused (RX55 behavior characterization)
 You observed:
 - **If a WAN port is NOT forwarded (no rule):** connecting to that port **times out** (silent drop)
 - **If a WAN port IS forwarded to NL-43 but nothing listens:** it **actively refuses** (RST)
 Concrete example:
 - Port **4450** with no rule → timeout
 - Port **4450 → NL-43:4450** rule created → connection refused
 **Interpretation:** This confirms the RX55 is actually forwarding packets to the NL-43 when a rule exists. “Refused” is consistent with the NL-43 (or RX55 relay behavior) responding quickly because the packet reached the target.
 Important nuance:
 - A “refused” on forwarded ports does **not** automatically prove the NL-43 is the one generating RST, because NAT hides the inside host and the RX55 could reject on behalf of an unreachable target. We needed a LAN-side proof test to close the loop.
 ---
 ### 5.3) UDP test confusion (and resolution)
 You ran:
 ```bash
 nc -vzu 63.45.161.30 2255
 nc -vz  63.45.161.30 2255
 ```
 Observed:
 - UDP: “succeeded”
 - TCP: “connection refused”
 Resolution:
 - UDP has **no handshake**. netcat prints “succeeded” if it doesn’t immediately receive an ICMP unreachable. It does **not** mean a UDP service exists.
 - TCP refused is meaningful: a RST implies “no listener” or “actively rejected.”
 **Net effect:** UDP test did not change the diagnosis.
 ---
 ### 5.4) Packet capture proof (WAN-side)
 You captured a Wireshark/tcpdump summary with these key patterns:
 #### Port 2255 (TCP control)
 Example:
 - `10.0.0.40 → 63.45.161.30:2255` SYN
 - `63.45.161.30 → 10.0.0.40` **RST, ACK** within ~50ms
 This happened repeatedly.
 #### Port 2253 (test port)
 Multiple SYN attempts to 2253 showed **retransmissions and no response**, i.e., **silent drop** (consistent with no rule or not forwarded at that moment).
 #### Port 21 (FTP)
 Clean 3-way handshake:
 - SYN → SYN/ACK → ACK
 Then:
 - FTP server banner: `220 Connection Ready`
 Then:
 - `530 Not logged in` (because SLMM was sending non-FTP “requests” as an experiment)
 Session closes cleanly.
 **Key takeaway from capture:**
 - TCP transport to NL-43 via RX55 is definitely working (port 21 proves it).
 - Port 2255 is being actively refused.
 This strongly suggested “2255 listener is gone,” but still didn’t fully prove whether the refusal was generated internally by NL-43 or by RX55 on behalf of NL-43.
 ---
 ## 6) The decisive experiment: LAN-side test while wedged (final proof)
 Because the RX55 does not offer SSH, the plan was to test from **inside the LAN behind the RX55**.
 ### 6.1) Physical LAN tap setup
 Constraint:
 - NL-43 has only one Ethernet port.
 Solution:
 - Insert an unmanaged switch:
  - RX55 LAN → switch
  - NL-43 → switch
  - Windows 10 laptop → switch
 This creates a shared L2 segment where the laptop can test NL-43 directly.
 ### 6.2) Windows LAN validation
 On the Windows laptop:
 - `ipconfig` showed:
  - IP: `192.168.1.100`
  - Gateway: `192.168.1.1` (RX55)
 - Initial `arp -a` only showed RX55, not NL-43.
 You then:
 - pinged likely host addresses and discovered NL-43 responds on **192.168.1.10**
 - `arp -a` then showed:
  - `192.168.1.10 → 00-10-50-14-0a-d8`
  - OUI `00-10-50` recognized as **Rion** (matches NL-43)
 So LAN identities were confirmed:
 - RX55: `192.168.1.1`
 - NL-43: `192.168.1.10`
 ### 6.3) The LAN port tests (the smoking gun)
 From Windows:
 ```powershell
 Test-NetConnection -ComputerName 192.168.1.10 -Port 2255
 Test-NetConnection -ComputerName 192.168.1.10 -Port 21
 ```
 Results (while the unit was “wedged” from the WAN perspective):
 - **2255:** `TcpTestSucceeded : False`
 - **21:**   `TcpTestSucceeded : True`
 **Conclusion (PROVEN):**
 - The NL-43 is reachable on the LAN
 - FTP port 21 is alive
 - **The NL-43 is NOT listening on TCP port 2255**
 - Therefore the RX55 is not the root cause of the refusal. The WAN refusal is consistent with the NL-43 having no listener on 2255.
 This is now settled.
 ---
 ## 7) What we learned (final conclusions)
 ### 7.1) RX55 innocence (for this failure mode)
 The RX55 is not “randomly rejecting” or “breaking TCP” in the way originally feared.
 It successfully forwards and supports TCP to the NL-43 on port 21, and the LAN-side test proves the 2255 failure exists *even without NAT/WAN involvement*.
 ### 7.2) NL-43 control listener failure
 The NL-43’s TCP control service (port 2255) stops listening while:
 - the device remains alive
 - the LAN stack remains alive (ping)
 - FTP remains alive (port 21)
 This looks like one of:
 - control daemon crash/exit
 - service unbind
 - stuck service state (e.g., “busy” / “session active forever”)
 - resource leak (sockets/file descriptors) specific to the control service
 - firmware service manager bug (start/stop of services fails after certain sequences)
 ---
 ## 8) Additional constraint discovered: “Web App mode” conflicts
 You noted an important operational constraint:
 > Turning on the web app disables other interfaces like TCP and FTP.
 Meaning the NL-43 appears to have mutually exclusive service/mode behavior (or at least serious conflicts). That matters because:
 - If any workflow toggles modes (explicitly or implicitly), it could destabilize the service lifecycle.
 - It reduces the possibility of using “web UI toggle” as an easy remote recovery mechanism **if** it disables the services needed.
 We have not yet run a controlled long test to determine whether:
 - mode switching contributes directly to the 2255 listener dying, OR
 - it happens even in a pure TCP-only mode with no switching.
 ---
 ## 9) Immediate operational decision (field tomorrow)
 Because the device is needed in the field immediately, you chose:
 - **Old-school manual deployment**
 - **Manual SD card downloads**
 - Avoid reliance on 2255/TCP control and remote workflows for now.
 **Important operational note:**
 The 2255 listener dying does not necessarily stop the NL-43 from measuring; it primarily breaks remote control/polling. Manual SD workflow sidesteps the entire remote control dependency.
 ---
 ## 10) What’s next (future work — when the unit is back)
 Because long tests can’t be run before tomorrow, the plan is to resume in a few weeks with controlled experiments designed to isolate the trigger and develop an operational mitigation.
 ### 10.1) Controlled experiment matrix (recommended)
 Run each test for 24–72 hours, or until wedge occurs, and record:
 - number of TCP connects
 - whether connections are persistent
 - whether FTP is used
 - whether any mode toggling is performed
 - time-to-wedge
 #### Test A — TCP-only (ideal baseline)
 - TCP control only (2255)
 - **True persistent connection** (open once, keep forever)
 - No FTP
 - No web mode toggling
 Outcome interpretation:
 - If stable: connection churn and/or FTP/mode switching is the trigger.
 - If wedges anyway: pure 2255 daemon leak/bug.
 #### Test B — TCP with connection churn
 - Same as A but intentionally reconnect on a schedule (current SLMM behavior)
 - No FTP
 Outcome:
 - If this wedges but A doesn’t: churn is the trigger.
 #### Test C — FTP activity + TCP
 - Introduce scheduled FTP sessions (downloads) while using TCP control
 - Observe whether wedge correlates with FTP use or with post-download periods.
 Outcome:
 - If wedge correlates with FTP, suspect internal service lifecycle conflict.
 #### Test D — Web mode interaction (only if safe/possible)
 - Evaluate what toggling web mode does to TCP/FTP services.
 - Determine if any remote-safe “soft reset” exists.
 ---
 ## 11) Mitigation options (ranked)
 ### Option 1 — Make SLMM truly persistent (highest probability of success)
 If the NL-43 wedges due to session churn or leaked socket states, the best mitigation is:
 - Open one TCP socket per device
 - Keep it open indefinitely
 - Use OS keepalive
 - Do **not** rotate connections on timers
 - Reconnect only when the socket actually dies
 This reduces:
 - connect/close cycles
 - NAT edge-case exposure
 - resource churn inside NL-43
 ### Option 2 — Service “soft reset” (if possible without disabling required services)
 If there exists any way to restart the 2255 service without power cycling:
 - LAN TCP toggle (if it doesn’t require web mode)
 - any “restart comms” command (unknown)
 - any maintenance menu sequence
 then SLMM could:
 - detect wedge
 - trigger soft reset
 - recover automatically
 Current constraint: web app mode appears to disable other services, so this may not be viable.
 ### Option 3 — Hardware watchdog power cycle (industrial but reliable)
 If this is a firmware bug with no clean workaround:
 - Add a remotely controlled relay/power switch
 - On wedge detection, power-cycle NL-43 automatically
 - Optionally schedule a nightly power cycle to prevent leak accumulation
 This is “field reality” and often the only long-term move with embedded devices.
 ### Option 4 — Vendor escalation (Rion)
 You now have excellent evidence:
 - LAN-side proof: 2255 dead while 21 alive
 - WAN packet evidence
 - clear isolation of RX55 innocence
 This is strong enough to send to Rion support as a firmware defect report.
 ---
 ## 12) Repro “wedge bundle” checklist (for future captures)
 When the wedge happens again, capture these before power cycling:
 1) From server:
 - `nc -vz 63.45.161.30 2255` (expect refused)
 - `nc -vz 63.45.161.30 21`   (expect success if FTP alive)
 2) From LAN side (via switch/laptop):
 - `Test-NetConnection 192.168.1.10 -Port 2255`
 - `Test-NetConnection 192.168.1.10 -Port 21`
 3) Optional: packet capture around the refused attempt.
 4) Record:
 - last successful poll timestamp
 - last FTP session timestamp
 - any scheduled start/stop/download cycles near wedge time
 - SLMM connection reuse/rotation settings in effect
 ---
 ## 13) Final, current-state summary (as of 2026-02-18)
 - The issue is **NOT** the RX55 rejecting inbound connections.
 - The NL-43 is **alive**, reachable on LAN, and FTP works.
 - The NL-43’s **TCP control listener on 2255 stops listening** while the device remains otherwise healthy.
 - The wedge can occur hours after successful operations.
 - The unit is needed in the field immediately, so investigation pauses.
 - Next phase: controlled tests to isolate trigger + implement mitigation (persistent socket or watchdog reset).
 ---
 ## 14) Notes / misc observations
 - The Wireshark trace showed repeated FTP sessions were opened and closed cleanly, but SLMM’s “FTP requests” were not valid FTP (causing `530 Not logged in`). That was part of experimentation, not a normal workflow.
 - UDP “success” via netcat is not meaningful because UDP has no handshake; it simply indicates no ICMP unreachable was returned.
 ---
 **End of document.**
--- a/app/background_poller.py
+++ b/app/background_poller.py
@@ -38,7 +38,6 @@ class BackgroundPoller:
        self._running = False
        self._logger = logger
        self._last_cleanup = None  # Track last log cleanup time
        self._last_pool_log = None  # Track last connection pool heartbeat log
    async def start(self):
        """Start the background polling task."""
@@ -90,24 +89,6 @@ class BackgroundPoller:
            except Exception as e:
                self._logger.warning(f"Log cleanup failed: {e}")
            # Log connection pool status every 15 minutes
            try:
                now = datetime.utcnow()
                if self._last_pool_log is None or (now - self._last_pool_log).total_seconds() > 900:
                    from app.services import _connection_pool
                    stats = _connection_pool.get_stats()
                    conns = stats.get("connections", {})
                    if conns:
                        for key, c in conns.items():
                            self._logger.info(
                                f"[POOL] {key} — age={c['age_seconds']}s idle={c['idle_seconds']}s alive={c['alive']}"
                            )
                    else:
                        self._logger.info("[POOL] No active connections in pool")
                    self._last_pool_log = now
            except Exception as e:
                self._logger.warning(f"Pool status log failed: {e}")
            # Calculate dynamic sleep interval
            sleep_time = self._calculate_sleep_interval()
            self._logger.debug(f"Sleeping for {sleep_time} seconds until next poll cycle")
--- a/app/services.py
+++ b/app/services.py
@@ -338,14 +338,14 @@ class ConnectionPool:
                if self._is_alive(conn):
                    self._drain_buffer(conn.reader)
                    conn.last_used_at = time.time()
-                    logger.info(f"Pool hit for {device_key} (age={time.time() - conn.created_at:.0f}s)")
+                    logger.debug(f"Pool hit for {device_key} (age={time.time() - conn.created_at:.0f}s)")
                    return conn.reader, conn.writer, True
                else:
                    await self._close_connection(conn, reason="stale")
        # Open fresh connection
        reader, writer = await self._open_connection(host, port, timeout)
-        logger.info(f"New connection opened for {device_key}")
+        logger.debug(f"New connection opened for {device_key}")
        return reader, writer, False
    async def release(self, device_key: str, reader: asyncio.StreamReader, writer: asyncio.StreamWriter, host: str, port: int):
@@ -454,11 +454,11 @@ class ConnectionPool:
        """Check whether a cached connection is still usable."""
        now = time.time()
-        # Age / idle checks (value of -1 disables the check)
+        # Age / idle checks
-        if self._idle_ttl >= 0 and now - conn.last_used_at > self._idle_ttl:
+        if now - conn.last_used_at > self._idle_ttl:
            logger.debug(f"Connection {conn.device_key} idle too long ({now - conn.last_used_at:.0f}s > {self._idle_ttl}s)")
            return False
-        if self._max_age >= 0 and now - conn.created_at > self._max_age:
+        if now - conn.created_at > self._max_age:
            logger.debug(f"Connection {conn.device_key} too old ({now - conn.created_at:.0f}s > {self._max_age}s)")
            return False
--- a/data-dev/slmm.db
+++ b/data-dev/slmm.db
--- a/data-dev/slmm.log
+++ b/data-dev/slmm.log
@@ -1,9 +0,0 @@
 2026-03-12 21:29:42,683 - app.main - INFO - Database tables initialized
 2026-03-12 21:29:42,684 - app.main - INFO - CORS allowed origins: ['*']
 2026-03-12 21:29:42,703 - app.main - INFO - Starting TCP connection pool cleanup task...
 2026-03-12 21:29:42,703 - app.services - INFO - Connection pool cleanup task started
 2026-03-12 21:29:42,703 - app.main - INFO - Starting background poller...
 2026-03-12 21:29:42,703 - app.background_poller - INFO - Background poller task created
 2026-03-12 21:29:42,703 - app.main - INFO - Background poller started
 2026-03-12 21:29:42,703 - app.background_poller - INFO - Background polling loop started
 2026-03-12 21:29:42,708 - app.background_poller - INFO - [POOL] No active connections in pool