diff --git a/SLM-stress-test/NL43_RX55_TCP_Wedge_Investigation_2026-02-18.md b/SLM-stress-test/NL43_RX55_TCP_Wedge_Investigation_2026-02-18.md new file mode 100644 index 0000000..24a4d22 --- /dev/null +++ b/SLM-stress-test/NL43_RX55_TCP_Wedge_Investigation_2026-02-18.md @@ -0,0 +1,403 @@ +# NL-43 + RX55 TCP “Wedge” Investigation (2255 Refusal) — Full Log & Next Steps +**Last updated:** 2026-02-18 +**Owner:** Brian / serversdown +**Context:** Terra-View / SLMM / field-deployed Rion NL-43 behind Sierra Wireless RX55 + +--- + +## 0) What this document is +This is a **comprehensive, chronological** record of the debugging we did to isolate a failure where the **NL-43’s TCP control port (2255) eventually stops accepting connections** (“wedges”), while other services (notably FTP/21) remain reachable. + +This is written to be fed back into future troubleshooting, so it intentionally includes the **full reasoning chain, experiments, commands, packet evidence, and conclusions**. + +--- + +## 1) Architecture (as tested) +### Network path +- **Server (SLMM host):** `10.0.0.40` +- **RX55 WAN IP:** `63.45.161.30` +- **RX55 LAN subnet:** `192.168.1.0/24` +- **RX55 LAN gateway:** `192.168.1.1` +- **NL-43 LAN IP:** `192.168.1.10` (confirmed via ARP OUI + ping; see LAN validation) + +### RX55 details +- **Sierra Wireless RX55** +- **OS:** 5.2 +- **Firmware:** `01.14.24.00` +- **Carrier:** Verizon LTE (Band 66) + +### Port forwarding rules (RX55) +- **WAN:2255 → NL-43:2255** (NL-43 TCP control) +- **WAN:21 → NL-43:21** (NL-43 FTP control) + +You also experimented with additional forwards: +- **WAN:2253 → NL-43:2255** (test) +- **WAN:2253 → NL-43:2253** (test) +- **WAN:4450 → NL-43:4450** (test) + +**Important:** Rule “Input zone / interface” was set to **WAN-NAT**, and Source IP left as **Any IPv4**. This is correct for inbound port-forward behavior on Sierra OS 5.x. + +--- + +## 2) Original problem statement (the “wedge”) +After running for hours, the NL-43 becomes unreachable over TCP control. + +### Symptom signature (WAN-side) +- Client attempts to connect to `63.45.161.30:2255` +- Instead of timing out, the client gets **connection refused** quickly. +- Packet-level: SYN from client → **RST,ACK** back (meaning active refusal vs silent drop) + +### Critical operational behavior +- **Power cycling the NL-43 fixes it.** +- **Power cycling the RX55 does NOT fix it.** +- FTP sometimes remains available even while TCP control (2255) is dead. + +This combination is what forced us to determine whether: +- The RX55 is rejecting connections, OR +- The NL-43 is no longer listening on 2255, OR +- Something about the RX55 path triggers the NL-43’s control listener to die. + +--- + +## 3) Event timeline evidence (SLMM logs) +A concrete wedge window was observed on **2026-02-18**: + +- 10:55:46 AM — Poll success (Start) +- 11:00:28 AM — Measurement STOPPED (scheduled stop/download cycle succeeded) +- 11:55:50 AM — Poll success (Stop) +- 12:55:55 PM — Poll success (Stop) +- **1:55:58 PM — Poll failed (attempt 1/3): Errno 111 (connection refused)** +- 2:56:02 PM — Poll failed (attempt 2/3): Errno 111 (connection refused) + +Key interpretation: +- The wedge occurred sometime between **12:55 and 1:55**. +- The failure type is **refused**, not timeout. + +--- + +## 4) Early hypotheses (before proof) +We considered two main buckets: + +### A) NL-43-side failure (most suspicious) +- NL-43 TCP control service crashes / exits / unbinds from 2255 +- socket leak / accept backlog exhaustion +- “single control session allowed” and it gets stuck thinking a session is active +- mode/service manager bug (service restart fails after other activities) +- firmware bug in TCP daemon + +### B) RX55-side failure (possible trigger / less likely once FTP works) +- NAT/forwarding table corruption +- firewall behavior +- helper/ALG interference +- MSS/MTU weirdness causing edge-case behavior +- session churn behavior causing downstream issues + +--- + +## 5) Key experiments and what they proved + +### 5.1) LAN-only stability test (No RX55 path) +**Test:** NL-43 tested directly on LAN (no modem path involved). +- Ran **24+ hours** +- Scheduler start/stop cycles worked +- Stress test: **500 commands @ 1/sec** → no failure +- Response time trend decreased (not degrading) + +**Result:** The NL-43 appears stable in a “pure LAN” environment. + +**Interpretation:** The trigger is likely related to the RX55/WAN environment, connection patterns, or service switching patterns—not just simple uptime. + +--- + +### 5.2) Port-forward behavior: timeout vs refused (RX55 behavior characterization) +You observed: + +- **If a WAN port is NOT forwarded (no rule):** connecting to that port **times out** (silent drop) +- **If a WAN port IS forwarded to NL-43 but nothing listens:** it **actively refuses** (RST) + +Concrete example: +- Port **4450** with no rule → timeout +- Port **4450 → NL-43:4450** rule created → connection refused + +**Interpretation:** This confirms the RX55 is actually forwarding packets to the NL-43 when a rule exists. “Refused” is consistent with the NL-43 (or RX55 relay behavior) responding quickly because the packet reached the target. + +Important nuance: +- A “refused” on forwarded ports does **not** automatically prove the NL-43 is the one generating RST, because NAT hides the inside host and the RX55 could reject on behalf of an unreachable target. We needed a LAN-side proof test to close the loop. + +--- + +### 5.3) UDP test confusion (and resolution) +You ran: + +```bash +nc -vzu 63.45.161.30 2255 +nc -vz 63.45.161.30 2255 +``` + +Observed: +- UDP: “succeeded” +- TCP: “connection refused” + +Resolution: +- UDP has **no handshake**. netcat prints “succeeded” if it doesn’t immediately receive an ICMP unreachable. It does **not** mean a UDP service exists. +- TCP refused is meaningful: a RST implies “no listener” or “actively rejected.” + +**Net effect:** UDP test did not change the diagnosis. + +--- + +### 5.4) Packet capture proof (WAN-side) +You captured a Wireshark/tcpdump summary with these key patterns: + +#### Port 2255 (TCP control) +Example: +- `10.0.0.40 → 63.45.161.30:2255` SYN +- `63.45.161.30 → 10.0.0.40` **RST, ACK** within ~50ms + +This happened repeatedly. + +#### Port 2253 (test port) +Multiple SYN attempts to 2253 showed **retransmissions and no response**, i.e., **silent drop** (consistent with no rule or not forwarded at that moment). + +#### Port 21 (FTP) +Clean 3-way handshake: +- SYN → SYN/ACK → ACK +Then: +- FTP server banner: `220 Connection Ready` +Then: +- `530 Not logged in` (because SLMM was sending non-FTP “requests” as an experiment) +Session closes cleanly. + +**Key takeaway from capture:** +- TCP transport to NL-43 via RX55 is definitely working (port 21 proves it). +- Port 2255 is being actively refused. + +This strongly suggested “2255 listener is gone,” but still didn’t fully prove whether the refusal was generated internally by NL-43 or by RX55 on behalf of NL-43. + +--- + +## 6) The decisive experiment: LAN-side test while wedged (final proof) +Because the RX55 does not offer SSH, the plan was to test from **inside the LAN behind the RX55**. + +### 6.1) Physical LAN tap setup +Constraint: +- NL-43 has only one Ethernet port. + +Solution: +- Insert an unmanaged switch: + - RX55 LAN → switch + - NL-43 → switch + - Windows 10 laptop → switch + +This creates a shared L2 segment where the laptop can test NL-43 directly. + +### 6.2) Windows LAN validation +On the Windows laptop: + +- `ipconfig` showed: + - IP: `192.168.1.100` + - Gateway: `192.168.1.1` (RX55) +- Initial `arp -a` only showed RX55, not NL-43. + +You then: +- pinged likely host addresses and discovered NL-43 responds on **192.168.1.10** +- `arp -a` then showed: + - `192.168.1.10 → 00-10-50-14-0a-d8` + - OUI `00-10-50` recognized as **Rion** (matches NL-43) + +So LAN identities were confirmed: +- RX55: `192.168.1.1` +- NL-43: `192.168.1.10` + +### 6.3) The LAN port tests (the smoking gun) +From Windows: + +```powershell +Test-NetConnection -ComputerName 192.168.1.10 -Port 2255 +Test-NetConnection -ComputerName 192.168.1.10 -Port 21 +``` + +Results (while the unit was “wedged” from the WAN perspective): +- **2255:** `TcpTestSucceeded : False` +- **21:** `TcpTestSucceeded : True` + +**Conclusion (PROVEN):** +- The NL-43 is reachable on the LAN +- FTP port 21 is alive +- **The NL-43 is NOT listening on TCP port 2255** +- Therefore the RX55 is not the root cause of the refusal. The WAN refusal is consistent with the NL-43 having no listener on 2255. + +This is now settled. + +--- + +## 7) What we learned (final conclusions) +### 7.1) RX55 innocence (for this failure mode) +The RX55 is not “randomly rejecting” or “breaking TCP” in the way originally feared. + +It successfully forwards and supports TCP to the NL-43 on port 21, and the LAN-side test proves the 2255 failure exists *even without NAT/WAN involvement*. + +### 7.2) NL-43 control listener failure +The NL-43’s TCP control service (port 2255) stops listening while: +- the device remains alive +- the LAN stack remains alive (ping) +- FTP remains alive (port 21) + +This looks like one of: +- control daemon crash/exit +- service unbind +- stuck service state (e.g., “busy” / “session active forever”) +- resource leak (sockets/file descriptors) specific to the control service +- firmware service manager bug (start/stop of services fails after certain sequences) + +--- + +## 8) Additional constraint discovered: “Web App mode” conflicts +You noted an important operational constraint: + +> Turning on the web app disables other interfaces like TCP and FTP. + +Meaning the NL-43 appears to have mutually exclusive service/mode behavior (or at least serious conflicts). That matters because: +- If any workflow toggles modes (explicitly or implicitly), it could destabilize the service lifecycle. +- It reduces the possibility of using “web UI toggle” as an easy remote recovery mechanism **if** it disables the services needed. + +We have not yet run a controlled long test to determine whether: +- mode switching contributes directly to the 2255 listener dying, OR +- it happens even in a pure TCP-only mode with no switching. + +--- + +## 9) Immediate operational decision (field tomorrow) +Because the device is needed in the field immediately, you chose: +- **Old-school manual deployment** +- **Manual SD card downloads** +- Avoid reliance on 2255/TCP control and remote workflows for now. + +**Important operational note:** +The 2255 listener dying does not necessarily stop the NL-43 from measuring; it primarily breaks remote control/polling. Manual SD workflow sidesteps the entire remote control dependency. + +--- + +## 10) What’s next (future work — when the unit is back) +Because long tests can’t be run before tomorrow, the plan is to resume in a few weeks with controlled experiments designed to isolate the trigger and develop an operational mitigation. + +### 10.1) Controlled experiment matrix (recommended) +Run each test for 24–72 hours, or until wedge occurs, and record: +- number of TCP connects +- whether connections are persistent +- whether FTP is used +- whether any mode toggling is performed +- time-to-wedge + +#### Test A — TCP-only (ideal baseline) +- TCP control only (2255) +- **True persistent connection** (open once, keep forever) +- No FTP +- No web mode toggling + +Outcome interpretation: +- If stable: connection churn and/or FTP/mode switching is the trigger. +- If wedges anyway: pure 2255 daemon leak/bug. + +#### Test B — TCP with connection churn +- Same as A but intentionally reconnect on a schedule (current SLMM behavior) +- No FTP + +Outcome: +- If this wedges but A doesn’t: churn is the trigger. + +#### Test C — FTP activity + TCP +- Introduce scheduled FTP sessions (downloads) while using TCP control +- Observe whether wedge correlates with FTP use or with post-download periods. + +Outcome: +- If wedge correlates with FTP, suspect internal service lifecycle conflict. + +#### Test D — Web mode interaction (only if safe/possible) +- Evaluate what toggling web mode does to TCP/FTP services. +- Determine if any remote-safe “soft reset” exists. + +--- + +## 11) Mitigation options (ranked) +### Option 1 — Make SLMM truly persistent (highest probability of success) +If the NL-43 wedges due to session churn or leaked socket states, the best mitigation is: +- Open one TCP socket per device +- Keep it open indefinitely +- Use OS keepalive +- Do **not** rotate connections on timers +- Reconnect only when the socket actually dies + +This reduces: +- connect/close cycles +- NAT edge-case exposure +- resource churn inside NL-43 + +### Option 2 — Service “soft reset” (if possible without disabling required services) +If there exists any way to restart the 2255 service without power cycling: +- LAN TCP toggle (if it doesn’t require web mode) +- any “restart comms” command (unknown) +- any maintenance menu sequence +then SLMM could: +- detect wedge +- trigger soft reset +- recover automatically + +Current constraint: web app mode appears to disable other services, so this may not be viable. + +### Option 3 — Hardware watchdog power cycle (industrial but reliable) +If this is a firmware bug with no clean workaround: +- Add a remotely controlled relay/power switch +- On wedge detection, power-cycle NL-43 automatically +- Optionally schedule a nightly power cycle to prevent leak accumulation + +This is “field reality” and often the only long-term move with embedded devices. + +### Option 4 — Vendor escalation (Rion) +You now have excellent evidence: +- LAN-side proof: 2255 dead while 21 alive +- WAN packet evidence +- clear isolation of RX55 innocence + +This is strong enough to send to Rion support as a firmware defect report. + +--- + +## 12) Repro “wedge bundle” checklist (for future captures) +When the wedge happens again, capture these before power cycling: + +1) From server: +- `nc -vz 63.45.161.30 2255` (expect refused) +- `nc -vz 63.45.161.30 21` (expect success if FTP alive) + +2) From LAN side (via switch/laptop): +- `Test-NetConnection 192.168.1.10 -Port 2255` +- `Test-NetConnection 192.168.1.10 -Port 21` + +3) Optional: packet capture around the refused attempt. + +4) Record: +- last successful poll timestamp +- last FTP session timestamp +- any scheduled start/stop/download cycles near wedge time +- SLMM connection reuse/rotation settings in effect + +--- + +## 13) Final, current-state summary (as of 2026-02-18) +- The issue is **NOT** the RX55 rejecting inbound connections. +- The NL-43 is **alive**, reachable on LAN, and FTP works. +- The NL-43’s **TCP control listener on 2255 stops listening** while the device remains otherwise healthy. +- The wedge can occur hours after successful operations. +- The unit is needed in the field immediately, so investigation pauses. +- Next phase: controlled tests to isolate trigger + implement mitigation (persistent socket or watchdog reset). + +--- + +## 14) Notes / misc observations +- The Wireshark trace showed repeated FTP sessions were opened and closed cleanly, but SLMM’s “FTP requests” were not valid FTP (causing `530 Not logged in`). That was part of experimentation, not a normal workflow. +- UDP “success” via netcat is not meaningful because UDP has no handshake; it simply indicates no ICMP unreachable was returned. + +--- + +**End of document.** diff --git a/app/services.py b/app/services.py index 860bffa..3e524f1 100644 --- a/app/services.py +++ b/app/services.py @@ -454,11 +454,11 @@ class ConnectionPool: """Check whether a cached connection is still usable.""" now = time.time() - # Age / idle checks - if now - conn.last_used_at > self._idle_ttl: + # Age / idle checks (value of -1 disables the check) + if self._idle_ttl >= 0 and now - conn.last_used_at > self._idle_ttl: logger.debug(f"Connection {conn.device_key} idle too long ({now - conn.last_used_at:.0f}s > {self._idle_ttl}s)") return False - if now - conn.created_at > self._max_age: + if self._max_age >= 0 and now - conn.created_at > self._max_age: logger.debug(f"Connection {conn.device_key} too old ({now - conn.created_at:.0f}s > {self._max_age}s)") return False