Compare commits

1 Commits

6 changed files with 5 additions and 437 deletions

1
.gitignore vendored
View File

@@ -1,6 +1,5 @@
/manuals/ /manuals/
/data/ /data/
/data-dev/
/SLM-stress-test/stress_test_logs/ /SLM-stress-test/stress_test_logs/
/SLM-stress-test/tcpdump-runs/ /SLM-stress-test/tcpdump-runs/

View File

@@ -1,403 +0,0 @@
# NL-43 + RX55 TCP “Wedge” Investigation (2255 Refusal) — Full Log & Next Steps
**Last updated:** 2026-02-18
**Owner:** Brian / serversdown
**Context:** Terra-View / SLMM / field-deployed Rion NL-43 behind Sierra Wireless RX55
---
## 0) What this document is
This is a **comprehensive, chronological** record of the debugging we did to isolate a failure where the **NL-43s TCP control port (2255) eventually stops accepting connections** (“wedges”), while other services (notably FTP/21) remain reachable.
This is written to be fed back into future troubleshooting, so it intentionally includes the **full reasoning chain, experiments, commands, packet evidence, and conclusions**.
---
## 1) Architecture (as tested)
### Network path
- **Server (SLMM host):** `10.0.0.40`
- **RX55 WAN IP:** `63.45.161.30`
- **RX55 LAN subnet:** `192.168.1.0/24`
- **RX55 LAN gateway:** `192.168.1.1`
- **NL-43 LAN IP:** `192.168.1.10` (confirmed via ARP OUI + ping; see LAN validation)
### RX55 details
- **Sierra Wireless RX55**
- **OS:** 5.2
- **Firmware:** `01.14.24.00`
- **Carrier:** Verizon LTE (Band 66)
### Port forwarding rules (RX55)
- **WAN:2255 → NL-43:2255** (NL-43 TCP control)
- **WAN:21 → NL-43:21** (NL-43 FTP control)
You also experimented with additional forwards:
- **WAN:2253 → NL-43:2255** (test)
- **WAN:2253 → NL-43:2253** (test)
- **WAN:4450 → NL-43:4450** (test)
**Important:** Rule “Input zone / interface” was set to **WAN-NAT**, and Source IP left as **Any IPv4**. This is correct for inbound port-forward behavior on Sierra OS 5.x.
---
## 2) Original problem statement (the “wedge”)
After running for hours, the NL-43 becomes unreachable over TCP control.
### Symptom signature (WAN-side)
- Client attempts to connect to `63.45.161.30:2255`
- Instead of timing out, the client gets **connection refused** quickly.
- Packet-level: SYN from client → **RST,ACK** back (meaning active refusal vs silent drop)
### Critical operational behavior
- **Power cycling the NL-43 fixes it.**
- **Power cycling the RX55 does NOT fix it.**
- FTP sometimes remains available even while TCP control (2255) is dead.
This combination is what forced us to determine whether:
- The RX55 is rejecting connections, OR
- The NL-43 is no longer listening on 2255, OR
- Something about the RX55 path triggers the NL-43s control listener to die.
---
## 3) Event timeline evidence (SLMM logs)
A concrete wedge window was observed on **2026-02-18**:
- 10:55:46 AM — Poll success (Start)
- 11:00:28 AM — Measurement STOPPED (scheduled stop/download cycle succeeded)
- 11:55:50 AM — Poll success (Stop)
- 12:55:55 PM — Poll success (Stop)
- **1:55:58 PM — Poll failed (attempt 1/3): Errno 111 (connection refused)**
- 2:56:02 PM — Poll failed (attempt 2/3): Errno 111 (connection refused)
Key interpretation:
- The wedge occurred sometime between **12:55 and 1:55**.
- The failure type is **refused**, not timeout.
---
## 4) Early hypotheses (before proof)
We considered two main buckets:
### A) NL-43-side failure (most suspicious)
- NL-43 TCP control service crashes / exits / unbinds from 2255
- socket leak / accept backlog exhaustion
- “single control session allowed” and it gets stuck thinking a session is active
- mode/service manager bug (service restart fails after other activities)
- firmware bug in TCP daemon
### B) RX55-side failure (possible trigger / less likely once FTP works)
- NAT/forwarding table corruption
- firewall behavior
- helper/ALG interference
- MSS/MTU weirdness causing edge-case behavior
- session churn behavior causing downstream issues
---
## 5) Key experiments and what they proved
### 5.1) LAN-only stability test (No RX55 path)
**Test:** NL-43 tested directly on LAN (no modem path involved).
- Ran **24+ hours**
- Scheduler start/stop cycles worked
- Stress test: **500 commands @ 1/sec** → no failure
- Response time trend decreased (not degrading)
**Result:** The NL-43 appears stable in a “pure LAN” environment.
**Interpretation:** The trigger is likely related to the RX55/WAN environment, connection patterns, or service switching patterns—not just simple uptime.
---
### 5.2) Port-forward behavior: timeout vs refused (RX55 behavior characterization)
You observed:
- **If a WAN port is NOT forwarded (no rule):** connecting to that port **times out** (silent drop)
- **If a WAN port IS forwarded to NL-43 but nothing listens:** it **actively refuses** (RST)
Concrete example:
- Port **4450** with no rule → timeout
- Port **4450 → NL-43:4450** rule created → connection refused
**Interpretation:** This confirms the RX55 is actually forwarding packets to the NL-43 when a rule exists. “Refused” is consistent with the NL-43 (or RX55 relay behavior) responding quickly because the packet reached the target.
Important nuance:
- A “refused” on forwarded ports does **not** automatically prove the NL-43 is the one generating RST, because NAT hides the inside host and the RX55 could reject on behalf of an unreachable target. We needed a LAN-side proof test to close the loop.
---
### 5.3) UDP test confusion (and resolution)
You ran:
```bash
nc -vzu 63.45.161.30 2255
nc -vz 63.45.161.30 2255
```
Observed:
- UDP: “succeeded”
- TCP: “connection refused”
Resolution:
- UDP has **no handshake**. netcat prints “succeeded” if it doesnt immediately receive an ICMP unreachable. It does **not** mean a UDP service exists.
- TCP refused is meaningful: a RST implies “no listener” or “actively rejected.”
**Net effect:** UDP test did not change the diagnosis.
---
### 5.4) Packet capture proof (WAN-side)
You captured a Wireshark/tcpdump summary with these key patterns:
#### Port 2255 (TCP control)
Example:
- `10.0.0.40 → 63.45.161.30:2255` SYN
- `63.45.161.30 → 10.0.0.40` **RST, ACK** within ~50ms
This happened repeatedly.
#### Port 2253 (test port)
Multiple SYN attempts to 2253 showed **retransmissions and no response**, i.e., **silent drop** (consistent with no rule or not forwarded at that moment).
#### Port 21 (FTP)
Clean 3-way handshake:
- SYN → SYN/ACK → ACK
Then:
- FTP server banner: `220 Connection Ready`
Then:
- `530 Not logged in` (because SLMM was sending non-FTP “requests” as an experiment)
Session closes cleanly.
**Key takeaway from capture:**
- TCP transport to NL-43 via RX55 is definitely working (port 21 proves it).
- Port 2255 is being actively refused.
This strongly suggested “2255 listener is gone,” but still didnt fully prove whether the refusal was generated internally by NL-43 or by RX55 on behalf of NL-43.
---
## 6) The decisive experiment: LAN-side test while wedged (final proof)
Because the RX55 does not offer SSH, the plan was to test from **inside the LAN behind the RX55**.
### 6.1) Physical LAN tap setup
Constraint:
- NL-43 has only one Ethernet port.
Solution:
- Insert an unmanaged switch:
- RX55 LAN → switch
- NL-43 → switch
- Windows 10 laptop → switch
This creates a shared L2 segment where the laptop can test NL-43 directly.
### 6.2) Windows LAN validation
On the Windows laptop:
- `ipconfig` showed:
- IP: `192.168.1.100`
- Gateway: `192.168.1.1` (RX55)
- Initial `arp -a` only showed RX55, not NL-43.
You then:
- pinged likely host addresses and discovered NL-43 responds on **192.168.1.10**
- `arp -a` then showed:
- `192.168.1.10 → 00-10-50-14-0a-d8`
- OUI `00-10-50` recognized as **Rion** (matches NL-43)
So LAN identities were confirmed:
- RX55: `192.168.1.1`
- NL-43: `192.168.1.10`
### 6.3) The LAN port tests (the smoking gun)
From Windows:
```powershell
Test-NetConnection -ComputerName 192.168.1.10 -Port 2255
Test-NetConnection -ComputerName 192.168.1.10 -Port 21
```
Results (while the unit was “wedged” from the WAN perspective):
- **2255:** `TcpTestSucceeded : False`
- **21:** `TcpTestSucceeded : True`
**Conclusion (PROVEN):**
- The NL-43 is reachable on the LAN
- FTP port 21 is alive
- **The NL-43 is NOT listening on TCP port 2255**
- Therefore the RX55 is not the root cause of the refusal. The WAN refusal is consistent with the NL-43 having no listener on 2255.
This is now settled.
---
## 7) What we learned (final conclusions)
### 7.1) RX55 innocence (for this failure mode)
The RX55 is not “randomly rejecting” or “breaking TCP” in the way originally feared.
It successfully forwards and supports TCP to the NL-43 on port 21, and the LAN-side test proves the 2255 failure exists *even without NAT/WAN involvement*.
### 7.2) NL-43 control listener failure
The NL-43s TCP control service (port 2255) stops listening while:
- the device remains alive
- the LAN stack remains alive (ping)
- FTP remains alive (port 21)
This looks like one of:
- control daemon crash/exit
- service unbind
- stuck service state (e.g., “busy” / “session active forever”)
- resource leak (sockets/file descriptors) specific to the control service
- firmware service manager bug (start/stop of services fails after certain sequences)
---
## 8) Additional constraint discovered: “Web App mode” conflicts
You noted an important operational constraint:
> Turning on the web app disables other interfaces like TCP and FTP.
Meaning the NL-43 appears to have mutually exclusive service/mode behavior (or at least serious conflicts). That matters because:
- If any workflow toggles modes (explicitly or implicitly), it could destabilize the service lifecycle.
- It reduces the possibility of using “web UI toggle” as an easy remote recovery mechanism **if** it disables the services needed.
We have not yet run a controlled long test to determine whether:
- mode switching contributes directly to the 2255 listener dying, OR
- it happens even in a pure TCP-only mode with no switching.
---
## 9) Immediate operational decision (field tomorrow)
Because the device is needed in the field immediately, you chose:
- **Old-school manual deployment**
- **Manual SD card downloads**
- Avoid reliance on 2255/TCP control and remote workflows for now.
**Important operational note:**
The 2255 listener dying does not necessarily stop the NL-43 from measuring; it primarily breaks remote control/polling. Manual SD workflow sidesteps the entire remote control dependency.
---
## 10) Whats next (future work — when the unit is back)
Because long tests cant be run before tomorrow, the plan is to resume in a few weeks with controlled experiments designed to isolate the trigger and develop an operational mitigation.
### 10.1) Controlled experiment matrix (recommended)
Run each test for 2472 hours, or until wedge occurs, and record:
- number of TCP connects
- whether connections are persistent
- whether FTP is used
- whether any mode toggling is performed
- time-to-wedge
#### Test A — TCP-only (ideal baseline)
- TCP control only (2255)
- **True persistent connection** (open once, keep forever)
- No FTP
- No web mode toggling
Outcome interpretation:
- If stable: connection churn and/or FTP/mode switching is the trigger.
- If wedges anyway: pure 2255 daemon leak/bug.
#### Test B — TCP with connection churn
- Same as A but intentionally reconnect on a schedule (current SLMM behavior)
- No FTP
Outcome:
- If this wedges but A doesnt: churn is the trigger.
#### Test C — FTP activity + TCP
- Introduce scheduled FTP sessions (downloads) while using TCP control
- Observe whether wedge correlates with FTP use or with post-download periods.
Outcome:
- If wedge correlates with FTP, suspect internal service lifecycle conflict.
#### Test D — Web mode interaction (only if safe/possible)
- Evaluate what toggling web mode does to TCP/FTP services.
- Determine if any remote-safe “soft reset” exists.
---
## 11) Mitigation options (ranked)
### Option 1 — Make SLMM truly persistent (highest probability of success)
If the NL-43 wedges due to session churn or leaked socket states, the best mitigation is:
- Open one TCP socket per device
- Keep it open indefinitely
- Use OS keepalive
- Do **not** rotate connections on timers
- Reconnect only when the socket actually dies
This reduces:
- connect/close cycles
- NAT edge-case exposure
- resource churn inside NL-43
### Option 2 — Service “soft reset” (if possible without disabling required services)
If there exists any way to restart the 2255 service without power cycling:
- LAN TCP toggle (if it doesnt require web mode)
- any “restart comms” command (unknown)
- any maintenance menu sequence
then SLMM could:
- detect wedge
- trigger soft reset
- recover automatically
Current constraint: web app mode appears to disable other services, so this may not be viable.
### Option 3 — Hardware watchdog power cycle (industrial but reliable)
If this is a firmware bug with no clean workaround:
- Add a remotely controlled relay/power switch
- On wedge detection, power-cycle NL-43 automatically
- Optionally schedule a nightly power cycle to prevent leak accumulation
This is “field reality” and often the only long-term move with embedded devices.
### Option 4 — Vendor escalation (Rion)
You now have excellent evidence:
- LAN-side proof: 2255 dead while 21 alive
- WAN packet evidence
- clear isolation of RX55 innocence
This is strong enough to send to Rion support as a firmware defect report.
---
## 12) Repro “wedge bundle” checklist (for future captures)
When the wedge happens again, capture these before power cycling:
1) From server:
- `nc -vz 63.45.161.30 2255` (expect refused)
- `nc -vz 63.45.161.30 21` (expect success if FTP alive)
2) From LAN side (via switch/laptop):
- `Test-NetConnection 192.168.1.10 -Port 2255`
- `Test-NetConnection 192.168.1.10 -Port 21`
3) Optional: packet capture around the refused attempt.
4) Record:
- last successful poll timestamp
- last FTP session timestamp
- any scheduled start/stop/download cycles near wedge time
- SLMM connection reuse/rotation settings in effect
---
## 13) Final, current-state summary (as of 2026-02-18)
- The issue is **NOT** the RX55 rejecting inbound connections.
- The NL-43 is **alive**, reachable on LAN, and FTP works.
- The NL-43s **TCP control listener on 2255 stops listening** while the device remains otherwise healthy.
- The wedge can occur hours after successful operations.
- The unit is needed in the field immediately, so investigation pauses.
- Next phase: controlled tests to isolate trigger + implement mitigation (persistent socket or watchdog reset).
---
## 14) Notes / misc observations
- The Wireshark trace showed repeated FTP sessions were opened and closed cleanly, but SLMMs “FTP requests” were not valid FTP (causing `530 Not logged in`). That was part of experimentation, not a normal workflow.
- UDP “success” via netcat is not meaningful because UDP has no handshake; it simply indicates no ICMP unreachable was returned.
---
**End of document.**

View File

@@ -38,7 +38,6 @@ class BackgroundPoller:
self._running = False self._running = False
self._logger = logger self._logger = logger
self._last_cleanup = None # Track last log cleanup time self._last_cleanup = None # Track last log cleanup time
self._last_pool_log = None # Track last connection pool heartbeat log
async def start(self): async def start(self):
"""Start the background polling task.""" """Start the background polling task."""
@@ -90,24 +89,6 @@ class BackgroundPoller:
except Exception as e: except Exception as e:
self._logger.warning(f"Log cleanup failed: {e}") self._logger.warning(f"Log cleanup failed: {e}")
# Log connection pool status every 15 minutes
try:
now = datetime.utcnow()
if self._last_pool_log is None or (now - self._last_pool_log).total_seconds() > 900:
from app.services import _connection_pool
stats = _connection_pool.get_stats()
conns = stats.get("connections", {})
if conns:
for key, c in conns.items():
self._logger.info(
f"[POOL] {key} — age={c['age_seconds']}s idle={c['idle_seconds']}s alive={c['alive']}"
)
else:
self._logger.info("[POOL] No active connections in pool")
self._last_pool_log = now
except Exception as e:
self._logger.warning(f"Pool status log failed: {e}")
# Calculate dynamic sleep interval # Calculate dynamic sleep interval
sleep_time = self._calculate_sleep_interval() sleep_time = self._calculate_sleep_interval()
self._logger.debug(f"Sleeping for {sleep_time} seconds until next poll cycle") self._logger.debug(f"Sleeping for {sleep_time} seconds until next poll cycle")

View File

@@ -338,14 +338,14 @@ class ConnectionPool:
if self._is_alive(conn): if self._is_alive(conn):
self._drain_buffer(conn.reader) self._drain_buffer(conn.reader)
conn.last_used_at = time.time() conn.last_used_at = time.time()
logger.info(f"Pool hit for {device_key} (age={time.time() - conn.created_at:.0f}s)") logger.debug(f"Pool hit for {device_key} (age={time.time() - conn.created_at:.0f}s)")
return conn.reader, conn.writer, True return conn.reader, conn.writer, True
else: else:
await self._close_connection(conn, reason="stale") await self._close_connection(conn, reason="stale")
# Open fresh connection # Open fresh connection
reader, writer = await self._open_connection(host, port, timeout) reader, writer = await self._open_connection(host, port, timeout)
logger.info(f"New connection opened for {device_key}") logger.debug(f"New connection opened for {device_key}")
return reader, writer, False return reader, writer, False
async def release(self, device_key: str, reader: asyncio.StreamReader, writer: asyncio.StreamWriter, host: str, port: int): async def release(self, device_key: str, reader: asyncio.StreamReader, writer: asyncio.StreamWriter, host: str, port: int):
@@ -454,11 +454,11 @@ class ConnectionPool:
"""Check whether a cached connection is still usable.""" """Check whether a cached connection is still usable."""
now = time.time() now = time.time()
# Age / idle checks (value of -1 disables the check) # Age / idle checks
if self._idle_ttl >= 0 and now - conn.last_used_at > self._idle_ttl: if now - conn.last_used_at > self._idle_ttl:
logger.debug(f"Connection {conn.device_key} idle too long ({now - conn.last_used_at:.0f}s > {self._idle_ttl}s)") logger.debug(f"Connection {conn.device_key} idle too long ({now - conn.last_used_at:.0f}s > {self._idle_ttl}s)")
return False return False
if self._max_age >= 0 and now - conn.created_at > self._max_age: if now - conn.created_at > self._max_age:
logger.debug(f"Connection {conn.device_key} too old ({now - conn.created_at:.0f}s > {self._max_age}s)") logger.debug(f"Connection {conn.device_key} too old ({now - conn.created_at:.0f}s > {self._max_age}s)")
return False return False

Binary file not shown.

View File

@@ -1,9 +0,0 @@
2026-03-12 21:29:42,683 - app.main - INFO - Database tables initialized
2026-03-12 21:29:42,684 - app.main - INFO - CORS allowed origins: ['*']
2026-03-12 21:29:42,703 - app.main - INFO - Starting TCP connection pool cleanup task...
2026-03-12 21:29:42,703 - app.services - INFO - Connection pool cleanup task started
2026-03-12 21:29:42,703 - app.main - INFO - Starting background poller...
2026-03-12 21:29:42,703 - app.background_poller - INFO - Background poller task created
2026-03-12 21:29:42,703 - app.main - INFO - Background poller started
2026-03-12 21:29:42,703 - app.background_poller - INFO - Background polling loop started
2026-03-12 21:29:42,708 - app.background_poller - INFO - [POOL] No active connections in pool