# Runbook: Recovering a wedged unit stuck in a call-home loop

**Original incident:** BE9558H at `166.246.130.1:9034`, recovered 2026-05-17.

A field unit with a stuck-triggered geophone (or any hardware fault causing constant event triggering) will record events back-to-back, and if Auto Call Home is set to "After Event Recorded" the device will dial the office BW ACH server in a tight loop. Combined with a Sierra Wireless modem in bidirectional serial-TCP mode, this makes the unit effectively unreachable from SFM: every TCP connection we open gets killed when the modem flips from server-mode to client-mode to honor the device's next AT dial command.

This runbook describes how to break the loop and recover control.

---

## Symptoms

- Terra-View / SFM `/device/info` either hangs or fails on `count_events()`.
- `/device/monitor/status` and `/device/rescue` return 502 (protocol timeout waiting for POLL response) or 503 (TCP connect refused).
- ACEmanager serial log shows repeating `Connect to IP: Port: ` → `Shutdown TCP socket` cycles every 30-60 seconds.
- Spam-mode endpoints (`/device/stop_monitoring_spam`) report many `sent_ok` but the device's monitoring state never changes.
- `slow_drip` reports `[Errno 32] Broken pipe` after sending the preamble but before completing the drip loop.

If you see *all* of these, the unit is in this exact failure mode.

---

## Quick reference: how to recover

You need **ACEmanager access** to the unit's modem.

### Step 1: stop the modem's mode-flipping

In ACEmanager → **Serial → Port Configuration**:

| Field | Set to |
|---|---|
| **Destination Address** | clear (blank) |
| **Destination Port** | `0` |

Click **Apply**. This removes the modem's auto-dial-out target. The device's AT dial commands now error back at the modem instead of triggering a mode-flip, so the modem stays in TCP-server mode permanently and our inbound TCP sessions stay alive.
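
Before running the drip, it can help to confirm that an inbound TCP session now survives at all. The following is a minimal sketch, assuming only the Python standard library; `probe_session_hold` is a hypothetical helper, not part of seismo-relay. It opens one connection to the modem's device port, sends nothing, and reports how long the connection stays open. A refused connection raises an exception, which corresponds to the 503 case in the symptoms list.

```python
import socket
import time


def probe_session_hold(host: str, port: int, hold_s: float = 30.0,
                       connect_timeout: float = 5.0) -> dict:
    """Open one TCP connection, send nothing, and report how long it stays open."""
    result = {"held_s": 0.0, "closed_by_peer": False, "bytes_seen": 0}
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        start = time.monotonic()
        sock.settimeout(1.0)  # wake up once a second to check the clock
        while time.monotonic() - start < hold_s:
            try:
                chunk = sock.recv(4096)
            except socket.timeout:
                continue  # nothing to read, but the connection is still up
            except (ConnectionResetError, BrokenPipeError):
                result["closed_by_peer"] = True  # abrupt close from the far side
                break
            if chunk == b"":
                result["closed_by_peer"] = True  # orderly close from the far side
                break
            result["bytes_seen"] += len(chunk)
        result["held_s"] = round(time.monotonic() - start, 1)
    return result
```

Call it with the unit's modem address and device port. If `closed_by_peer` comes back true within the first minute, the modem is probably still mode-flipping. This is only a heuristic: a modem-side idle timeout, if one is configured, could also close a silent session.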

*(Optional belt-and-suspenders: also add the BW server's port to **Security → Port Filtering - Outbound** as a blocked port, with Outbound Port Filtering Mode = Blocked Ports.)*

### Step 2: stop monitoring on the device (slow drip)

From the SFM host:

```bash
/home/serversdown/seismo-relay/scripts/slow_drip.sh
```

Defaults are 120s duration with a drip every 3s. Watch the response:

- `duration_s ≈ 120` and `drips_sent ≈ 40` → session held the full duration ✓
- `bytes_received > 0` → device is responding ✓ (this is the success signal)

If `duration_s` is small or the response shows `send_error: "Broken pipe"`, Step 1 didn't take hold. Re-check ACEmanager; the modem may need a reboot after Apply.

### Step 3: confirm monitoring stopped

```bash
# fill in host and tcp_port with the unit's modem address and device port
curl 'http://localhost:8200/device/monitor/status?host=&tcp_port=&force=true'
# expect: {"is_monitoring": false, ...}
```

### Step 4: disable ACH at the device level + erase corrupted events

Either fire the rescue endpoint:

```bash
/home/serversdown/seismo-relay/scripts/rescue_device.sh
```

Or do the two steps manually:

```bash
# Disable ACH in the device's compliance config
curl -X POST 'http://localhost:8200/device/call_home?host=&tcp_port=' \
  -H 'Content-Type: application/json' \
  -d '{"auto_call_home_enabled": false}'

# Erase corrupted event chain
curl -X POST 'http://localhost:8200/device/events/erase?host=&tcp_port='
```

You can also do this via the SFM standalone UI → **Call Home** tab → set `Enable Auto Call Home` to `Disabled` → **Write to Device**.

### Step 5: restore modem config (housekeeping)

Once the device-side ACH is disabled, restore the modem's Destination Address and Port to the original values (e.g. `50.197.32.92` / `12345`) in ACEmanager. The modem will resume normal bidirectional behavior, but the unit won't issue any dial commands until ACH is explicitly re-enabled on the device.

### Step 6: do NOT re-enable ACH on this unit until the underlying hardware fault is repaired.
If you do, the call-home loop starts again immediately and you'll be running this runbook a second time.

---

## Why this works: the failure mode explained

The Sierra Wireless RV50/RV55 serial port operates in one of two TCP modes at any moment:

- **Server mode**: listens on `Device Port` (e.g. 9034) and bridges inbound TCP to the device's serial port. This is what we need to interact with the device.
- **Client mode**: when the device sends an AT dial command on its serial TX line, the modem opens an outbound TCP connection to `Destination Address:Port` and bridges that to serial.

A serial port in this configuration is **bidirectional**: the modem flips between server and client modes on demand. When the device's firmware is healthy and only dials occasionally, this works fine.

When the unit is constantly triggering events and ACH is set to "After Event Recorded", the device sends an AT dial command every few seconds. Each one causes the modem to:

1. Drop any active inbound TCP session
2. Flip to client mode
3. Attempt outbound TCP to `Destination Address:Port`
4. Hang for up to a minute waiting for it to succeed/fail
5. Drop back to server mode

**During the entire hang, no inbound TCP can establish.** Even between hangs, the modem closes any existing inbound session before flipping. So any tool that needs more than a few seconds of held TCP (e.g. POLL + config read + write) gets repeatedly kicked off.

Clearing `Destination Address` removes the dial target that steps 3-4 depend on: the modem has nowhere to dial, so it doesn't flip modes when it receives an AT dial command and the cycle never starts. The serial port effectively becomes server-only, and inbound TCP sessions can stay open as long as needed.

**This is a modem-layer issue, not a device firmware issue.** The device is alive and responsive the whole time, as confirmed in the BE9558H recovery by 990 bytes of S3 responses received over a 120s slow-drip session once the modem was no longer mode-flipping.

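
The cycle described in this section can be sketched as a toy state machine. This is an illustrative model of the behavior as this runbook describes it, not Sierra Wireless firmware logic; `SerialPortModel`, `session_survival_s`, and the dial times are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class SerialPortModel:
    """Toy model of the modem serial port as described in this runbook."""
    destination: Optional[Tuple[str, int]]  # None models a cleared Destination Address/Port
    mode: str = "server"
    inbound_open: bool = False

    def accept_inbound(self) -> bool:
        """An inbound TCP session can only be accepted in server mode."""
        if self.mode != "server":
            return False
        self.inbound_open = True
        return True

    def on_at_dial(self) -> bool:
        """Handle an AT dial command; return True if the port flipped modes."""
        if self.destination is None:
            return False  # nowhere to dial: stay in server mode, keep the session
        self.inbound_open = False  # step 1: drop the inbound session
        self.mode = "client"       # step 2: flip to client mode
        return True

    def on_dial_finished(self) -> None:
        """Outbound attempt succeeded or failed: drop back to server mode (step 5)."""
        self.mode = "server"


def session_survival_s(port: SerialPortModel, session_s: float,
                       dial_times_s: Iterable[float]) -> float:
    """Open a session at t=0 and return how many seconds it lasts."""
    if not port.accept_inbound():
        return 0.0
    for t in sorted(dial_times_s):
        if t >= session_s:
            break
        if port.on_at_dial():
            return float(t)  # session killed by the mode flip
    return float(session_s)


dials = [3, 33, 63, 93]  # illustrative dial times in seconds, not measured values
print(session_survival_s(SerialPortModel(("50.197.32.92", 12345)), 120, dials))  # 3.0
print(session_survival_s(SerialPortModel(None), 120, dials))                     # 120.0
```

With a destination configured, the first dial command ends the session; with the destination cleared, the same dial commands leave it untouched for the full 120 seconds.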

---

## Why simpler approaches don't work

| Approach | Why it fails |
|---|---|
| Standard `/device/info` | Triggers the `count_events()` 1E/1F walk, which takes 90s+ and hits the corrupted event chain in this scenario |
| `/device/rescue` race loop | Gets 502 (protocol timeout) because the modem closes the TCP session before the POLL handshake can complete |
| `/device/stop_monitoring_blind` (single frame) | Even if the bytes leave the wire, the device's protocol parser ignores write commands without a preceding POLL handshake (early-version bug, now fixed by including the POLL preamble in blind sends) |
| `/device/stop_monitoring_spam` (sub-second cadence) | Each session is killed by the modem's mode-flip before the device can drain its UART RX buffer; high-rate spam also risks UART FIFO overrun on the device side |
| Outbound port firewall block alone | Stops the outbound TCP from succeeding, but doesn't stop the modem from *trying* and mode-flipping. Reduces but doesn't eliminate the contention. |
| Modem reboot | Temporary: as soon as the device starts triggering again, the loop resumes within seconds |

The combination of `slow_drip` + cleared `Destination Address` works because:

1. The modem stops mode-flipping → TCP session stays open for the full drip duration
2. Slow drip rate → device's UART RX FIFO never overflows even if firmware is busy with event recording
3. The drip is `SESSION_RESET + STOP_MONITORING` every 3s → many independent chances for the parser to land one valid frame
4. Once one Stop Monitoring is parsed, event recording halts → firmware has CPU to spare → subsequent operations are trivially easy

---

## Tooling reference

All endpoints live in `seismo-relay/sfm/server.py`. All scripts live in `seismo-relay/scripts/` and default to SFM direct (`http://localhost:8200`), overridable via `SFM_BASE_URL`.

### Endpoints added during BE9558H recovery

| Endpoint | Purpose |
|---|---|
| `GET /device/events/storage_range` | SUB 0x06: first/last event keys, `is_empty` flag. ~2s, no event walk. |
| `GET /device/events/index` | SUB 0x08: lifetime event counter (does NOT decrement on erase). ~2s. |
| `POST /device/events/erase` | Full erase sequence 0xA3 → 0x1C → 0x06 → 0xA2. |
| `POST /device/rescue` | Disable ACH + erase in one TCP session. Short timeouts for race-loop usage. |
| `POST /device/stop_monitoring_blind` | Fire-and-forget Stop with full POLL preamble (single attempt). |
| `POST /device/stop_monitoring_spam` | Server-side tight retry loop, sub-second cadence, duration-bounded. |
| `POST /device/stop_monitoring_slow_drip` | One held TCP session, slow trickle of stop frames. **The endpoint that saved BE9558H.** |

Also changed:

- Default protocol recv timeout dropped from 30s → 10s in `_build_client`.
- Added a `connect_timeout` knob to the same function.
- Cleaned up the unhandled-exception path in `/device/monitor/status` so it returns 502 instead of 500 on protocol timeouts.

### Scripts

| Script | Purpose |
|---|---|
| `scripts/rescue_device.sh` | Race-loop wrapper around `/device/rescue` |
| `scripts/blind_stop.sh` | Race-loop wrapper around `/device/stop_monitoring_blind` |
| `scripts/spam_stop.sh` | Single-call burst hammer (`/device/stop_monitoring_spam`) |
| `scripts/slow_drip.sh` | Single-call held-session drip (`/device/stop_monitoring_slow_drip`) |
| `scripts/watch_unit.sh` | Passive periodic reachability check, logs to file |

---

## Incident log: BE9558H, 2026-05-16/17

What was wrong: Long-axis geophone developed an offset, constantly above trigger threshold → constant event recording → after-event ACH set → modem dialing office BW server (`50.197.32.92:12345`) every 30-60s. Local event chain corrupted (`next_boundary 0x100EE exceeds uint16`).

Diagnostic path:

1. `/device/info` slow, choked on event walk
2. Built lightweight probe endpoints (`storage_range`, `index`): useful but didn't reach the wedged unit
3. Built `/device/rescue` with short timeouts: got 502 (POLL no response)
4. Built `/device/stop_monitoring_blind`: first version was a false positive (no POLL preamble); fixed by including `SESSION_RESET+POLL_PROBE+SESSION_RESET+POLL_DATA` in the dump
5. Verified blind stop works on bench unit
6. Built `/device/stop_monitoring_spam`: 420 successful sends over 5 min, zero behavior change on field unit
7. Inspected ACEmanager logs → saw outbound dial-out attempts every ~30s, confirmed device was not fully locked up
8. Added outbound port-12345 firewall block → outbound attempts now fail instantly but contention persisted
9. Built `/device/stop_monitoring_slow_drip`: session died at 3s with broken pipe (modem closing on us)
10. Looked at full ACEmanager Port Configuration → **found `Destination Address: 50.197.32.92` configured**, realized every AT dial command was triggering a modem mode-flip that killed our inbound session
11. Cleared Destination Address + Port → slow_drip held 120s, device responded with 990 bytes, 39 stop commands acked
12. Disabled ACH at device level via `/device/call_home`, erased events

Final state: device IDLE, memory 958.1 / 960 KB free, ACH disabled at device level, modem destination cleared (to be restored after physical service).

Total time from first attempt to recovery: ~7 hours of intermittent debugging across one evening.
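
The Step 2 success criteria, which item 11 of the diagnostic path finally met, can also be checked mechanically. The sketch below is a hypothetical helper, not part of seismo-relay. It assumes the slow-drip response is a flat JSON object carrying the fields named in Step 2 (`duration_s`, `drips_sent`, `bytes_received`, `send_error`); the 10% tolerance is an arbitrary choice for illustration.

```python
def assess_slow_drip(resp: dict, duration_s: float = 120.0,
                     interval_s: float = 3.0, tolerance: float = 0.1) -> list:
    """Return a list of problems; an empty list means the run looks successful."""
    expected_drips = duration_s / interval_s  # 120 / 3 = 40 with the defaults
    problems = []
    if resp.get("send_error"):
        problems.append(f"send_error: {resp['send_error']}")
    if resp.get("duration_s", 0) < duration_s * (1 - tolerance):
        problems.append("session ended early")
    if resp.get("drips_sent", 0) < expected_drips * (1 - tolerance):
        problems.append("fewer drips than expected")
    if resp.get("bytes_received", 0) <= 0:
        problems.append("no bytes received from device")
    return problems
```

Any non-empty result points back to Step 1: re-check the modem's Port Configuration before retrying the drip.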