# Runbook — Recovering a wedged unit stuck in a call-home loop

**Original incident:** BE9558H at `166.246.130.1:9034`, recovered 2026-05-17.

A field unit with a stuck-triggered geophone (or any hardware fault causing
constant event triggering) will record events back-to-back, and if Auto Call
Home is set to "After Event Recorded" the device will dial the office BW
ACH server in a tight loop. Combined with a Sierra Wireless modem in
bidirectional serial-TCP mode, this makes the unit effectively unreachable
from SFM — every TCP connection we open gets killed when the modem flips
from server-mode to client-mode to honor the device's next AT dial command.

This runbook describes how to break the loop and recover control.

---

## Symptoms

- Terra-View / SFM `/device/info` either hangs or fails on `count_events()`.
- `/device/monitor/status` and `/device/rescue` return 502 (protocol timeout
  waiting for POLL response) or 503 (TCP connect refused).
- ACEmanager serial log shows repeating
  `Connect to IP: <BW_IP> Port: <BW_PORT>` → `Shutdown TCP socket` cycles
  every 30-60 seconds.
- Spam-mode endpoints (`/device/stop_monitoring_spam`) report many
  `sent_ok` but the device's monitoring state never changes.
- `slow_drip` reports `[Errno 32] Broken pipe` after sending the preamble
  but before completing the drip loop.

If you see *all* of these, the unit is in this exact failure mode.

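The 502/503 distinction in the list above can be checked from the shell. The following is a minimal sketch, not part of the repository's scripts: the function names `classify_status` and `probe_status` and the message wording are illustrative assumptions, and treating 200 as "device answered" is an assumption too. Only the meanings of 502 and 503 come from the symptom list.

```bash
#!/usr/bin/env bash
# Sketch: map an SFM HTTP status code to the symptom it suggests.
# classify_status and probe_status are hypothetical helper names.

classify_status() {
  case "$1" in
    200) echo "ok: device answered" ;;                        # assumption
    502) echo "protocol timeout waiting for POLL response" ;; # from the runbook
    503) echo "TCP connect refused" ;;                        # from the runbook
    000) echo "no HTTP response from SFM" ;;                  # curl prints 000 when it got no reply
    *)   echo "unexpected status $1" ;;
  esac
}

# probe_status HOST PORT -> prints the HTTP status of /device/monitor/status
probe_status() {
  local base="${SFM_BASE_URL:-http://localhost:8200}"
  curl -s -o /dev/null -w '%{http_code}' --max-time 30 \
    "${base}/device/monitor/status?host=$1&tcp_port=$2&force=true"
}

# Usage (needs a running SFM):
#   classify_status "$(probe_status <DEVICE_IP> <PORT>)"
```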

---

## Quick reference — how to recover

You need **ACEmanager access** to the unit's modem.

### Step 1: stop the modem's mode-flipping

In ACEmanager → **Serial → Port Configuration**:

| Field | Set to |
|---|---|
| **Destination Address** | clear (blank) |
| **Destination Port** | `0` |

Click **Apply**. This removes the modem's auto-dial-out target. The device's
AT dial commands now error back at the modem instead of triggering a
mode-flip, so the modem stays in TCP-server mode permanently and our inbound
TCP sessions stay alive.

*(Optional belt-and-suspenders: also add the BW server's port to
**Security → Port Filtering - Outbound** as a blocked port, with
Outbound Port Filtering Mode = Blocked Ports.)*

### Step 2: stop monitoring on the device (slow drip)

From the SFM host:

```bash
/home/serversdown/seismo-relay/scripts/slow_drip.sh <DEVICE_IP> <PORT>
```

Defaults are 120s duration with a drip every 3s. Watch the response:

- `duration_s ≈ 120` and `drips_sent ≈ 40` → session held the full duration ✓
- `bytes_received > 0` → device is responding ✓ (this is the success signal)

If `duration_s` is small or the response contains `send_error: "Broken pipe"`,
Step 1 didn't take hold. Re-check ACEmanager; the modem may need a reboot
after Apply.

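The arithmetic behind `drips_sent ≈ 40` and the pass/fail reading of the response can be sketched as below. This is an illustrative helper, not code from `slow_drip.sh`: the function names are hypothetical, the field values are passed in as plain arguments rather than parsed from JSON, and the "at least 90% of the target duration" cutoff is an assumption chosen for the example.

```bash
#!/usr/bin/env bash
# Sketch: interpret a slow_drip result. All names here are hypothetical.

# expected_drips DURATION_S INTERVAL_S -> drips that fit in the session
expected_drips() {
  echo $(( $1 / $2 ))
}

# drip_verdict DURATION_S BYTES_RECEIVED [SEND_ERROR] [TARGET_S]
drip_verdict() {
  local duration="${1%.*}"   # drop any fractional part, e.g. 120.4 -> 120
  local bytes="$2" send_error="${3:-}" target="${4:-120}"
  # Assumption: a session shorter than 90% of the target counts as dropped.
  if [ -n "$send_error" ] || [ "$duration" -lt $(( target * 9 / 10 )) ]; then
    echo "session dropped early: re-check Step 1"
  elif [ "$bytes" -gt 0 ]; then
    echo "success: session held and device responded"
  else
    echo "session held but device silent"
  fi
}

expected_drips 120 3        # prints 40
drip_verdict 120 990        # prints the success line
```

With the defaults, 120 / 3 gives the 40 drips quoted above; a response with a short `duration_s` or a `send_error` falls into the first branch.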
### Step 3: confirm monitoring stopped

```bash
curl 'http://localhost:8200/device/monitor/status?host=<DEVICE_IP>&tcp_port=<PORT>&force=true'
# expect: {"is_monitoring": false, ...}
```

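If the first status check still shows monitoring active, it can help to repeat it a few times rather than eyeball the JSON. A minimal sketch, assuming the response contains the `is_monitoring` key exactly as shown above; the function names, retry count, and delay are illustrative and not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch: poll /device/monitor/status until is_monitoring is false.
# monitoring_stopped and wait_until_stopped are hypothetical names.

# Succeeds if the JSON text on stdin has "is_monitoring": false.
monitoring_stopped() {
  grep -Eq '"is_monitoring"[[:space:]]*:[[:space:]]*false'
}

# wait_until_stopped HOST PORT [TRIES] [DELAY_S]
wait_until_stopped() {
  local host="$1" port="$2" tries="${3:-5}" delay="${4:-10}"
  local base="${SFM_BASE_URL:-http://localhost:8200}" i
  for (( i = 1; i <= tries; i++ )); do
    if curl -s --max-time 30 \
        "${base}/device/monitor/status?host=${host}&tcp_port=${port}&force=true" \
        | monitoring_stopped; then
      echo "monitoring stopped (attempt ${i})"
      return 0
    fi
    if (( i < tries )); then sleep "$delay"; fi
  done
  echo "still monitoring after ${tries} attempts" >&2
  return 1
}

# Usage (needs a running SFM):
#   wait_until_stopped <DEVICE_IP> <PORT>
```

The grep-based check keeps the sketch dependency-free; a JSON parser would be more robust if one is available on the SFM host.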
### Step 4: disable ACH at the device level + erase corrupted events

Either run the rescue script, which fires `/device/rescue`:

```bash
/home/serversdown/seismo-relay/scripts/rescue_device.sh <DEVICE_IP> <PORT>
```

Or do the two steps manually:

```bash
# Disable ACH in the device's compliance config
curl -X POST 'http://localhost:8200/device/call_home?host=<DEVICE_IP>&tcp_port=<PORT>' \
  -H 'Content-Type: application/json' \
  -d '{"auto_call_home_enabled": false}'

# Erase corrupted event chain
curl -X POST 'http://localhost:8200/device/events/erase?host=<DEVICE_IP>&tcp_port=<PORT>'
```

You can also do this via the SFM standalone UI → **Call Home** tab → set
`Enable Auto Call Home` to `Disabled` → **Write to Device**.

### Step 5: restore modem config (housekeeping)

Once the device-side ACH is disabled, restore the modem's Destination
Address and Port to the original values (e.g. `50.197.32.92` / `12345`) in
ACEmanager. The modem will resume normal bidirectional behavior, but the
unit won't issue any dial commands until ACH is explicitly re-enabled on
the device.

### Step 6: do NOT re-enable ACH until the hardware fault is repaired

If ACH is re-enabled on this unit before the underlying hardware fault is
repaired, the call-home loop starts again immediately and you'll be running
this runbook a second time.

---

## Why this works — the failure mode explained

The Sierra Wireless RV50/RV55 serial port operates in one of two TCP modes
at any moment:

- **Server mode** — listens on `Device Port` (e.g. 9034), bridges inbound
  TCP to the device's serial port. This is what we need to interact with
  the device.
- **Client mode** — when the device sends an AT dial command on its serial
  TX line, the modem opens an outbound TCP to `Destination Address:Port`
  and bridges that to serial.

A serial port in this configuration is **bidirectional**: the modem flips
between server and client modes on demand. When the device's firmware is
healthy and only dials occasionally, this works fine.

When the unit is constantly triggering events and ACH is set to "After
Event Recorded", the device sends an AT dial command every few seconds.
Each one causes the modem to:

1. Drop any active inbound TCP session
2. Flip to client mode
3. Attempt outbound TCP to `Destination Address:Port`
4. Hang for up to a minute waiting for it to succeed/fail
5. Drop back to server mode

**During the entire hang, no inbound TCP can establish.** Even between
hangs, the modem closes any existing inbound session before flipping. So
any tool that needs more than a few seconds of held TCP (e.g. POLL +
config read + write) gets repeatedly kicked off.

Clearing `Destination Address` removes steps 3-4 from the cycle: the modem
has nowhere to dial, so it doesn't flip modes when it receives an AT dial
command. The serial port effectively becomes server-only, and inbound TCP
sessions can stay open as long as needed.

**This is a modem-layer issue, not a device firmware issue.** The device
is alive and responsive the whole time — confirmed in the BE9558H
recovery by 990 bytes of S3 responses received over a 120s slow-drip
session once the modem was no longer mode-flipping.

---

## Why simpler approaches don't work

| Approach | Why it fails |
|---|---|
| Standard `/device/info` | Triggers `count_events()` 1E/1F walk, takes 90s+ and hits corrupted event chain in this scenario |
| `/device/rescue` race loop | Gets 502 (protocol timeout) because the modem closes the TCP before the POLL handshake can complete |
| `/device/stop_monitoring_blind` (single frame) | Even if the bytes leave the wire, the device's protocol parser ignores write commands without a preceding POLL handshake (early-version bug, now fixed by including POLL preamble in blind sends) |
| `/device/stop_monitoring_spam` (sub-second cadence) | Each session is killed by the modem's mode-flip before the device can drain its UART RX buffer; high-rate spam also risks UART FIFO overrun on the device side |
| Outbound port firewall block alone | Stops the outbound TCP from succeeding, but doesn't stop the modem from *trying* and mode-flipping. Reduces but doesn't eliminate the contention. |
| Modem reboot | Temporary — as soon as the device starts triggering again, the loop resumes within seconds |

The combination of `slow_drip` + cleared `Destination Address` works because:

1. The modem stops mode-flipping → TCP session stays open for the full
   drip duration
2. Slow drip rate → device's UART RX FIFO never overflows even if
   firmware is busy with event recording
3. The drip is `SESSION_RESET + STOP_MONITORING` every 3s → many
   independent chances for the parser to land one valid frame
4. Once one Stop Monitoring is parsed, event recording halts → firmware
   has CPU to spare → subsequent operations are trivially easy

---

## Tooling reference

All endpoints live in `seismo-relay/sfm/server.py`. All scripts live in
`seismo-relay/scripts/` and default to SFM direct (`http://localhost:8200`),
overridable via `SFM_BASE_URL`.

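The `SFM_BASE_URL` override can be expressed with a standard shell default expansion. The sketch below shows one way a script might build endpoint URLs; `sfm_url` is a hypothetical helper, not a function from the repository, and it assumes every endpoint takes the same `host` and `tcp_port` query parameters seen in the curl examples in this runbook.

```bash
#!/usr/bin/env bash
# Sketch: build an SFM endpoint URL, honoring SFM_BASE_URL if set.
# sfm_url is a hypothetical name; the query parameters are assumed to
# match the curl examples elsewhere in this runbook.

# sfm_url PATH HOST PORT
sfm_url() {
  local base="${SFM_BASE_URL:-http://localhost:8200}"
  # ${base%/} drops one trailing slash so the path is not doubled up.
  echo "${base%/}$1?host=$2&tcp_port=$3"
}

# Usage:
#   curl -s "$(sfm_url /device/events/storage_range <DEVICE_IP> <PORT>)"
#   SFM_BASE_URL=http://other-host:8200 ./some_script.sh <DEVICE_IP> <PORT>
```

Resolving the base URL inside the function, rather than once at the top of the script, means a caller can change `SFM_BASE_URL` between calls.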
### Endpoints added during BE9558H recovery

| Endpoint | Purpose |
|---|---|
| `GET /device/events/storage_range` | SUB 0x06 — first/last event keys, `is_empty` flag. ~2s, no event walk. |
| `GET /device/events/index` | SUB 0x08 — lifetime event counter (does NOT decrement on erase). ~2s. |
| `POST /device/events/erase` | Full erase sequence 0xA3 → 0x1C → 0x06 → 0xA2. |
| `POST /device/rescue` | Disable ACH + erase in one TCP session. Short timeouts for race-loop usage. |
| `POST /device/stop_monitoring_blind` | Fire-and-forget Stop with full POLL preamble (single attempt). |
| `POST /device/stop_monitoring_spam` | Server-side tight retry loop, sub-second cadence, duration-bounded. |
| `POST /device/stop_monitoring_slow_drip` | One held TCP session, slow trickle of stop frames. **The endpoint that saved BE9558H.** |

Also changed: default protocol recv timeout dropped from 30s → 10s in
`_build_client`. Added `connect_timeout` knob to same. Cleaned up
unhandled-exception path in `/device/monitor/status` so it returns 502
instead of 500 on protocol timeouts.

### Scripts

| Script | Purpose |
|---|---|
| `scripts/rescue_device.sh` | Race-loop wrapper around `/device/rescue` |
| `scripts/blind_stop.sh` | Race-loop wrapper around `/device/stop_monitoring_blind` |
| `scripts/spam_stop.sh` | Single-call burst hammer (`/device/stop_monitoring_spam`) |
| `scripts/slow_drip.sh` | Single-call held-session drip (`/device/stop_monitoring_slow_drip`) |
| `scripts/watch_unit.sh` | Passive periodic reachability check, logs to file |

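For readers unfamiliar with the "race-loop wrapper" pattern named in the table, here is a minimal sketch of the idea: retry one short request until it succeeds or an attempt budget runs out. It is not the content of `rescue_device.sh`; `race_loop`, `rescue_once`, the attempt count, and the timeout are all illustrative assumptions.

```bash
#!/usr/bin/env bash
# Sketch of a race loop: keep retrying a short-lived command, hoping one
# attempt lands in a window where the modem accepts inbound TCP.
# race_loop and rescue_once are hypothetical names.

# race_loop MAX_ATTEMPTS DELAY_S CMD [ARGS...]
race_loop() {
  local max="$1" delay="$2" attempt
  shift 2
  for (( attempt = 1; attempt <= max; attempt++ )); do
    if "$@"; then
      echo "succeeded on attempt ${attempt}"
      return 0
    fi
    if (( attempt < max )); then sleep "$delay"; fi
  done
  echo "gave up after ${max} attempts" >&2
  return 1
}

# One rescue attempt. curl -f turns HTTP errors such as 502/503 into a
# non-zero exit status, which is what lets the loop retry.
rescue_once() {
  local base="${SFM_BASE_URL:-http://localhost:8200}"
  curl -sf --max-time 15 -X POST \
    "${base}/device/rescue?host=$1&tcp_port=$2" > /dev/null
}

# Usage (needs a running SFM):
#   race_loop 200 1 rescue_once <DEVICE_IP> <PORT>
```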

---

## Incident log — BE9558H, 2026-05-16/17

What was wrong: Long-axis geophone developed an offset, constantly above
trigger threshold → constant event recording → after-event ACH set →
modem dialing office BW server (`50.197.32.92:12345`) every 30-60s.
Local event chain corrupted (`next_boundary 0x100EE exceeds uint16`).

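The `exceeds uint16` message can be verified by hand: a 16-bit unsigned field tops out at `0xFFFF`, and the logged value needs a 17th bit. The short check below is only a worked example of that arithmetic; `fits_uint16` is a hypothetical helper and says nothing about how the relay actually validates boundaries.

```bash
#!/usr/bin/env bash
# Worked example: does the logged next_boundary fit in an unsigned 16-bit field?
# fits_uint16 is a hypothetical helper for illustration.

fits_uint16() {
  (( $1 >= 0 && $1 <= 0xFFFF ))
}

val=0x100EE
printf '%s = %d, uint16 max = %d\n' "$val" "$val" 0xFFFF
# prints: 0x100EE = 65774, uint16 max = 65535

if fits_uint16 "$val"; then
  echo "fits in uint16"
else
  echo "overflows uint16 by $(( val - 0xFFFF ))"
fi
# prints: overflows uint16 by 239
```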
Diagnostic path:

1. `/device/info` slow, choked on event walk
2. Built lightweight probe endpoints (`storage_range`, `index`) — useful
   but didn't reach the wedged unit
3. Built `/device/rescue` with short timeouts — got 502 (POLL no response)
4. Built `/device/stop_monitoring_blind` — first version was a false
   positive (no POLL preamble); fixed by including
   `SESSION_RESET+POLL_PROBE+SESSION_RESET+POLL_DATA` in the dump
5. Verified blind stop works on bench unit
6. Built `/device/stop_monitoring_spam` — 420 successful sends over
   5 min, zero behavior change on field unit
7. Inspected ACEmanager logs → saw outbound dial-out attempts every ~30s,
   confirmed device was not fully locked up
8. Added outbound port-12345 firewall block → outbound attempts now fail
   instantly but contention persisted
9. Built `/device/stop_monitoring_slow_drip` — session died at 3s with
   broken pipe (modem closing on us)
10. Looked at full ACEmanager Port Configuration → **found
    `Destination Address: 50.197.32.92` configured**, realized every AT
    dial command was triggering a modem mode-flip that killed our inbound
    TCP session
11. Cleared Destination Address + Port → slow_drip held 120s, device
    responded with 990 bytes, 39 stop commands acked
12. Disabled ACH at device level via `/device/call_home`, erased events

Final state: device IDLE, memory 958.1 / 960 KB free, ACH disabled at
device level, modem destination cleared (to be restored after physical
service).

Total time from the first "i was wondering if its possible to" attempt to
recovery: ~7 hours of intermittent debugging across one evening.