Files
seismo-relay/docs/runbooks/wedged_unit_recovery.md
T
serversdown 1fff8179d6 Add runbook for recovering wedged units and new scripts for device management
- Created a comprehensive runbook (`wedged_unit_recovery.md`) detailing the recovery process for units stuck in a call-home loop, including symptoms, recovery steps, and explanations of the failure mode.
- Added `blind_stop.sh` script to send stop-monitoring commands in a tight loop for unresponsive devices.
- Introduced `rescue_device.sh` script to disable Auto Call Home and erase events from a busy device.
- Implemented `slow_drip.sh` script to send stop-monitoring frames at a slow rate to prevent UART overrun.
- Developed `spam_stop.sh` script to rapidly send stop-monitoring commands to a device.
- Created `watch_unit.sh` script for passive monitoring of device reachability, logging results over time.
2026-05-17 07:58:13 +00:00

256 lines
11 KiB
Markdown

# Runbook — Recovering a wedged unit stuck in a call-home loop
**Original incident:** BE9558H at `166.246.130.1:9034`, recovered 2026-05-17.
A field unit with a stuck-triggered geophone (or any hardware fault causing
constant event triggering) will record events back-to-back, and if Auto Call
Home is set to "After Event Recorded" the device will dial the office BW
ACH server in a tight loop. Combined with a Sierra Wireless modem in
bidirectional serial-TCP mode, this makes the unit effectively unreachable
from SFM — every TCP connection we open gets killed when the modem flips
from server-mode to client-mode to honor the device's next AT dial command.
This runbook describes how to break the loop and recover control.
---
## Symptoms
- Terra-View / SFM `/device/info` either hangs or fails on `count_events()`.
- `/device/monitor/status` and `/device/rescue` return 502 (protocol timeout
waiting for POLL response) or 503 (TCP connect refused).
- ACEmanager serial log shows repeating
`Connect to IP: <BW_IP> Port: <BW_PORT>``Shutdown TCP socket` cycles
every 30-60 seconds.
- Spam-mode endpoints (`/device/stop_monitoring_spam`) report many
`sent_ok` but the device's monitoring state never changes.
- `slow_drip` reports `[Errno 32] Broken pipe` after sending the preamble
but before completing the drip loop.
If you see *all* of these, the unit is in this exact failure mode.
---
## Quick reference — how to recover
You need **ACEmanager access** to the unit's modem.
### Step 1: stop the modem's mode-flipping
In ACEmanager → **Serial → Port Configuration**:
| Field | Set to |
|---|---|
| **Destination Address** | clear (blank) |
| **Destination Port** | `0` |
Click **Apply**. This removes the modem's auto-dial-out target. The device's
AT dial commands now error back at the modem instead of triggering a
mode-flip, so the modem stays in TCP-server mode permanently and our inbound
TCP sessions stay alive.
*(Optional belt-and-suspenders: also add the BW server's port to
**Security → Port Filtering - Outbound** as a blocked port, with
Outbound Port Filtering Mode = Blocked Ports.)*
### Step 2: stop monitoring on the device (slow drip)
From the SFM host:
```bash
/home/serversdown/seismo-relay/scripts/slow_drip.sh <DEVICE_IP> <PORT>
```
Defaults are 120s duration with a drip every 3s. Watch the response:
- `duration_s ≈ 120` and `drips_sent ≈ 40` → session held the full duration ✓
- `bytes_received > 0` → device is responding ✓ (this is the success signal)
If `duration_s` is small or `send_error: "Broken pipe"`, Step 1 didn't take
hold — re-check ACEmanager, may need to reboot the modem after Apply.
### Step 3: confirm monitoring stopped
```bash
curl 'http://localhost:8200/device/monitor/status?host=<DEVICE_IP>&tcp_port=<PORT>&force=true'
# expect: {"is_monitoring": false, ...}
```
### Step 4: disable ACH at the device level + erase corrupted events
Either fire the rescue endpoint:
```bash
/home/serversdown/seismo-relay/scripts/rescue_device.sh <DEVICE_IP> <PORT>
```
Or do the two steps manually:
```bash
# Disable ACH in the device's compliance config
curl -X POST 'http://localhost:8200/device/call_home?host=<DEVICE_IP>&tcp_port=<PORT>' \
-H 'Content-Type: application/json' \
-d '{"auto_call_home_enabled": false}'
# Erase corrupted event chain
curl -X POST 'http://localhost:8200/device/events/erase?host=<DEVICE_IP>&tcp_port=<PORT>'
```
You can also do this via the SFM standalone UI → **Call Home** tab → set
`Enable Auto Call Home` to `Disabled`**Write to Device**.
### Step 5: restore modem config (housekeeping)
Once the device-side ACH is disabled, restore the modem's Destination
Address and Port to the original values (e.g. `50.197.32.92` / `12345`) in
ACEmanager. The modem will resume normal bidirectional behavior, but the
unit won't issue any dial commands until ACH is explicitly re-enabled on
the device.
### Step 6: do NOT re-enable ACH on this unit until the underlying hardware
fault is repaired. If you do, the call-home loop starts again immediately
and you'll be running this runbook a second time.
---
## Why this works — the failure mode explained
The Sierra Wireless RV50/RV55 serial port operates in one of two TCP modes
at any moment:
- **Server mode** — listens on `Device Port` (e.g. 9034), bridges inbound
TCP to the device's serial port. This is what we need to interact with
the device.
- **Client mode** — when the device sends an AT dial command on its serial
TX line, the modem opens an outbound TCP to `Destination Address:Port`
and bridges that to serial.
A serial port in this configuration is **bidirectional**: the modem flips
between server and client modes on demand. When the device's firmware is
healthy and only dials occasionally, this works fine.
When the unit is constantly triggering events and ACH is set to "After
Event Recorded", the device sends an AT dial command every few seconds.
Each one causes the modem to:
1. Drop any active inbound TCP session
2. Flip to client mode
3. Attempt outbound TCP to `Destination Address:Port`
4. Hang for up to a minute waiting for it to succeed/fail
5. Drop back to server mode
**During the entire hang, no inbound TCP can establish.** Even between
hangs, the modem closes any existing inbound session before flipping. So
any tool that needs more than a few seconds of held TCP (e.g. POLL +
config read + write) gets repeatedly kicked off.
Clearing `Destination Address` removes step 3-4 from the cycle: the modem
has nowhere to dial, so it doesn't flip modes when it receives an AT dial
command. The serial port effectively becomes server-only, and inbound TCP
sessions can stay open as long as needed.
**This is a modem-layer issue, not a device firmware issue.** The device
is alive and responsive the whole time — confirmed in the BE9558H
recovery by 990 bytes of S3 responses received over a 120s slow-drip
session once the modem was no longer mode-flipping.
---
## Why simpler approaches don't work
| Approach | Why it fails |
|---|---|
| Standard `/device/info` | Triggers `count_events()` 1E/1F walk, takes 90s+ and hits corrupted event chain in this scenario |
| `/device/rescue` race loop | Gets 502 (protocol timeout) because the modem closes the TCP before the POLL handshake can complete |
| `/device/stop_monitoring_blind` (single frame) | Even if the bytes leave the wire, the device's protocol parser ignores write commands without a preceding POLL handshake (early-version bug, now fixed by including POLL preamble in blind sends) |
| `/device/stop_monitoring_spam` (sub-second cadence) | Each session is killed by the modem's mode-flip before the device can drain its UART RX buffer; high-rate spam also risks UART FIFO overrun on the device side |
| Outbound port firewall block alone | Stops the outbound TCP from succeeding, but doesn't stop the modem from *trying* and mode-flipping. Reduces but doesn't eliminate the contention. |
| Modem reboot | Temporary — as soon as the device starts triggering again, the loop resumes within seconds |
The combination of `slow_drip` + cleared `Destination Address` works because:
1. The modem stops mode-flipping → TCP session stays open for the full
drip duration
2. Slow drip rate → device's UART RX FIFO never overflows even if
firmware is busy with event recording
3. The drip is `SESSION_RESET + STOP_MONITORING` every 3s → many
independent chances for the parser to land one valid frame
4. Once one Stop Monitoring is parsed, event recording halts → firmware
has CPU to spare → subsequent operations are trivially easy
---
## Tooling reference
All endpoints live in `seismo-relay/sfm/server.py`. All scripts live in
`seismo-relay/scripts/` and default to SFM direct (`http://localhost:8200`),
overridable via `SFM_BASE_URL`.
### Endpoints added during BE9558H recovery
| Endpoint | Purpose |
|---|---|
| `GET /device/events/storage_range` | SUB 0x06 — first/last event keys, `is_empty` flag. ~2s, no event walk. |
| `GET /device/events/index` | SUB 0x08 — lifetime event counter (does NOT decrement on erase). ~2s. |
| `POST /device/events/erase` | Full erase sequence 0xA3 → 0x1C → 0x06 → 0xA2. |
| `POST /device/rescue` | Disable ACH + erase in one TCP session. Short timeouts for race-loop usage. |
| `POST /device/stop_monitoring_blind` | Fire-and-forget Stop with full POLL preamble (single attempt). |
| `POST /device/stop_monitoring_spam` | Server-side tight retry loop, sub-second cadence, duration-bounded. |
| `POST /device/stop_monitoring_slow_drip` | One held TCP session, slow trickle of stop frames. **The endpoint that saved BE9558H.** |
Also changed: default protocol recv timeout dropped from 30s → 10s in
`_build_client`. Added `connect_timeout` knob to same. Cleaned up
unhandled-exception path in `/device/monitor/status` so it returns 502
instead of 500 on protocol timeouts.
### Scripts
| Script | Purpose |
|---|---|
| `scripts/rescue_device.sh` | Race-loop wrapper around `/device/rescue` |
| `scripts/blind_stop.sh` | Race-loop wrapper around `/device/stop_monitoring_blind` |
| `scripts/spam_stop.sh` | Single-call burst hammer (`/device/stop_monitoring_spam`) |
| `scripts/slow_drip.sh` | Single-call held-session drip (`/device/stop_monitoring_slow_drip`) |
| `scripts/watch_unit.sh` | Passive periodic reachability check, logs to file |
---
## Incident log — BE9558H, 2026-05-16/17
What was wrong: Long-axis geophone developed an offset, constantly above
trigger threshold → constant event recording → after-event ACH set →
modem dialing office BW server (`50.197.32.92:12345`) every 30-60s.
Local event chain corrupted (`next_boundary 0x100EE exceeds uint16`).
Diagnostic path:
1. `/device/info` slow, choked on event walk
2. Built lightweight probe endpoints (`storage_range`, `index`) — useful
but didn't reach the wedged unit
3. Built `/device/rescue` with short timeouts — got 502 (POLL no response)
4. Built `/device/stop_monitoring_blind` — first version was a false
positive (no POLL preamble); fixed by including
`SESSION_RESET+POLL_PROBE+SESSION_RESET+POLL_DATA` in the dump
5. Verified blind stop works on bench unit
6. Built `/device/stop_monitoring_spam` — 420 successful sends over
5 min, zero behavior change on field unit
7. Inspected ACEmanager logs → saw outbound dial-out attempts every ~30s,
confirmed device was not fully locked up
8. Added outbound port-12345 firewall block → outbound attempts now fail
instantly but contention persisted
9. Built `/device/stop_monitoring_slow_drip` — session died at 3s with
broken pipe (modem closing on us)
10. Looked at full ACEmanager Port Configuration → **found
`Destination Address: 50.197.32.92` configured**, realized every AT
dial command was triggering a modem mode-flip that killed our inbound
11. Cleared Destination Address + Port → slow_drip held 120s, device
responded with 990 bytes, 39 stop commands acked
12. Disabled ACH at device level via `/device/call_home`, erased events
Final state: device IDLE, memory 958.1 / 960 KB free, ACH disabled at
device level, modem destination cleared (to be restored after physical
service).
Total time from "i was wondering if its possible to" first attempt to
recovery: ~7 hours of intermittent debugging across one evening.