seismo-relay/docs/runbooks/wedged_unit_recovery.md
serversdown 1fff8179d6 Add runbook for recovering wedged units and new scripts for device management
- Created a comprehensive runbook (`wedged_unit_recovery.md`) detailing the recovery process for units stuck in a call-home loop, including symptoms, recovery steps, and explanations of the failure mode.
- Added `blind_stop.sh` script to send stop-monitoring commands in a tight loop for unresponsive devices.
- Introduced `rescue_device.sh` script to disable Auto Call Home and erase events from a busy device.
- Implemented `slow_drip.sh` script to send stop-monitoring frames at a slow rate to prevent UART overrun.
- Developed `spam_stop.sh` script to rapidly send stop-monitoring commands to a device.
- Created `watch_unit.sh` script for passive monitoring of device reachability, logging results over time.
2026-05-17 07:58:13 +00:00

Runbook — Recovering a wedged unit stuck in a call-home loop

Original incident: BE9558H at 166.246.130.1:9034, recovered 2026-05-17.

A field unit with a stuck-triggered geophone (or any hardware fault causing constant event triggering) will record events back-to-back, and if Auto Call Home is set to "After Event Recorded" the device will dial the office BW ACH server in a tight loop. Combined with a Sierra Wireless modem in bidirectional serial-TCP mode, this makes the unit effectively unreachable from SFM — every TCP connection we open gets killed when the modem flips from server-mode to client-mode to honor the device's next AT dial command.

This runbook describes how to break the loop and recover control.


Symptoms

  • Terra-View / SFM /device/info either hangs or fails on count_events().
  • /device/monitor/status and /device/rescue return 502 (protocol timeout waiting for POLL response) or 503 (TCP connect refused).
  • ACEmanager serial log shows repeating "Connect to IP: <BW_IP> Port: <BW_PORT>" → "Shutdown TCP socket" cycles every 30-60 seconds.
  • Spam-mode endpoints (/device/stop_monitoring_spam) report many sent_ok but the device's monitoring state never changes.
  • slow_drip reports [Errno 32] Broken pipe after sending the preamble but before completing the drip loop.

If you see all of these, the unit is in this exact failure mode.


Quick reference — how to recover

You need ACEmanager access to the unit's modem.

Step 1: stop the modem's mode-flipping

In ACEmanager → Serial → Port Configuration:

| Field | Set to |
| --- | --- |
| Destination Address | clear (blank) |
| Destination Port | 0 |

Click Apply. This removes the modem's auto-dial-out target. The device's AT dial commands now error back at the modem instead of triggering a mode-flip, so the modem stays in TCP-server mode permanently and our inbound TCP sessions stay alive.

(Optional belt-and-suspenders: also add the BW server's port to Security → Port Filtering - Outbound as a blocked port, with Outbound Port Filtering Mode = Blocked Ports.)

Step 2: stop monitoring on the device (slow drip)

From the SFM host:

/home/serversdown/seismo-relay/scripts/slow_drip.sh <DEVICE_IP> <PORT>

Defaults are 120s duration with a drip every 3s. Watch the response:

  • duration_s ≈ 120 and drips_sent ≈ 40 → session held the full duration ✓
  • bytes_received > 0 → device is responding ✓ (this is the success signal)

If duration_s is small or send_error is "Broken pipe", Step 1 didn't take hold. Re-check ACEmanager; you may need to reboot the modem after clicking Apply.
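
The checks above can be folded into a small helper. This is a minimal sketch, not part of seismo-relay: the function name, the result labels, and the 90% "held the full duration" threshold are assumptions, and it assumes the response is a JSON object carrying the duration_s, bytes_received, and send_error fields named above.

```python
def assess_slow_drip(resp, expected_duration_s=120):
    """Classify a slow_drip response dict using the checks from this runbook."""
    problems = []
    if resp.get("send_error"):
        problems.append(f"send_error: {resp['send_error']}")
    # Assumed slack: treat anything under 90% of the requested duration as
    # "session ended early".
    if resp.get("duration_s", 0) < 0.9 * expected_duration_s:
        problems.append("session ended early")

    held = not problems
    responding = resp.get("bytes_received", 0) > 0
    if held and responding:
        return "ok"                    # session held and the device answered
    if held:
        return "held_but_silent"       # session held, device sent nothing back
    return "modem_still_flipping"      # Step 1 did not take hold
```

A result of "modem_still_flipping" means going back to Step 1 before retrying the drip.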

Step 3: confirm monitoring stopped

curl 'http://localhost:8200/device/monitor/status?host=<DEVICE_IP>&tcp_port=<PORT>&force=true'
# expect: {"is_monitoring": false, ...}
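
If the check is scripted rather than run by hand, the same request can be built in Python. A hedged sketch: status_url and monitoring_stopped are hypothetical helper names, and only the is_monitoring field is assumed to be present in the response.

```python
import json
from urllib.parse import urlencode


def status_url(host, tcp_port, base_url="http://localhost:8200"):
    """Build the same URL as the curl command above."""
    query = urlencode({"host": host, "tcp_port": tcp_port, "force": "true"})
    return f"{base_url}/device/monitor/status?{query}"


def monitoring_stopped(body):
    """True only when the response explicitly reports is_monitoring == false."""
    return json.loads(body).get("is_monitoring") is False
```

To use it, fetch status_url(...) with urllib.request.urlopen and a timeout, then pass the decoded body to monitoring_stopped.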

Step 4: disable ACH at the device level + erase corrupted events

Either fire the rescue endpoint:

/home/serversdown/seismo-relay/scripts/rescue_device.sh <DEVICE_IP> <PORT>

Or do the two steps manually:

# Disable ACH in the device's compliance config
curl -X POST 'http://localhost:8200/device/call_home?host=<DEVICE_IP>&tcp_port=<PORT>' \
  -H 'Content-Type: application/json' \
  -d '{"auto_call_home_enabled": false}'

# Erase corrupted event chain
curl -X POST 'http://localhost:8200/device/events/erase?host=<DEVICE_IP>&tcp_port=<PORT>'

You can also do this via the SFM standalone UI → Call Home tab → set Enable Auto Call Home to Disabled → Write to Device.
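
For scripted use, the two manual calls can be prepared as request objects. This sketch only builds the requests and does not send them; manual_rescue_requests is a hypothetical name, and the endpoints and JSON body are copied from the curl commands above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def manual_rescue_requests(host, tcp_port, base_url="http://localhost:8200"):
    """Return [disable_ach, erase] as unsent urllib Request objects."""
    query = urlencode({"host": host, "tcp_port": tcp_port})
    disable_ach = Request(
        f"{base_url}/device/call_home?{query}",
        data=json.dumps({"auto_call_home_enabled": False}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # Erase takes no body; the target device is identified by the query string.
    erase = Request(f"{base_url}/device/events/erase?{query}", method="POST")
    return [disable_ach, erase]
```

Send them in order with urllib.request.urlopen(req, timeout=...), disabling ACH first so the device stops dialing before the erase runs.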

Step 5: restore modem config (housekeeping)

Once the device-side ACH is disabled, restore the modem's Destination Address and Port to the original values (e.g. 50.197.32.92 / 12345) in ACEmanager. The modem will resume normal bidirectional behavior, but the unit won't issue any dial commands until ACH is explicitly re-enabled on the device.

Step 6: do NOT re-enable ACH on this unit until the underlying hardware fault is repaired

If you do, the call-home loop starts again immediately and you'll be running this runbook a second time.


Why this works — the failure mode explained

The Sierra Wireless RV50/RV55 serial port operates in one of two TCP modes at any moment:

  • Server mode — listens on Device Port (e.g. 9034), bridges inbound TCP to the device's serial port. This is what we need to interact with the device.
  • Client mode — when the device sends an AT dial command on its serial TX line, the modem opens an outbound TCP to Destination Address:Port and bridges that to serial.

A serial port in this configuration is bidirectional: the modem flips between server and client modes on demand. When the device's firmware is healthy and only dials occasionally, this works fine.

When the unit is constantly triggering events and ACH is set to "After Event Recorded", the device sends an AT dial command every few seconds. Each one causes the modem to:

  1. Drop any active inbound TCP session
  2. Flip to client mode
  3. Attempt outbound TCP to Destination Address:Port
  4. Hang for up to a minute waiting for it to succeed/fail
  5. Drop back to server mode

During the entire hang, no inbound TCP can establish. Even between hangs, the modem closes any existing inbound session before flipping. So any tool that needs more than a few seconds of held TCP (e.g. POLL + config read + write) gets repeatedly kicked off.

Clearing Destination Address breaks the cycle at its root: with nowhere to dial, the modem rejects the AT dial command instead of flipping modes, so none of steps 1-5 happen. The serial port effectively becomes server-only, and inbound TCP sessions can stay open as long as needed.
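
A toy timeline model makes the contention concrete. This is an illustrative sketch only: the function name is hypothetical, the dial times and hang length in the usage note are made-up numbers, and it assumes a dial arriving during a hang simply extends the busy period.

```python
def longest_inbound_window(dial_times_s, hang_s, horizon_s):
    """Longest stretch (seconds) within [0, horizon_s] spent in server mode.

    dial_times_s: times at which the device issues an AT dial command.
    hang_s: how long the modem stays in client mode after each dial.
    """
    longest = 0.0
    free_from = 0.0  # modem is back in server mode from this time onward
    for t in sorted(dial_times_s):
        if t >= horizon_s:
            break
        if t > free_from:
            # Server-mode gap that ends when this dial kills the session.
            longest = max(longest, t - free_from)
        free_from = max(free_from, t + hang_s)
    if horizon_s > free_from:
        longest = max(longest, horizon_s - free_from)
    return longest
```

With dials at 5, 40, 75, and 110 seconds and a 30-second hang, the longest usable window in two minutes is 5 seconds; with no dials at all (Destination Address cleared) it is the full 120 seconds.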

This is a modem-layer issue, not a device firmware issue. The device is alive and responsive the whole time — confirmed in the BE9558H recovery by 990 bytes of S3 responses received over a 120s slow-drip session once the modem was no longer mode-flipping.


Why simpler approaches don't work

| Approach | Why it fails |
| --- | --- |
| Standard /device/info | Triggers count_events() 1E/1F walk, takes 90s+ and hits corrupted event chain in this scenario |
| /device/rescue race loop | Gets 502 (protocol timeout) because the modem closes the TCP before the POLL handshake can complete |
| /device/stop_monitoring_blind (single frame) | Even if the bytes leave the wire, the device's protocol parser ignores write commands without a preceding POLL handshake (early-version bug, now fixed by including POLL preamble in blind sends) |
| /device/stop_monitoring_spam (sub-second cadence) | Each session is killed by the modem's mode-flip before the device can drain its UART RX buffer; high-rate spam also risks UART FIFO overrun on the device side |
| Outbound port firewall block alone | Stops the outbound TCP from succeeding, but doesn't stop the modem from trying and mode-flipping. Reduces but doesn't eliminate the contention. |
| Modem reboot | Temporary: as soon as the device starts triggering again, the loop resumes within seconds |

The combination of slow_drip + cleared Destination Address works because:

  1. The modem stops mode-flipping → TCP session stays open for the full drip duration
  2. Slow drip rate → device's UART RX FIFO never overflows even if firmware is busy with event recording
  3. The drip is SESSION_RESET + STOP_MONITORING every 3s → many independent chances for the parser to land one valid frame
  4. Once one Stop Monitoring is parsed, event recording halts → firmware has CPU to spare → subsequent operations are trivially easy
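
The drip loop itself is simple. The sketch below is not the code in sfm/server.py: the frame constants are placeholders (the real SESSION_RESET and STOP_MONITORING bytes are not reproduced here), the POLL preamble is omitted, and the injectable send, sleep, and clock parameters are assumptions added so the loop can be exercised without a device.

```python
import time

# Placeholder payloads, NOT real protocol bytes.
SESSION_RESET = b"<SESSION_RESET>"
STOP_MONITORING = b"<STOP_MONITORING>"


def slow_drip(send, duration_s=120, interval_s=3,
              sleep=time.sleep, clock=time.monotonic):
    """Send one reset+stop frame pair every interval_s over one held session."""
    start = clock()
    drips_sent = 0
    send_error = None
    while clock() - start < duration_s:
        try:
            send(SESSION_RESET + STOP_MONITORING)
        except OSError as exc:  # includes BrokenPipeError when the modem drops us
            send_error = str(exc)
            break
        drips_sent += 1
        sleep(interval_s)  # slow cadence keeps the device's UART RX FIFO drained
    return {
        "duration_s": clock() - start,
        "drips_sent": drips_sent,
        "send_error": send_error,
    }
```

In real use, send would be the sendall method of a connected TCP socket. With the defaults, a session that holds sends 120 / 3 = 40 drips.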

Tooling reference

All endpoints live in seismo-relay/sfm/server.py. All scripts live in seismo-relay/scripts/ and default to SFM direct (http://localhost:8200), overridable via SFM_BASE_URL.

Endpoints added during BE9558H recovery

| Endpoint | Purpose |
| --- | --- |
| GET /device/events/storage_range | SUB 0x06: first/last event keys, is_empty flag. ~2s, no event walk. |
| GET /device/events/index | SUB 0x08: lifetime event counter (does NOT decrement on erase). ~2s. |
| POST /device/events/erase | Full erase sequence 0xA3 → 0x1C → 0x06 → 0xA2. |
| POST /device/rescue | Disable ACH + erase in one TCP session. Short timeouts for race-loop usage. |
| POST /device/stop_monitoring_blind | Fire-and-forget Stop with full POLL preamble (single attempt). |
| POST /device/stop_monitoring_spam | Server-side tight retry loop, sub-second cadence, duration-bounded. |
| POST /device/stop_monitoring_slow_drip | One held TCP session, slow trickle of stop frames. The endpoint that saved BE9558H. |

Also changed: default protocol recv timeout dropped from 30s → 10s in _build_client. Added connect_timeout knob to same. Cleaned up unhandled-exception path in /device/monitor/status so it returns 502 instead of 500 on protocol timeouts.

Scripts

| Script | Purpose |
| --- | --- |
| scripts/rescue_device.sh | Race-loop wrapper around /device/rescue |
| scripts/blind_stop.sh | Race-loop wrapper around /device/stop_monitoring_blind |
| scripts/spam_stop.sh | Single-call burst hammer (/device/stop_monitoring_spam) |
| scripts/slow_drip.sh | Single-call held-session drip (/device/stop_monitoring_slow_drip) |
| scripts/watch_unit.sh | Passive periodic reachability check, logs to file |
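
For readers who want the idea behind a passive reachability check without reading the script, here is a rough Python equivalent. It is a sketch under assumptions: watch_unit.sh may probe differently, and the function names and log-line format below are invented for illustration.

```python
import socket
import time


def probe(host, port, timeout_s=5.0):
    """One TCP connect attempt; opens and closes the socket, sends no data."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True, "connected"
    except OSError as exc:  # refused, timed out, unreachable, ...
        return False, f"{type(exc).__name__}: {exc}"


def log_line(host, port, reachable, detail, now=None):
    """Format one UTC-timestamped result line for appending to a log file."""
    stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(now))
    state = "UP" if reachable else "DOWN"
    return f"{stamp} {host}:{port} {state} {detail}"
```

Calling probe on a fixed interval and appending log_line output to a file gives a timeline of when the unit was reachable.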

Incident log — BE9558H, 2026-05-16/17

What was wrong: Long-axis geophone developed an offset, constantly above trigger threshold → constant event recording → after-event ACH set → modem dialing office BW server (50.197.32.92:12345) every 30-60s. Local event chain corrupted (next_boundary 0x100EE exceeds uint16).
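
The corruption check is plain arithmetic: an unsigned 16-bit field tops out at 0xFFFF. A minimal sketch, assuming next_boundary is meant to fit in a uint16 as the note above implies (the helper name is hypothetical):

```python
UINT16_MAX = 0xFFFF  # 65535


def boundary_is_valid(next_boundary):
    """True when the value fits in an unsigned 16-bit field."""
    return 0 <= next_boundary <= UINT16_MAX


observed = 0x100EE               # value seen on BE9558H (65774 decimal)
overshoot = observed - UINT16_MAX  # how far past the uint16 limit it lands
```

Here boundary_is_valid(observed) is False, and the value overshoots the limit by 239.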

Diagnostic path:

  1. /device/info slow, choked on event walk
  2. Built lightweight probe endpoints (storage_range, index) — useful but didn't reach the wedged unit
  3. Built /device/rescue with short timeouts — got 502 (POLL no response)
  4. Built /device/stop_monitoring_blind — first version was a false positive (no POLL preamble); fixed by including SESSION_RESET+POLL_PROBE+SESSION_RESET+POLL_DATA in the dump
  5. Verified blind stop works on bench unit
  6. Built /device/stop_monitoring_spam — 420 successful sends over 5 min, zero behavior change on field unit
  7. Inspected ACEmanager logs → saw outbound dial-out attempts every ~30s, confirmed device was not fully locked up
  8. Added outbound port-12345 firewall block → outbound attempts now fail instantly but contention persisted
  9. Built /device/stop_monitoring_slow_drip — session died at 3s with broken pipe (modem closing on us)
  10. Looked at full ACEmanager Port Configuration → found Destination Address: 50.197.32.92 configured, realized every AT dial command was triggering a modem mode-flip that killed our inbound
  11. Cleared Destination Address + Port → slow_drip held 120s, device responded with 990 bytes, 39 stop commands acked
  12. Disabled ACH at device level via /device/call_home, erased events

Final state: device IDLE, memory 958.1 / 960 KB free, ACH disabled at device level, modem destination cleared (to be restored after physical service).

Total time from the first attempt ("i was wondering if its possible to") to recovery: ~7 hours of intermittent debugging across one evening.