serversdown/seismo-relay

Histogram body codec — full RE + peak-count fix that resolves the prod inflation incident #26

Merged

serversdown merged 5 commits from feat/wire-histogram-codec into dev

2026-05-22 13:08:04 -04:00

Author	SHA1	Message	Date
serversdown	ed6982c512	scripts: bw_report preservation check for backfill safety Two-step tool to verify that backfill_sidecars doesn't wipe the bw_report block from existing sidecars. Workflow: 1. snapshot --out before.json (canonical-JSON hash per sidecar) 2. run backfill 3. diff --baseline before.json (classifies every sidecar: PRESERVED / CHANGED / WIPED / STILL_MISSING / NEW / ADDED / REMOVED) Exit code 1 if any WIPED or CHANGED entries found, 0 otherwise — so it can gate a CI step or a deploy script. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 06:13:52 +00:00
serversdown	d506ebc103	histogram_codec: peak count is uint8 (not uint16 LE) — properly cracks the BE9558 / BE18003 extension-byte case The bytes at [7]/[11]/[15]/[19] are an annotation field (purpose still unclear — empirically non-zero on intervals with sub-Hz or unmeasurable freq), NOT the high byte of the peak count. The N844 fixture corpus the original RE was done against had zero values in those bytes for every block, so uint8 and uint16 LE were equivalent there — but on real BE9558 Tran-drift events and BE18003 Histogram+Continuous events the uint16 LE interpretation produced peaks up to 268 in/s and 35× inflated PVS sums. Cross-correlated against BW's per-interval ASCII export on: - K558LKZU/LL1P/LL3K → 100% T/V/L/M peak match (1435 blocks each) - T003LKZR/LL0O/LL1M → 100% T/V/L, 99.3% M (0.05 dB rounding only) - N599LKZS/LL0L → 100% all channels - N844 fixture corpus → 100% all channels (unchanged) Annotations preserved on every record for future RE; the defensive _MAX_PEAK_COUNT bound is no longer needed (uint8 maxes at 1.275 in/s, well below any physical limit). Synthetic regression test added using the verbatim K558LKZU.RE0H interval-12 block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 06:05:19 +00:00
serversdown	e949232875	histogram_codec + backfill: tighter peak ceiling, preserve bw_report histogram_codec: drop _MAX_PEAK_COUNT 4096 → 2200. The old ceiling let extension-byte blocks slip through at up to 20.48 in/s per channel, producing 35× inflated PVS sums when first deployed to prod. 2200 covers Normal-range full-scale (10 in/s = 2000 counts) plus 10% headroom for quantization edge cases. backfill_sidecars: also preserve the bw_report block alongside review + extensions when regenerating sidecars. event_to_sidecar_dict takes a BwAsciiReport dataclass not a dict, so for bw_report we overlay the existing block after regen rather than passing as a kwarg. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 02:50:10 +00:00
serversdown	bc5a2d3f19	histogram_codec: defensive bounds-check on peak counts Discovered while running the backfill on prod: certain histogram blocks contain an undocumented extension byte format whose naive uint16 LE interpretation yields physically impossible peak values (150+ in/s when the device max is 10). Concrete example from K558LKSG.3I0H block at body+7424: bytes [6:10] = 05 79 69 00 current code: T_peak = uint16 LE = 0x7905 = 30981 → 154.9 in/s reality: T_peak = byte[6] = 5 → 0.025 in/s (matches BW display) The high byte (0x79 here) appears to be an extension field — possibly "time of peak within interval" or a Histogram+Continuous sub-mode marker. Observed across BE9558 and BE18003 units in prod data; never appeared in the BE12844 fixture corpus the codec was originally verified against. Effect on prod: 26 out of 1433 blocks in this one event had inflated peaks, plus dozens of similar events across the fleet → sum(PVS) inflated from baseline 988 to 34501 (35x). Rolled back via the pre-backfill snapshot before any UI exposure. Defensive fix: bounds-check peak counts in `_decode_block`. Any field exceeding `_MAX_PEAK_COUNT` (4096 = ~20 in/s, well past the device's 10 in/s Normal-range FS) causes the block to be skipped entirely. Other valid blocks in the same event still decode correctly. Trade-off: those skipped blocks lose their per-interval data (peaks + frequencies). Acceptable until the extension format is reverse-engineered — better than propagating bogus values into PVS computations downstream. The 24 existing tests all still pass — the fixtures used during the original codec development don't exercise the extension-byte case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 02:17:33 +00:00
serversdown	88549bc659	backfill_sidecars: filter out Thor IDF files Discovered while dry-running the backfill on prod: the waveform store contains both BW (.AB0*/.N00) and Thor IDF (.IDFW/.IDFH) event files side-by-side because both go through the same per-serial directory layout. The script's `_looks_like_event_file` heuristic accepted any 3-4 char extension ending in W or H, which matched both BW and IDF. The script then routes everything through `event_file_io.read_blastware_file`, which rejects IDF files with "not a Blastware file (bad header prefix)" — 3807 errors on prod out of 7201 total events. Thor IDF events have their own ingest path (`WaveformStore.save_imported_idf`) and their sidecars are populated at ingest from the paired `.IDFW.txt` ASCII report. The backfill script has no value to add for them — there's no decoder to refresh, and the sidecar metadata is already correct. Filter them out. After this fix, the prod backfill should run clean: ~3392 BW events get sidecar+h5 regen as expected; the ~3807 Thor IDF events are silently skipped. The proper "IDF backfill" (refresh tool_version stamp on IDF sidecars by re-running event_to_sidecar_dict against the stored DB row + sidecar extensions block) is a separate, narrower follow-up — not blocking the BW backfill rollout. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-21 01:20:08 +00:00