Histogram body codec — full RE + peak-count fix that resolves the prod inflation incident #26

Merged
serversdown merged 5 commits from feat/wire-histogram-codec into dev 2026-05-22 13:08:04 -04:00

Summary
Wires the verified MiniMate Plus histogram body codec into the import path, fully decoding .AB0H event files for the first time. Includes the post-mortem fix for the production incident where a faulty initial codec produced 35× inflated PVS sums on certain units (BE9558 / BE18003 Histogram+Continuous events) and required a DB rollback.

Background
The histogram body codec was previously a stub; only the waveform-mode codec (decode_waveform_v2) worked. read_blastware_file returned empty samples for any histogram event, so the DB row fell back to the event-level peaks from the BW ASCII report.

A first cut at the histogram codec (7183b95) interpreted per-channel peak counts as uint16 LE at byte offsets [6:8] / [10:12] / [14:16] / [18:20]. This happened to be byte-exact against the N844 (BE12844) fixture corpus and passed 24 regression tests.

When deployed to prod and run as part of a backfill against the live waveform store, max PVS exploded from 988 → 34501 (35×), driven by histogram events from BE9558 and BE18003 units producing per-channel peaks up to 268 in/s (impossible — 10 in/s is the geophone full-scale at Normal range). Production DB was rolled back to the pre-backfill snapshot.

Root cause
The peak count field is uint8 at byte[6] / [10] / [14] / [18], not uint16 LE spanning [6:8] etc. The N844 fixture corpus has zero values in the adjacent bytes [7] / [11] / [15] / [19] for every block, making uint8 and uint16 LE numerically equivalent there. On non-N844 events those adjacent bytes hold an annotation field whose meaning isn't fully understood (empirically non-zero on intervals with sub-Hz or unmeasurable freq); the prior interpretation read them as the high byte of the peak count, producing physically impossible amplitudes.
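
The difference between the two readings can be sketched on the four bytes quoted in the commit message for K558LKSG.3I0H (05 79 69 00 at block offsets 6 through 9). This is an illustration only, not the code in minimateplus/histogram_codec.py: the function names are hypothetical, and the 0.005 in/s-per-count scale is inferred from figures in this PR (5 counts = 0.025 in/s, 2000 counts = 10 in/s).

```python
# Illustrative sketch: names are hypothetical, not the codec's real API.
IN_PER_SEC_PER_COUNT = 0.005  # inferred from this PR: 2000 counts = 10 in/s


def peak_uint16_le(block: bytes, offset: int) -> int:
    """Old (wrong) reading: two bytes, little-endian."""
    return int.from_bytes(block[offset:offset + 2], "little")


def peak_uint8(block: bytes, offset: int) -> tuple[int, int]:
    """Corrected reading: peak count in one byte, annotation in the next."""
    return block[offset], block[offset + 1]


# Bytes [6:10] of the K558LKSG.3I0H block, zero-padded so offset 6 lines up.
block = bytes(6) + bytes.fromhex("05796900")

wrong = peak_uint16_le(block, 6)         # 0x7905 = 30981 counts
peak, annotation = peak_uint8(block, 6)  # 5 counts, annotation 0x79

print(f"uint16 LE: {wrong} counts = {wrong * IN_PER_SEC_PER_COUNT:.1f} in/s")
print(f"uint8:     {peak} counts = {peak * IN_PER_SEC_PER_COUNT:.3f} in/s, "
      f"annotation 0x{annotation:02x}")
```

When the annotation byte is zero, as in every N844 fixture block, both functions return the same count, which is why the fixture tests could not tell the two readings apart.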

Cross-correlated against BW's per-interval ASCII export on:

- K558 (BE9558, Tran-drift fault): 100% T/V/L/M match across 3 events × 1435 blocks each
- T003 (BE18003, Histogram+Continuous): 100% T/V/L, 99.3% M (the 0.7% delta is 0.05 dB rounding in BW's display)
- N599 (BE13599): 100% all channels
- N844 (BE12844, original fixture corpus): 100% all channels (unchanged)

Secondary fix: bw_report preservation in backfill
scripts/backfill_sidecars.py regenerates .sfm.json sidecars from a rebuilt Event object. The Event is built from the binary + .a5.pkl + DB row, but the bw_report block (parsed from the original _ASCII.TXT at ingest time, then discarded — the .TXT itself isn't stored) wasn't preserved across regen. Pre-fix, every backfill silently wiped bw_report from every sidecar.

Now preserved verbatim, alongside the existing review and extensions preservation.
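
A minimal sketch of the overlay approach, assuming sidecars are plain dicts loaded from .sfm.json. The function name and the sample fields (tool_version, peaks) are hypothetical; only the bw_report key comes from this PR, and the real logic lives in scripts/backfill_sidecars.py.

```python
import copy

PRESERVED_KEY = "bw_report"


def overlay_bw_report(existing: dict, regenerated: dict) -> dict:
    """Return the regenerated sidecar with the old bw_report carried over.

    Neither input is mutated; the preserved block is copied verbatim.
    """
    merged = copy.deepcopy(regenerated)
    if PRESERVED_KEY in existing:
        merged[PRESERVED_KEY] = copy.deepcopy(existing[PRESERVED_KEY])
    return merged


# Hypothetical sidecar contents, for illustration only.
old = {"tool_version": "1.0", "bw_report": {"peaks": {"tran": 0.293}}}
new = {"tool_version": "1.1"}  # regen has no bw_report: the .TXT is gone

result = overlay_bw_report(old, new)
print(result)
```

Overlaying after regen, rather than passing the block into the regen call, matches the constraint noted in the commit message: event_to_sidecar_dict takes a BwAsciiReport dataclass, not a dict.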

Validation
Tier-by-tier verification on dev (10.0.0.44) against an rsync'd snapshot of prod's DB + waveform store:

| Layer | Result |
| --- | --- |
| 26 unit tests (24 N844 byte-exact + 2 synthetic K558 regression) | ✅ |
| Backfill PVS sum (snapshot ~14k events) | 2059 → 1839 (10% reduction = K558 inflation removed) |
| Idempotency (re-run --force) | identical output, max_pvs unchanged |
| bw_report preservation (10317 sidecars) | 0 WIPED, 0 CHANGED |
| Runtime API: K558 waveform.json | max_abs = 0.293 in/s, matches DB peak byte-exact |
| Runtime API: N844 baseline | max 0.005-0.01 in/s = clean noise floor |
| Top-10 PVS events post-backfill | 2 legit MiniMate Plus high-amplitude events + 8 Micromate events; sensible distribution |
Known limitations
byte[5]!=0 histogram sub-format (filed as separate work): a handful of events (T190LD5Q, O121L4L1 in prod) use a histogram body format my walker doesn't recognize (byte[5] is non-zero instead of zero). Old codec and new codec both produce 0 valid blocks on these — DB peaks come from the bw_report ASCII overlay (which is what BW computed from the same binary, so still correct). Pre-existing, not a regression. Will need binary + ASCII pairs from a few byte[5]!=0 events to RE the format.
Deploy procedure
1. Rebuild the sfm service against this branch.
2. Optionally run scripts/backfill_sidecars.py --force to apply the codec correction to existing events (this reduces inflated K558-style PVS values to their correct magnitudes).
3. For a belt-and-suspenders check, gate the backfill with scripts/check_bw_report_preservation.py: capture a baseline, run the backfill, then diff to confirm 0 WIPED.
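
The snapshot/diff idea behind that gate can be sketched as follows. This is an assumption-laden illustration, not the tool's code: the function names are hypothetical, SHA-256 is an assumed hash choice, and the status meanings are a guess at what the labels in the commit message denote. ADDED and REMOVED (sidecar files appearing or disappearing) are left out.

```python
import hashlib
import json
from typing import Iterable, Optional


def bw_report_hash(sidecar: dict) -> Optional[str]:
    """Hash the bw_report block in canonical JSON form, or None if absent."""
    report = sidecar.get("bw_report")
    if report is None:
        return None
    # Sorted keys and fixed separators make the hash independent of key order.
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def classify(before: Optional[str], after: Optional[str]) -> str:
    """Compare one sidecar's hash before and after the backfill."""
    if before is None and after is None:
        return "STILL_MISSING"
    if before is None:
        return "NEW"
    if after is None:
        return "WIPED"
    return "PRESERVED" if before == after else "CHANGED"


def exit_code(statuses: Iterable[str]) -> int:
    """1 if anything was WIPED or CHANGED, else 0, so a script can gate on it."""
    return 1 if any(s in ("WIPED", "CHANGED") for s in statuses) else 0


before = bw_report_hash({"bw_report": {"a": 1, "b": 2}})
after = bw_report_hash({"bw_report": {"b": 2, "a": 1}})  # same data, new order
print(classify(before, after), exit_code([classify(before, after)]))
```

Hashing a canonical form rather than the raw file bytes means a regen that only reorders keys or changes whitespace is not reported as CHANGED.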
Files
minimateplus/histogram_codec.py — new, full codec implementation (uint8 peaks, uint16 LE half-periods, annotation byte preserved as record["annotations"] tuple for future RE)
minimateplus/event_file_io.py — wires codec into read_blastware_file, leaves peak_values=None when samples can't be decoded (so DB peaks fall back to bw_report rather than getting overwritten with zeros)
scripts/backfill_sidecars.py — preserves bw_report block on regen, filters Thor IDF files from the BW-only walk, cascades .h5 regen when sidecar is stale
scripts/check_bw_report_preservation.py — new, snapshot/diff tool for verifying backfill doesn't wipe bw_report (used in dev validation)
tests/test_histogram_codec.py — 26 tests including a synthetic K558 interval-12 block as regression lock for the uint8 fix
docs/histogram_codec_re_status.md — status doc with the RE history including the uint8 retraction
Dockerfile — includes scripts/ and micromate/ directories (were missing, broke Thor IDF endpoint)

serversdown added 5 commits 2026-05-22 13:07:47 -04:00
Discovered while dry-running the backfill on prod: the waveform store
contains both BW (.AB0*/.N00) and Thor IDF (.IDFW/.IDFH) event files
side-by-side because both go through the same per-serial directory
layout.  The script's `_looks_like_event_file` heuristic accepted any
3-4 char extension ending in W or H, which matched both BW and IDF.

The script then routes everything through
`event_file_io.read_blastware_file`, which rejects IDF files with
"not a Blastware file (bad header prefix)" — 3807 errors on prod
out of 7201 total events.

Thor IDF events have their own ingest path
(`WaveformStore.save_imported_idf`) and their sidecars are populated
at ingest from the paired `.IDFW.txt` ASCII report.  The backfill
script has no value to add for them — there's no decoder to refresh,
and the sidecar metadata is already correct.  Filter them out.
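
A rough sketch of the filter, assuming the heuristic is exactly the rule described above (3-4 character extension ending in W or H). The function name is hypothetical and the real `_looks_like_event_file` may apply further rules, for example for `.N00` files, that are not reproduced here.

```python
from pathlib import Path

# Thor IDF extensions named in this commit; they have their own ingest path.
THOR_IDF_EXTENSIONS = {".IDFW", ".IDFH"}


def looks_like_bw_event_file(path: str) -> bool:
    """True for BW-style event extensions, False for Thor IDF files."""
    ext = Path(path).suffix.upper()
    if ext in THOR_IDF_EXTENSIONS:
        return False  # skip silently: nothing for the backfill to refresh
    stem = ext[1:]  # drop the leading dot
    return len(stem) in (3, 4) and stem.isalnum() and stem[-1] in "WH"


for name in ("K558LKSG.3I0H", "EVENT.AB0W", "EVENT.IDFW", "EVENT.IDFW.txt"):
    print(name, looks_like_bw_event_file(name))
```

Checking the IDF set before the generic rule keeps the exclusion explicit, so a later change to the extension heuristic cannot quietly readmit IDF files.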

After this fix, the prod backfill should run clean: ~3392 BW events
get sidecar+h5 regen as expected; the ~3807 Thor IDF events are
silently skipped.

The proper "IDF backfill" (refresh tool_version stamp on IDF
sidecars by re-running event_to_sidecar_dict against the stored
DB row + sidecar extensions block) is a separate, narrower
follow-up — not blocking the BW backfill rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Discovered while running the backfill on prod: certain histogram
blocks contain an undocumented extension byte format whose naive
uint16 LE interpretation yields physically impossible peak values
(150+ in/s when the device max is 10).  Concrete example from
K558LKSG.3I0H block at body+7424:

  bytes [6:10] = 05 79 69 00
  current code: T_peak = uint16 LE = 0x7905 = 30981 → 154.9 in/s
  reality:     T_peak = byte[6] = 5 → 0.025 in/s (matches BW display)

The high byte (0x79 here) appears to be an extension field — possibly
"time of peak within interval" or a Histogram+Continuous sub-mode
marker.  Observed across BE9558 and BE18003 units in prod data; never
appeared in the BE12844 fixture corpus the codec was originally
verified against.

Effect on prod: 26 out of 1433 blocks in this one event had inflated
peaks, plus dozens of similar events across the fleet → sum(PVS)
inflated from baseline 988 to 34501 (35x).  Rolled back via the
pre-backfill snapshot before any UI exposure.

Defensive fix: bounds-check peak counts in `_decode_block`.  Any
field exceeding `_MAX_PEAK_COUNT` (4096 = ~20 in/s, well past the
device's 10 in/s Normal-range FS) causes the block to be skipped
entirely.  Other valid blocks in the same event still decode
correctly.

Trade-off: those skipped blocks lose their per-interval data
(peaks + frequencies).  Acceptable until the extension format is
reverse-engineered — better than propagating bogus values into PVS
computations downstream.

The 24 existing tests all still pass — the fixtures used during the
original codec development don't exercise the extension-byte case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
histogram_codec: drop _MAX_PEAK_COUNT 4096 → 2200. The old ceiling
let extension-byte blocks slip through at up to 20.48 in/s per
channel, producing 35× inflated PVS sums when first deployed to
prod. 2200 covers Normal-range full-scale (10 in/s = 2000 counts)
plus 10% headroom for quantization edge cases.

backfill_sidecars: also preserve the bw_report block alongside
review + extensions when regenerating sidecars. event_to_sidecar_dict
takes a BwAsciiReport dataclass not a dict, so for bw_report we
overlay the existing block after regen rather than passing as a kwarg.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
the BE9558 / BE18003 extension-byte case

The bytes at [7]/[11]/[15]/[19] are an annotation field (purpose still
unclear — empirically non-zero on intervals with sub-Hz or unmeasurable
freq), NOT the high byte of the peak count.  The N844 fixture corpus
the original RE was done against had zero values in those bytes for
every block, so uint8 and uint16 LE were equivalent there — but on
real BE9558 Tran-drift events and BE18003 Histogram+Continuous events
the uint16 LE interpretation produced peaks up to 268 in/s and 35×
inflated PVS sums.

Cross-correlated against BW's per-interval ASCII export on:
  - K558LKZU/LL1P/LL3K  → 100% T/V/L/M peak match (1435 blocks each)
  - T003LKZR/LL0O/LL1M  → 100% T/V/L, 99.3% M (0.05 dB rounding only)
  - N599LKZS/LL0L        → 100% all channels
  - N844 fixture corpus  → 100% all channels (unchanged)

Annotations preserved on every record for future RE; the defensive
_MAX_PEAK_COUNT bound is no longer needed (uint8 maxes at 1.275 in/s,
well below any physical limit).

Synthetic regression test added using the verbatim K558LKZU.RE0H
interval-12 block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two-step tool to verify that backfill_sidecars doesn't wipe the
bw_report block from existing sidecars.  Workflow:

  1. snapshot --out before.json    (canonical-JSON hash per sidecar)
  2. run backfill
  3. diff --baseline before.json   (classifies every sidecar:
       PRESERVED / CHANGED / WIPED / STILL_MISSING / NEW / ADDED / REMOVED)

Exit code 1 if any WIPED or CHANGED entries found, 0 otherwise — so
it can gate a CI step or a deploy script.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
serversdown merged commit 9ef424d098 into dev 2026-05-22 13:08:04 -04:00