Mirror what the ingest path does: BW's reported peaks (and sample_rate
/ record_time) take precedence over codec output where present.
Without this, --force backfill silently overwrites bw_report-overlaid
DB columns with codec-derived peaks. That is wrong for events the codec
doesn't fully decode (waveform walker edge cases on SP0/SS0/SV0-style
events, and the histogram byte[5]!=0 sub-format that isn't yet RE'd),
and it produces PVS=0 on real high-amplitude events. This bit us on
prod on 2026-05-22, when three top-10 waveform events ended up at
PVS=0 (rolled back the same day; this fix is the proper resolution).
New helper minimateplus.event_file_io.apply_bw_report_dict_to_event
operates on the projected sidecar dict shape (the structure
_bw_report_to_dict produces, which is what gets preserved in the
sidecar). Mirrors apply_report_to_event's semantics: only writes
fields where bw_report has a non-None value, no-ops cleanly on
empty / None input.
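The overlay semantics described above can be sketched as follows. This is a minimal illustration, not the actual helper: the function name `overlay_bw_report` and the field names in `_OVERLAY_FIELDS` are hypothetical, and the real projected sidecar dict may be shaped differently.

```python
from typing import Any, Mapping, MutableMapping, Optional

# ASSUMPTION: illustrative field names only; the real dict shape may differ.
_OVERLAY_FIELDS = ("sample_rate", "record_time", "peaks")


def overlay_bw_report(event: MutableMapping[str, Any],
                      bw_report: Optional[Mapping[str, Any]]) -> int:
    """Copy non-None bw_report values onto event; return the number written."""
    if not bw_report:              # None or {} is a clean no-op
        return 0
    written = 0
    for key in _OVERLAY_FIELDS:
        value = bw_report.get(key)
        if value is not None:      # keep the codec value when BW has nothing
            event[key] = value
            written += 1
    return written


event = {"sample_rate": 1024, "record_time": 3.0, "peaks": {"Tran": 0.0}}
report = {"sample_rate": None, "peaks": {"Tran": 1.25}}
fields_written = overlay_bw_report(event, report)
```

Only `peaks` is overwritten here: `sample_rate` is None in the report and `record_time` is absent, so the codec-derived values for both survive.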
Dev validation against prod snapshot:
  pre : pvs_sum 1839.7315, 356 events with DB PVS ≠ sidecar bw_report
  post: pvs_sum 2016.4902, 2 events still mismatched (both have NULL
        timestamp + duplicate rows, edge case)
Both edge-case events DO get the correct value written by the new
backfill — their stale rows from prior backfills remain because
UNIQUE(serial, timestamp) doesn't fire on NULL. Separate dedup
cleanup needed for those 2 events (0.014% of corpus); not blocking.
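The NULL behaviour is standard SQL and easy to reproduce in isolation. The table below is a hypothetical stand-in (its name, the `pvs` column and the serial value are invented for the demo); only the UNIQUE(serial, timestamp) constraint is taken from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (serial TEXT, timestamp TEXT, pvs REAL, "
    "UNIQUE(serial, timestamp))"
)

# Same serial, NULL timestamp twice: both inserts succeed because NULLs
# are treated as distinct values for UNIQUE purposes.
conn.execute("INSERT INTO events VALUES ('BE0001', NULL, 0.0)")
conn.execute("INSERT INTO events VALUES ('BE0001', NULL, 1.5)")
null_rows = conn.execute(
    "SELECT COUNT(*) FROM events WHERE timestamp IS NULL"
).fetchone()[0]

# With a non-NULL timestamp the constraint does fire.
conn.execute("INSERT INTO events VALUES ('BE0001', '2026-05-22T10:00:00', 0.2)")
try:
    conn.execute(
        "INSERT INTO events VALUES ('BE0001', '2026-05-22T10:00:00', 0.3)"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

This is why an upsert keyed on the constraint cannot replace the stale NULL-timestamp rows, and why they need a separate cleanup pass.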
Backfill remains idempotent + bw_report preservation still passes
(0 WIPED, 0 CHANGED on the 3rd consecutive run).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
histogram_codec: drop _MAX_PEAK_COUNT 4096 → 2200. The old ceiling
let extension-byte blocks slip through at up to 20.48 in/s per
channel, producing 35× inflated PVS sums when first deployed to
prod. 2200 covers Normal-range full-scale (10 in/s = 2000 counts)
plus 10% headroom for quantization edge cases.
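The arithmetic behind the new ceiling, as a small sketch. The constant and function names are illustrative rather than the module's own, and the range check is an assumed form of the filter; the 0.005 in/s per count factor is the Normal-range scale used by the histogram layout.

```python
IN_PER_S_PER_COUNT = 0.005                 # Normal-range scale factor
FULL_SCALE_COUNTS = 2000                   # 10 in/s at Normal range
HEADROOM_COUNTS = FULL_SCALE_COUNTS // 10  # 10% headroom
MAX_PEAK_COUNT = FULL_SCALE_COUNTS + HEADROOM_COUNTS
OLD_MAX_PEAK_COUNT = 4096


def count_to_in_per_s(count: int) -> float:
    """Convert a peak count to in/s at Normal range."""
    return count * IN_PER_S_PER_COUNT


def peak_count_in_range(count: int) -> bool:
    """ASSUMPTION: hypothetical form of the ceiling check."""
    return 0 <= count <= MAX_PEAK_COUNT


old_ceiling_in_per_s = count_to_in_per_s(OLD_MAX_PEAK_COUNT)
new_ceiling_in_per_s = count_to_in_per_s(MAX_PEAK_COUNT)
```

Under these numbers the old ceiling admits values up to roughly twice Normal-range full scale, while the new one stops at about 11 in/s.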
backfill_sidecars: also preserve the bw_report block alongside
review + extensions when regenerating sidecars. event_to_sidecar_dict
takes a BwAsciiReport dataclass not a dict, so for bw_report we
overlay the existing block after regen rather than passing as a kwarg.
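A rough sketch of the overlay-after-regen step, assuming sidecars are plain dicts with a top-level `bw_report` key. The function name `carry_over_bw_report` is hypothetical.

```python
from typing import Any, Dict, Optional


def carry_over_bw_report(regenerated: Dict[str, Any],
                         existing: Optional[Dict[str, Any]]) -> Dict[str, Any]:
    """Return the regenerated sidecar with the old bw_report re-attached."""
    merged = dict(regenerated)  # shallow copy; leave the input untouched
    if existing and existing.get("bw_report") is not None:
        merged["bw_report"] = existing["bw_report"]
    return merged


old_sidecar = {"bw_report": {"peaks": {"Tran": 1.25}}, "review": {}}
new_sidecar = {"source": {"tool_version": "0.20.0"}}
result = carry_over_bw_report(new_sidecar, old_sidecar)
```

Doing the overlay on the finished dict sidesteps the dataclass-versus-dict mismatch: nothing has to be converted back into a BwAsciiReport.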
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Discovered while dry-running the backfill on prod: the waveform store
contains both BW (.AB0*/.N00) and Thor IDF (.IDFW/.IDFH) event files
side-by-side because both go through the same per-serial directory
layout. The script's `_looks_like_event_file` heuristic accepted any
3-4 char extension ending in W or H, which matched both BW and IDF.
The script then routes everything through
`event_file_io.read_blastware_file`, which rejects IDF files with
"not a Blastware file (bad header prefix)" — 3807 errors on prod
out of 7201 total events.
Thor IDF events have their own ingest path
(`WaveformStore.save_imported_idf`) and their sidecars are populated
at ingest from the paired `.IDFW.txt` ASCII report. The backfill
script has no value to add for them — there's no decoder to refresh,
and the sidecar metadata is already correct. Filter them out.
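One way the filter could look. This is a sketch under assumptions: `looks_like_bw_event_file` is a hypothetical name, and it models only the "3-4 char extension ending in W or H" part of the heuristic described above, not whatever else the real `_looks_like_event_file` accepts.

```python
from pathlib import Path

# Thor IDF extensions named in this message; excluded up front.
IDF_EXTENSIONS = {".IDFW", ".IDFH"}


def looks_like_bw_event_file(path: str) -> bool:
    """True for W/H-suffixed event files that are not Thor IDF files."""
    suffix = Path(path).suffix.upper()
    if suffix in IDF_EXTENSIONS:
        return False
    ext = suffix[1:]  # drop the leading dot
    return len(ext) in (3, 4) and ext[-1:] in ("W", "H")


candidates = ["N844L6Z8.ZR0H", "EVENT01.IDFW", "EVENT01.IDFH", "EVENT01.IDFW.txt"]
accepted = [name for name in candidates if looks_like_bw_event_file(name)]
```

The paired `.IDFW.txt` report is rejected as well, since its final suffix is `.txt`.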
After this fix, the prod backfill should run clean: ~3392 BW events
get sidecar+h5 regen as expected; the ~3807 Thor IDF events are
silently skipped.
The proper "IDF backfill" (refresh tool_version stamp on IDF
sidecars by re-running event_to_sidecar_dict against the stored
DB row + sidecar extensions block) is a separate, narrower
follow-up — not blocking the BW backfill rollout.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The histogram-mode event body is now byte-exact decodable.
Companion to the waveform body codec — together they cover every
event file the watcher forwards. Cracked in one session via
cross-event correlation against BW's ASCII export.
The §7.6.2 spec in instantel_protocol_reference.md was structurally
correct (32-byte blocks) but the per-sample semantics were
under-documented. Cross-checking block 130 of N844L6Z8.ZR0H
against its TXT row revealed the layout perfectly:
    slot[0] = 10               (constant marker)
    slot[1] = T_peak_count     (× 0.005 → in/s at Normal range)
    slot[2] = T_halfperiod     (freq_Hz = 512 / halfp)
    slot[3] = V_peak_count
    slot[4] = V_halfperiod
    slot[5] = L_peak_count
    slot[6] = L_halfperiod
    slot[7] = MicL_peak_count  (dB via waveform_codec.mic_count_to_db)
    slot[8] = MicL_halfperiod
The `>100 Hz` sentinel is halfperiod ≤ 5 (512/5 = 102.4 Hz is the first
value above 100 Hz; halfperiod 6 gives 85.3 Hz).
Mic dB uses the SAME formula as the waveform codec (sign × (81.94
+ 20·log10(|count|))) — they share the mic ADC calibration constant.
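The per-slot conversions above can be sketched as below. These are stand-ins written from the formulas in this message, not the shipped helpers: the function names are illustrative, and returning None for the sentinel and 0.0 dB for a zero count are assumptions about edge-case handling.

```python
import math
from typing import Optional


def halfperiod_to_hz(halfp: int) -> Optional[float]:
    """freq_Hz = 512 / halfp; None stands in for the >100 Hz sentinel."""
    if halfp <= 5:  # 512/5 = 102.4 Hz, already above 100 Hz
        return None
    return 512.0 / halfp


def geo_count_to_in_per_s(count: int) -> float:
    """Geophone peak count to in/s at Normal range (x 0.005)."""
    return count * 0.005


def mic_count_to_db_sketch(count: int) -> float:
    """sign x (81.94 + 20*log10(|count|)); 0.0 for count 0 is an assumption."""
    if count == 0:
        return 0.0
    return math.copysign(81.94 + 20.0 * math.log10(abs(count)), count)


freq_hz = halfperiod_to_hz(16)
peak_in_per_s = geo_count_to_in_per_s(200)
mic_db = mic_count_to_db_sketch(10)
```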
Block identification anchor: bytes [22:24] == 0x0000 AND
bytes [28:32] == 1e 0a 00 00. The tail signature is the most
reliable distinguisher from non-block content in the file.
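The anchor test translates directly into a byte comparison. In the sketch below, the two anchor conditions and the 32-byte block size come from this message; everything else is assumed for illustration, in particular the slot encoding (nine little-endian signed 16-bit values starting at offset 0) and the byte-by-byte scan strategy.

```python
import struct
from typing import List, Tuple

BLOCK_SIZE = 32
TAIL_SIGNATURE = bytes.fromhex("1e0a0000")


def is_histogram_block(block: bytes) -> bool:
    """Anchor check: bytes [22:24] are zero and the tail signature matches."""
    return (
        len(block) == BLOCK_SIZE
        and block[22:24] == b"\x00\x00"
        and block[28:32] == TAIL_SIGNATURE
    )


def unpack_slots(block: bytes) -> Tuple[int, ...]:
    # ASSUMPTION: slots are little-endian int16 at offset 0 (not confirmed).
    return struct.unpack_from("<9h", block, 0)


def find_block_offsets(body: bytes) -> List[int]:
    """Scan for anchored blocks, jumping a full block after each hit."""
    offsets, i = [], 0
    while i + BLOCK_SIZE <= len(body):
        if is_histogram_block(body[i:i + BLOCK_SIZE]):
            offsets.append(i)
            i += BLOCK_SIZE
        else:
            i += 1
    return offsets


# Synthetic block for the demo: 9 slots, zero padding, then the tail.
sample_block = (
    struct.pack("<9h", 10, 200, 16, 100, 20, 50, 32, 4, 8)
    + bytes(10)
    + TAIL_SIGNATURE
)
sample_body = b"\xff" * 7 + sample_block * 2 + b"\xff" * 3
offsets = find_block_offsets(sample_body)
```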
Files:
  minimateplus/histogram_codec.py (new): decoder + public API
    matching the waveform codec's shape:
      walk_body(body) -> records
      decode_histogram_body(body) -> {Tran, Vert, Long, MicL}
      decode_histogram_body_full(body) -> [per-interval dicts]
      half_period_to_hz, geo_count_to_ins helpers
  minimateplus/event_file_io.py (modified): read_blastware_file
    now tries the waveform codec first and falls back to the histogram
    codec on failure. Same output shape, same downstream pipeline.
  tests/test_histogram_codec.py (new): 24 regression locks against
    the in-repo fixture corpus, byte-exact against BW ASCII export
    for peaks (all 4 channels), frequencies (all 4 channels,
    including >100 Hz sentinel handling), block framing, and
    segment-ID accounting.
  scripts/backfill_sidecars.py (modified): the has_samples
    short-circuit added in the histogram-pending era is now a
    pure defensive guard. Histograms in prod will regen .h5 files
    correctly on the next backfill run.
  docs/histogram_codec_re_status.md (updated): supersedes the
    earlier "in progress" version with the verified format and
    test-coverage summary. Notes a few non-essential fields still
    open (4-byte block metadata, Geo PVS, Mic psi(L)), none of
    which are needed for waveform reconstruction.
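The try-waveform-then-histogram flow in read_blastware_file can be pictured roughly like this. It is a sketch only: `DecodeError`, `decode_with_fallback` and the two stub decoders are hypothetical, and how the real waveform codec signals failure is not stated here.

```python
from typing import Callable, Dict, List

Channels = Dict[str, List[float]]


class DecodeError(Exception):
    """Hypothetical failure signal raised by a codec."""


def decode_with_fallback(body: bytes,
                         waveform_decoder: Callable[[bytes], Channels],
                         histogram_decoder: Callable[[bytes], Channels]) -> Channels:
    """Try the waveform codec first; fall back to the histogram codec."""
    try:
        return waveform_decoder(body)
    except DecodeError:
        return histogram_decoder(body)


def _stub_waveform(body: bytes) -> Channels:
    # Stand-in decoder: accepts only bodies starting with b"W".
    if not body.startswith(b"W"):
        raise DecodeError("not a waveform body")
    return {"Tran": [0.1], "Vert": [0.2], "Long": [0.3], "MicL": [0.0]}


def _stub_histogram(body: bytes) -> Channels:
    # Stand-in decoder: same output shape as the waveform stub.
    return {"Tran": [1.0], "Vert": [2.0], "Long": [3.0], "MicL": [0.0]}


waveform_result = decode_with_fallback(b"W...", _stub_waveform, _stub_histogram)
histogram_result = decode_with_fallback(b"H...", _stub_waveform, _stub_histogram)
```

Because both decoders return the same channel mapping, callers downstream do not need to know which codec produced the result.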
Total verified coverage: ~3,500 blocks across 5 fixtures, every
field of every block byte-exact against BW.
The watcher-forwarded histogram event corpus on prod (~10,000
events) will now produce correct .h5 sidecars on the next backfill
run. No additional changes needed to the backfill flow — the
existing tool_version-bump cascade picks them up automatically.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Discovered while dry-running the backfill on the prod store: ~10,000
of ~10,059 events are histogram-mode (filename extension `*H`), and
the waveform-body codec wired in via the previous commit doesn't
handle histogram-mode bodies. Only the waveform-mode codec (§7.6.1)
is implemented; the histogram-mode codec (§7.6.2 of the protocol
reference) is documented, but no Python implementation exists yet.
Without this guard, every histogram event's .h5 file would be
*replaced* with an empty one — strictly worse than today's
broken-int16-LE .h5 because any downstream viewer expecting
non-empty sample arrays would now error out instead of just
rendering wrong values.
Fix: after the decoder runs, check whether any channel has samples.
If not, skip the .h5 write entirely. The sidecar still regenerates
(refreshing the tool_version stamp and any peaks/project info from
the DB row), but the existing .h5 is left untouched.
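A minimal sketch of the guard, assuming decoded output is a mapping from channel name to a sample sequence. The helper names are illustrative; the two status strings mirror the dry-run log lines.

```python
from typing import Dict, Sequence


def has_samples(channels: Dict[str, Sequence[float]]) -> bool:
    """True if at least one channel decoded to a non-empty sample array."""
    return any(len(samples) > 0 for samples in channels.values())


def h5_status(channels: Dict[str, Sequence[float]]) -> str:
    """Decide what the backfill would report for the .h5 file."""
    if has_samples(channels):
        return "would (re)write"
    return "skipped-empty-samples"  # leave the existing .h5 untouched


waveform_like = {"Tran": [0.01, 0.02], "Vert": [], "Long": [], "MicL": []}
histogram_like = {"Tran": [], "Vert": [], "Long": [], "MicL": []}
statuses = (h5_status(waveform_like), h5_status(histogram_like))
```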
This is a *temporary* gate. When the histogram codec lands (next
branch: `feat/wire-histogram-codec`), the has_samples check can be
removed and the backfill will then correctly regenerate all .h5
files, histogram and waveform alike.
Observed effect (dry-run on prod store, 10,059 events):
- waveform events (~5%): "[DRY ] would write … + .h5 (would (re)write)"
- histogram events (~95%): "[DRY ] would write … + .h5 (skipped-empty-samples)"
- sidecar tool_version bump succeeds for both
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled changes that close the rollout gap left by the
read_blastware_file codec wiring:
1. minimateplus/event_file_io.py: bump TOOL_VERSION from 0.16.1 to
0.20.0. This is the version stamp the backfill script reads from
each sidecar's source.tool_version field to detect "this sidecar
was written before the current decoder shipped, regenerate it."
Bumping past every value baked into existing prod sidecars flags
them all as stale on the next backfill run — which is exactly what
we want, since every pre-codec-wiring sidecar was written by the
retracted int16-LE decoder.
2. scripts/backfill_sidecars.py: when the sidecar is being
regenerated this iteration (sha mismatch, tool_version too old,
or --force), also regenerate the .h5. Previously the .h5 logic
only rewrote when --force was passed or the file was missing —
so a tool_version-driven sidecar regen left the broken .h5 in
place forever. Added a `sidecar_stale` boolean to track the
"we're rewriting the sidecar this iteration" state and wired it
into the h5 need-rewrite check.
Path coverage (verified by trace):
- sidecar missing → both regen
- --force → both regen
- sha mismatch → both regen
- tool_ver too old → both regen (THE post-codec-wiring case)
- everything OK → skip iteration entirely (h5 untouched)
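The path-coverage table can be expressed as two small predicates. This is a sketch of the decision logic, not the script's code: the function names are hypothetical, and comparing versions as integer tuples is an assumption about how `source.tool_version` is ordered.

```python
from typing import Tuple

TOOL_VERSION = "0.20.0"


def parse_version(text: str) -> Tuple[int, ...]:
    """'0.16.1' -> (0, 16, 1). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))


def sidecar_is_stale(sidecar_exists: bool, sha_matches: bool,
                     sidecar_tool_version: str, force: bool) -> bool:
    """True when the sidecar will be rewritten this iteration."""
    if force or not sidecar_exists or not sha_matches:
        return True
    return parse_version(sidecar_tool_version) < parse_version(TOOL_VERSION)


def h5_needs_rewrite(sidecar_stale: bool, h5_exists: bool) -> bool:
    """The .h5 follows the sidecar; a missing .h5 is also rewritten."""
    return sidecar_stale or not h5_exists


old_sidecar_stale = sidecar_is_stale(True, True, "0.16.1", force=False)
current_sidecar_stale = sidecar_is_stale(True, True, "0.20.0", force=False)
```

Tuple comparison matters for versions such as 0.9.0, which sorts after 0.20.0 as a plain string but before it numerically.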
Operator review state (review.false_trigger, reviewer, notes) and
the sidecar's extensions block are preserved across regen by the
existing read-existing-sidecar / pass-into-event_to_sidecar_dict
path — unchanged from prior behavior.
Deploy procedure (on prod):
1. Pull this change + the read_blastware_file codec wiring.
2. `python scripts/backfill_sidecars.py --dry-run` to preview.
Every sidecar with source.tool_version<0.20.0 will show as
"would (re)write".
3. Run for real (drop --dry-run). Expect every pre-fix event
to regen. Big stores may take a while.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
### Added
- **Layered event storage architecture.** Each event now lands as four
files in the per-serial waveform store, each with a clear role:
- `<filename>` — the Blastware-readable binary (BW file). Untouched.
- `<filename>.a5.pkl` — the raw 5A frames (regenerative source).
- `<filename>.h5` — clean per-channel waveform arrays in physical
units (in/s for geo, psi for mic) plus event metadata (HDF5 with
gzip compression). This is the canonical format for downstream
analysis tools.
- `<filename>.sfm.json` — the modern review/metadata sidecar (peaks,
project, source provenance, review state, extensions).
SQLite (`seismo_relay.db`) is the searchable index over all four.
- **Plot-ready waveform JSON (`sfm.plot.v1`).** The `/device/event/{idx}/waveform`
and `/db/events/{id}/waveform.json` endpoints now return samples in
physical units with explicit time-axis metadata, peak markers, and
per-channel unit hints — no more guessing the ADC-to-velocity scale
client-side. The webapp waveform viewer was rewritten to consume
this shape.
- **In-app waveform viewer accuracy fix.** The standalone SFM webapp
viewer was scaling geophone amplitudes by `geoAdcScale / 32767`
(≈ 6.206 / 32767), where `geoAdcScale = 6.206053` is the device's
*in/s per V* hardware constant — not the ADC-counts-to-velocity
factor. This silently scaled every plot ~38% too low for Normal-range
geophones (the correct full-scale is 10.0 in/s, or 1.25 in/s for
Sensitive). Conversion is now done server-side using the geo_range
from compliance config; the client just plots.
- New `sfm/event_hdf5.py` module: `write_event_hdf5()`,
`read_event_hdf5()`, plus a plot-JSON helper.
- Backfill script extended to also emit `.h5` for existing events.
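
The ~38% figure in the viewer fix follows from the ratio of the two scale factors. The sketch below assumes both the old and the corrected conversion divide by the same 32767 full-scale count, so only the numerators differ; the names are illustrative and not taken from `sfm/event_hdf5.py`.

```python
GEO_ADC_SCALE = 6.206053   # in/s per V hardware constant (wrong numerator)
NORMAL_FULL_SCALE = 10.0   # in/s, Normal range
INT16_FULL_SCALE = 32767


def old_viewer_scale(count: int) -> float:
    """The retired client-side scaling."""
    return count * GEO_ADC_SCALE / INT16_FULL_SCALE


def normal_range_scale(count: int) -> float:
    """ASSUMPTION: corrected scaling uses the range full-scale instead."""
    return count * NORMAL_FULL_SCALE / INT16_FULL_SCALE


ratio = old_viewer_scale(INT16_FULL_SCALE) / normal_range_scale(INT16_FULL_SCALE)
shortfall_pct = (1.0 - ratio) * 100.0
```

Here `ratio` comes out at about 0.62, so a full-scale Normal-range event would have been drawn at roughly 6.2 in/s instead of 10 in/s.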
### Dependencies
- Added `h5py>=3.10` and `numpy>=1.24` for the HDF5 storage layer.
- Added `python-multipart>=0.0.7` (required by FastAPI for the
`/db/import/blastware_file` endpoint introduced in this release).