docs: histogram body codec RE — starting-point status doc

Captures everything learned in the 2026-05-20 session before scope forced a pause:

- Block framing is solved: 32-byte blocks, one per histogram interval, signature byte pattern `[22:24]=0x0000` + `[28:32]=0x1e 0x0a 0x00 0x00` reliably identifies data blocks.
- Block count = interval count (791 blocks in N844L20G.630H for a TXT-reported 792 intervals).
- Sample[0] = Tran peak in 0.0005 in/s/count units (verified on one event; needs cross-event confirmation).
- Samples 1-8 → channel/metric mapping is still open. None of the obvious layouts (peak-then-freq alternating, all-peaks-then-all-freqs, per-channel 3-tuples) match the TXT values across multiple blocks. Likely needs a higher-activity fixture (current N844 corpus is all noise-floor data) to disambiguate.
- `>100 Hz` sentinel encoding in the binary is unknown.
- 4-byte variable metadata field at block[24:28] needs correlation work against TXT columns.

Doc mirrors the structure of docs/waveform_codec_re_status.md so a future RE session has a familiar entry point. Includes the suggested attack plan + the code seam where the eventual decoder will land (minimateplus/histogram_codec.py).

The §7.6.2 spec in instantel_protocol_reference.md is structurally correct but doesn't pin down per-sample semantics; this doc supersedes it where they conflict on confidence level.

No code shipped on this branch. When the codec is cracked, the plan is to land minimateplus/histogram_codec.py + wire into event_file_io.read_blastware_file() + remove the has_samples short-circuit from scripts/backfill_sidecars.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
read_blastware_file: leave peak_values=None when samples can't be decoded
2026-05-20 21:13:26 +00:00 · 2026-05-20 20:30:53 +00:00 · 2026-05-20 20:16:31 +00:00 · 2026-05-20 19:58:54 +00:00
5 changed files with 259 additions and 8 deletions
@@ -8,8 +8,10 @@ RUN apt-get update && \

 COPY pyproject.toml requirements.txt ./
 COPY minimateplus ./minimateplus
+COPY micromate    ./micromate
 COPY sfm          ./sfm
 COPY bridges      ./bridges
+COPY scripts      ./scripts

 RUN pip install --no-cache-dir -e .

@@ -0,0 +1,212 @@
+# Histogram body codec — IN PROGRESS (started 2026-05-20)
+
+Working notes for the Series III histogram-mode event body codec
+reverse-engineering effort.  Mirrors the structure of
+`waveform_codec_re_status.md` (the now-completed waveform codec).  The
+historical context lives in `docs/instantel_protocol_reference.md
+§7.6.2`; this doc is the active scratchpad.
+
+## TL;DR (current state)
+
+**Block framing is solved.  Sample-to-channel mapping is open.**
+
+| Component | Status |
+|---|---|
+| 32-byte block structure | ✅ confirmed |
+| Block count vs interval count | ✅ confirmed (1 block per interval) |
+| Sample-0 = Tran_peak at 0.0005 in/s/count scale | ✅ confirmed against block 0 of one event |
+| Remaining samples 1-8 → channel mapping | ❌ open |
+| Frequency encoding (TXT shows `>100 Hz`, binary shows `1`) | ❌ open |
+| Mic dB encoding | ❌ open |
+
+The §7.6.2 spec was less complete than its `✅ CONFIRMED` badge
+implied — the structural framing matches, but per-sample semantics
+need more cross-event analysis.
+
+## Confirmed structure (2026-05-20)
+
+### Body layout
+
+```
+body = [stream of 32-byte blocks]
+```
+
+Body length isn't always a multiple of 32: 1-byte and 9-byte
+trailing remnants have been observed.  A walker should step through
+the body in 32-byte strides and stop before the tail.
+
+### 32-byte block layout
+
+```
+[0]    0x00                   always-zero (probably a fixed format tag)
+[1]    segment_id (uint8)     0x00, 0x01, 0x02, 0x03 — 256 blocks per segment
+[2:4]  block_ctr (uint16 LE)  resets each segment (0x0100, 0x0101, ...)
+[4:22] 9× int16 LE samples
+[22:24] 0x00 0x00              constant
+[24:28] 4-byte variable        unknown — possibly timestamp delta or CRC
+[28:30] 0x1e 0x0a              constant signature (`30, 10`)
+[30:32] 0x00 0x00              constant
+```
+
+Anchor for finding data blocks during a body walk: `block[22:24] ==
+b"\x00\x00"` AND `block[28:32] == b"\x1e\x0a\x00\x00"`.  The
+constant signature at bytes 28-31 is the most reliable distinguisher
+from any other 32-byte content in the file.
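The layout and anchor rule above can be sketched as a small block walker. This is a minimal illustration under the stated layout only; the function name `iter_histogram_blocks` and the tuple it yields are hypothetical and not part of the repo.

```python
import struct
from typing import Iterator, Tuple

BLOCK_SIZE = 32
SIGNATURE = b"\x1e\x0a\x00\x00"  # expected at block[28:32], per the layout above


def iter_histogram_blocks(
    body: bytes,
) -> Iterator[Tuple[int, int, Tuple[int, ...], bytes]]:
    """Yield (segment_id, block_ctr, samples, meta) for each data block.

    Hypothetical helper: walks the body in 32-byte strides and keeps only
    blocks that match the two constant fields used as the anchor.
    """
    # The range stops before any trailing remnant shorter than one block.
    for off in range(0, len(body) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = body[off:off + BLOCK_SIZE]
        if block[22:24] != b"\x00\x00" or block[28:32] != SIGNATURE:
            continue  # not a data block
        segment_id = block[1]
        (block_ctr,) = struct.unpack_from("<H", block, 2)  # uint16 LE at [2:4]
        samples = struct.unpack_from("<9h", block, 4)      # 9x int16 LE at [4:22]
        meta = block[24:28]                                # 4-byte field, meaning unknown
        yield segment_id, block_ctr, samples, meta
```

The walker deliberately leaves the 4-byte field as raw bytes, since its meaning is still open.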
+
+### Block count = interval count
+
+Confirmed against `example-events/histogram/N844L20G.630H`:
+- TXT reports `Number of Intervals : 792.00`
+- Binary contains 791 data blocks (one per interval, off-by-one at
+  the tail — probably the last interval is truncated mid-write at
+  recording stop)
+
+Implication: each block represents exactly one histogram interval
+(1 minute in this fixture, configurable per device).  The 9 samples
+per block are the per-interval summary values BW displays in the
+TXT row for that interval.
+
+### What sample 0 means
+
+Confirmed: `sample[0] / 2000 = Tran peak amplitude in in/s` for
+the Normal-range geophone.  Equivalently, sample[0] is in units of
+**0.0005 in/s per count** (NOT the 0.005 in/s display quantum or the
+1-count ADC quantum).
+
+Verified for block 0 of N844L20G.630H:
+- binary sample[0] = 10
+- TXT Tran_peak[0]  = 0.005 in/s
+- check: 10 × 0.0005 = 0.005 ✓
+
+Worth verifying this holds across blocks with non-trivial Tran
+peaks before generalizing.
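One way to run that cross-block check is sketched below. It assumes the per-block `sample[0]` values and the TXT Tran peak column have already been extracted into parallel sequences; the helper name `sample0_mismatches` is hypothetical, and the default tolerance (half of the 0.005 in/s display quantum) is an assumption about how the TXT values are rounded.

```python
from typing import List, Sequence

IN_PER_S_PER_COUNT = 0.0005  # scale observed for sample[0] on block 0 of one event


def sample0_mismatches(
    sample0: Sequence[int],
    txt_tran_peak: Sequence[float],
    tol: float = 0.0025,  # assumption: half the 0.005 in/s display quantum
) -> List[int]:
    """Return block indices where sample[0] * scale disagrees with the TXT value."""
    return [
        i
        for i, (count, txt) in enumerate(zip(sample0, txt_tran_peak))
        if abs(count * IN_PER_S_PER_COUNT - txt) > tol
    ]
```

An empty result over all blocks of an event would support the scale factor; any returned indices are the blocks worth inspecting by hand.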
+
+## Open mappings
+
+### Samples 1-8 → channel + metric
+
+TXT structure is **10 columns per interval**:
+
+```
+Tran  Tran  Vert  Vert  Long  Long  Geo   MicL  MicL   MicL
+Peak  Freq  Peak  Freq  Peak  Freq  PVS   psi   dB(L)  Freq
+in/s  Hz    in/s  Hz    in/s  Hz    in/s  psi   dB     Hz
+```
+
+Binary has **9 samples per block** (one short of the column count).
+None of the obvious mappings work:
+
+| Hypothesis | Why it fails |
+|---|---|
+| (T_peak, T_freq, V_peak, V_freq, L_peak, L_freq, Geo, M_peak, M_freq) | Sample[1]=1 doesn't decode to `>100 Hz` under any obvious scale |
+| (T_peak, V_peak, L_peak, T_freq, V_freq, L_freq, Geo, M_peak, M_freq) | Sample[1]=1 would decode to V_peak = 0.0005 in/s at the sample[0] scale, but TXT shows V_peak of 0.005 in/s for some intervals and 0.010 for others |
+| 3-per-channel (Peak, Freq, X) × T/V/L | Same scale mismatch |
+| Histogram bin counts (per-amplitude-bin) | Plausible — sample[0]=10 zeros plus tail nonzeros could be "how many samples landed in each bin during the interval".  But then sample[0] = T_peak coincidence is suspicious. |
+
+`>100 Hz` is a sentinel BW writes when the measured zero-crossing
+frequency exceeds the geophone's measurement range.  The binary
+encoding of this sentinel is unknown.  Common candidates:
+- Special value (e.g. 0xFFFF / 0x7FFF / 0)
+- A flag bit in the metadata bytes (especially the 4-byte variable
+  field at [24:28])
+
+### Metadata 4-byte variable field (bytes 24:28)
+
+Examples from the first 8 blocks of N844L20G.630H:
+```
+block 0: 03 90 2a 00
+block 1: 04 f2 84 00
+block 2: 03 2b e7 00
+block 3: 03 fe 11 00
+block 4: 03 f7 91 00
+block 5: 03 e9 4e 00
+block 6: 03 4c 5c 00
+block 7: 03 99 aa 00
+```
+
+First byte is mostly `0x03` (blocks 0,2-7) and sometimes `0x04` (block
+1).  Could be a CRC, timestamp delta, or per-interval status byte.
+Worth correlating against TXT columns that vary block-to-block.
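As a starting point for that correlation, the eight example values above can be tabulated byte by byte. This is only a sketch over the listed examples; the variable names are illustrative, and nothing here says what the field means.

```python
from collections import Counter

# The eight [24:28] values listed above, blocks 0-7 of N844L20G.630H.
META_HEX = [
    "03 90 2a 00",
    "04 f2 84 00",
    "03 2b e7 00",
    "03 fe 11 00",
    "03 f7 91 00",
    "03 e9 4e 00",
    "03 4c 5c 00",
    "03 99 aa 00",
]

meta = [bytes.fromhex(h) for h in META_HEX]

# How often each leading byte occurs across the examples.
first_byte_counts = Counter(m[0] for m in meta)

# Distinct values seen in the last byte.
last_byte_values = {m[3] for m in meta}
```

In these eight examples the last byte is always `0x00`, so only the first three bytes carry variation here; whether that holds across the whole file is untested.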
+
+## Fixture corpus
+
+In-repo histogram fixtures (paired binary + ASCII TXT):
+
+```
+example-events/histogram/N844L20G.630H       (27 KB, 791 blocks, 792 intervals)
+example-events/histogram/N844L21H.2R0H       (22 KB)
+example-events/histogram/N844L22A.VT0H       (27 KB)
+example-events/histogram/N844L23B.ND0H       ...
+example-events/histogram/N844L27U.U30H       ...
+example-events/histogram/N844L28V.NA0H       ...
+example-events/histogram/N844L6QT.IQ0H       ...
+example-events/histogram/N844L6RU.BO0H       ...
+example-events/histogram/N844L6SO.6I0H       ...
+example-events/histogram/N844L6TP.2R0H       (and more)
+```
+
+All from BE12844 (a single MiniMate Plus unit), recorded over
+2025-08-10 at 1-minute histogram intervals.  All "noise floor"
+events — mostly silent intervals with rare spikes.
+
+Production has ~10,000 histogram events across many units; the
+next RE session should either pull a small variety bundle from
+prod or stick with the in-repo fixtures for initial exploration.
+
+## Suggested attack plan for next session
+
+1. **Verify sample[0] = T_peak hypothesis across all 791 blocks
+   of N844L20G.630H** — confirms the scale factor isn't a coincidence.
+2. **Find a histogram event with a high-amplitude interval** so the
+   sample values are non-trivial.  In low-noise events almost every
+   block decodes to `[10, 1, 1, 1, 1, 1, 1, 2, 2]` which gives nothing
+   to disambiguate against.
+3. **Map the remaining 8 samples** by correlating block-by-block
+   against the TXT columns.  Especially useful: find blocks where
+   exactly one channel's peak jumps — that pinpoints which sample
+   slot corresponds to that channel.
+4. **Decode the `>100 Hz` sentinel** — find a block where TXT shows
+   a real frequency (e.g. `73.1 Hz`) and reverse the binary value.
+5. **Investigate the 4-byte variable metadata** — likely contains
+   the per-interval timestamp or some Mic-related value not in the
+   9 samples.
+6. **Wire into `read_blastware_file()`** alongside the waveform
+   codec (try waveform first; fall back to histogram when the
+   `00 02 00` preamble is missing).
+7. **Update `scripts/backfill_sidecars.py`** to remove the
+   `has_samples` short-circuit so histogram `.h5` files regenerate
+   too.
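Step 3 of the plan (find intervals where exactly one channel's peak jumps) could be sketched as follows. The helper name `single_channel_jumps`, the input shape (channel name mapped to its TXT peak column), and the `floor` threshold are all assumptions for illustration.

```python
from typing import Dict, Sequence


def single_channel_jumps(
    txt_peaks: Dict[str, Sequence[float]],
    floor: float,
) -> Dict[int, str]:
    """Map interval index -> channel, for intervals where exactly one
    channel's TXT peak exceeds ``floor`` (the assumed noise-floor level)."""
    n = min(len(vals) for vals in txt_peaks.values())
    hits: Dict[int, str] = {}
    for i in range(n):
        above = [ch for ch, vals in txt_peaks.items() if vals[i] > floor]
        if len(above) == 1:
            hits[i] = above[0]
    return hits
```

For each returned interval, the binary sample slot that departs from its usual noise-floor value is a candidate for that channel's peak.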
+
+## Code seam for the eventual decoder
+
+`minimateplus/histogram_codec.py` (to-be-created) should mirror
+`minimateplus/waveform_codec.py`:
+
+```python
+from typing import Optional
+
+
+def decode_histogram_body(body: bytes) -> Optional[dict]:
+    """Decode a histogram-mode body into per-channel sample arrays.
+
+    Returns ``{"Tran": [...], "Vert": [...], "Long": [...], "MicL": [...]}``
+    with each channel's per-interval peak values in ADC counts.
+    Returns ``None`` if the body cannot be parsed.
+    """
+```
+
+Then in `event_file_io.read_blastware_file()`:
+
+```python
+decoded = decode_waveform_v2(body)
+if decoded is None:
+    # Waveform codec declined the body; try the histogram codec next.
+    decoded = decode_histogram_body(body)
+if decoded is None:
+    log.warning(...)
+    samples = {ch: [] for ch in ("Tran", "Vert", "Long", "MicL")}
+else:
+    samples = decoded_to_adc_counts(decoded)
+```
+
+## Related work
+
+- Waveform body codec — `docs/waveform_codec_re_status.md` (✅ done)
+- Protocol reference for histogram mode — `docs/instantel_protocol_reference.md §7.6.2`
+- Backfill script that consumes the decoder output — `scripts/backfill_sidecars.py`
@@ -811,7 +811,18 @@ def read_blastware_file(path: Union[str, Path]) -> Event:
        project=project, client=client, operator=user, sensor_location=seisloc,
    )
    ev.raw_samples = samples
-    ev.peak_values = _peaks_from_samples(samples)
+    # Only compute peaks from samples when we actually have samples.
+    # For events the codec couldn't decode (histogram-mode bodies, until
+    # the §7.6.2 histogram codec is wired in), samples is an empty dict
+    # and ``_peaks_from_samples`` would return PeakValues(0, 0, 0, 0, 0).
+    # That would then OVERWRITE existing good DB peak values (e.g. from
+    # paired BW ASCII reports) during the backfill UPSERT path.
+    # Leaving peak_values=None signals "we don't know" to downstream
+    # consumers; the backfill script seeds from the DB row when it sees
+    # None, and ``apply_report_to_event`` overlays from a paired ASCII
+    # report when one is supplied.
+    has_samples = any(samples.get(ch) for ch in ("Tran", "Vert", "Long", "MicL"))
+    ev.peak_values = _peaks_from_samples(samples) if has_samples else None
    ev._a5_frames = None  # not recoverable from BW file

    return ev
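The guard added in this hunk relies on Python truthiness: a missing channel and an empty list are both falsy, while any non-empty list is truthy even if its samples are all zero. A standalone sketch of the same predicate, assuming the channels are plain lists (the function name `has_samples` mirrors the local variable but is otherwise hypothetical):

```python
CHANNELS = ("Tran", "Vert", "Long", "MicL")


def has_samples(samples: dict) -> bool:
    """True when at least one channel holds a non-empty sample list."""
    return any(samples.get(ch) for ch in CHANNELS)


# Undecodable body: empty dict, or every channel present but empty.
no_channels = has_samples({})
empty_channels = has_samples({ch: [] for ch in CHANNELS})

# A single zero-valued sample still counts as "has samples".
one_zero_sample = has_samples({"Tran": [0]})
```

If a channel were ever a NumPy array rather than a list, this truthiness test would raise for arrays with more than one element, so an explicit length check would be needed in that case.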
@@ -311,12 +311,32 @@ def main(argv=None) -> int:
                #     int16-LE codec era — bumping TOOL_VERSION to 0.20.0+
                #     marks every pre-codec sidecar stale, which now
                #     correctly cascades to .h5 regeneration too.
+                #
+                # Skip the .h5 write when the decoder couldn't produce
+                # samples — this is the histogram-mode case today
+                # (waveform_codec.decode_waveform_v2 only handles the
+                # waveform-mode body format per §7.6.1; the histogram
+                # codec at §7.6.2 is documented but not yet implemented).
+                # Without this check we'd replace the existing (broken
+                # int16-LE) histogram .h5 with an empty one, which is
+                # arguably worse for any consumer expecting non-empty
+                # sample arrays.  When the histogram codec lands, this
+                # check can come out.
+                has_samples = bool(
+                    ev.raw_samples and any(
+                        ev.raw_samples.get(ch) for ch in ("Tran", "Vert", "Long", "MicL")
+                    )
+                )
                hdf5_path = store.hdf5_path_for(serial, path.name)
                hdf5_filename = hdf5_path.name if hdf5_path.exists() else None
                hdf5_action = "kept"
-                need_h5 = not args.skip_hdf5 and (
-                    args.force or not hdf5_path.exists() or sidecar_stale
+                need_h5 = (
+                    not args.skip_hdf5
+                    and (args.force or not hdf5_path.exists() or sidecar_stale)
+                    and has_samples
                )
+                if not has_samples and not args.skip_hdf5:
+                    hdf5_action = "skipped-empty-samples"
                if need_h5:
                    if args.dry_run:
                        hdf5_action = "would (re)write"
@@ -289,9 +289,15 @@ def test_read_blastware_file_round_trip(tmp_path: Path):
    assert parsed.timestamp.second == ev.timestamp.second
    # No A5 source recoverable.
    assert parsed._a5_frames is None
-    # Peaks computed from samples (synthetic = zero samples → zero peaks).
-    assert parsed.peak_values is not None
-    assert parsed.peak_values.peak_vector_sum == 0.0
+    # The synthetic event has no real waveform body, so the codec can't
+    # decode samples → read_blastware_file leaves peak_values=None
+    # (the "we don't know" signal) rather than fabricating all-zero
+    # peaks that would otherwise overwrite real DB values via UPSERT.
+    assert parsed.peak_values is None
+    assert parsed.raw_samples is not None
+    # Empty channels — codec returned None for the malformed synthetic body.
+    for ch in ("Tran", "Vert", "Long", "MicL"):
+        assert parsed.raw_samples[ch] == []


 _BW_CODEC_FIXTURES = [
Author	SHA1	Message	Date
serversdown	c3c7fe559c	docs: histogram body codec RE — starting-point status doc Captures everything learned in the 2026-05-20 session before scope forced a pause: - Block framing is solved: 32-byte blocks, one per histogram interval, signature byte pattern `[22:24]=0x0000` + `[28:32]=0x1e 0x0a 0x00 0x00` reliably identifies data blocks. - Block count = interval count (791 blocks in N844L20G.630H for a TXT-reported 792 intervals). - Sample[0] = Tran peak in 0.0005 in/s/count units (verified on one event — needs cross-event confirmation). - Samples 1-8 → channel/metric mapping is still open. None of the obvious layouts (peak-then-freq alternating, all-peaks- then-all-freqs, per-channel 3-tuples) match the TXT values across multiple blocks. Likely needs a higher-activity fixture (current N844 corpus is all noise-floor data) to disambiguate. - `>100 Hz` sentinel encoding in the binary is unknown. - 4-byte variable metadata field at block[24:28] needs correlation work against TXT columns. Doc mirrors the structure of docs/waveform_codec_re_status.md so a future RE session has a familiar entry point. Includes the suggested attack plan + the code seam where the eventual decoder will land (minimateplus/histogram_codec.py). The §7.6.2 spec in instantel_protocol_reference.md is structurally correct but doesn't pin down per-sample semantics — this doc supersedes it where they conflict on confidence level. No code shipped on this branch. When the codec is cracked, the plan is to land minimateplus/histogram_codec.py + wire into event_file_io.read_blastware_file() + remove the has_samples short-circuit from scripts/backfill_sidecars.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 21:13:26 +00:00
serversdown	fa9d3cdef2	read_blastware_file: leave peak_values=None when samples can't be decoded Fixes a data-loss bug discovered while dry-running the backfill against the prod store. Symptom: every histogram event in the store has its body decoded by read_blastware_file → codec returns None → samples = empty dict → ``ev.peak_values = _peaks_from_samples(empty)`` returns ``PeakValues(0, 0, 0, 0, 0)`` (NOT None). The backfill script's existing "seed from DB row when peak_values is None" branch then correctly skips the seeding, and the all-zeros PeakValues flows into ``db.insert_events()``'s UPSERT path, OVERWRITING the existing good DB peak values for that event (which were populated from the paired BW ASCII report at ingest). Net effect: running the backfill on prod would have wiped the PPV / mic / vector-sum columns for ~10,000 histogram events. Fix: only compute peaks-from-samples when there are actually samples. For events the codec couldn't decode (histogram-mode bodies, until the §7.6.2 histogram codec is wired in), leave peak_values=None as the "we don't know" signal. Downstream consumers: - backfill_sidecars.py — its existing ``if ev.peak_values is None:`` branch (line 243) seeds from the DB row, preserving the real BW-report peaks across the regen. - WaveformStore.save_imported_bw — apply_report_to_event overlays peaks from the paired BW ASCII report when one was uploaded. Histogram imports without a paired report end up with NULL peaks in the DB, which is correct (better than zeros — clearly says "no peak data available" rather than "peaks are exactly zero"). Updated the existing synthetic-event round-trip test to expect peak_values=None for the no-real-body case, which is the truth now. The 7 fixture-corpus regression tests for real BW waveforms continue to pass — those have decodable samples, so peak_values is still populated from the codec output as before. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 20:30:53 +00:00
serversdown	c4648c1959	scripts/backfill_sidecars: skip .h5 write when decoder returned no samples Discovered while dry-running the backfill on the prod store: ~10,000 of ~10,059 events are histogram-mode (filename extension `H`), and the waveform-body codec wired in via the previous commit doesn't handle histogram-mode bodies: only the waveform-mode codec at §7.6.1 is implemented; the histogram-mode codec at §7.6.2 of the protocol reference is documented but no Python implementation exists yet. Without this guard, every histogram event's .h5 file would be replaced with an empty one, which is strictly worse than today's broken-int16-LE .h5 because any downstream viewer expecting non-empty sample arrays would now error out instead of just rendering wrong values. Fix: after the decoder runs, check whether any channel has samples. If not, skip the .h5 write entirely. The sidecar still regenerates (refreshing the tool_version stamp and any peaks/project info from the DB row), but the existing .h5 is left untouched. This is a temporary gate. When the histogram codec lands (next branch: `feat/wire-histogram-codec`), the has_samples check can be removed and the backfill will then correctly regenerate all .h5 files, histogram and waveform alike. Observed effect (dry-run on prod store, 10,059 events): - waveform events (~5%): "[DRY ] would write … + .h5 (would (re)write)" - histogram events (~95%): "[DRY ] would write … + .h5 (skipped-empty-samples)" - sidecar tool_version bump succeeds for both Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 20:16:31 +00:00
serversdown	0e89125495	docker: fix dockerfile to include scripts and micromate folders	2026-05-20 19:58:54 +00:00