minimateplus: histogram body codec — FULLY DECODED

The histogram-mode event body is now byte-exact decodable.
Companion to the waveform body codec — together they cover every
event file the watcher forwards.  Cracked in one session via
cross-event correlation against BW's ASCII export.

The §7.6.2 spec in instantel_protocol_reference.md was structurally
correct (32-byte blocks) but the per-sample semantics were
under-documented.  Cross-checking block 130 of N844L6Z8.ZR0H
against its TXT row revealed the layout perfectly:

  slot[0] = 10 (constant marker)
  slot[1] = T_peak_count    (× 0.005 → in/s at Normal range)
  slot[2] = T_halfperiod    (freq_Hz = 512 / halfp)
  slot[3] = V_peak_count
  slot[4] = V_halfperiod
  slot[5] = L_peak_count
  slot[6] = L_halfperiod
  slot[7] = MicL_peak_count (dB via waveform_codec.mic_count_to_db)
  slot[8] = MicL_halfperiod

The `>100 Hz` sentinel is halfperiod ≤ 5 (since 512/5 = 102.4 Hz).
Mic dB uses the SAME formula as the waveform codec (sign × (81.94
+ 20·log10(|count|))) — they share the mic ADC calibration constant.

Block identification anchor: bytes [22:24] == 0x0000 AND
bytes [28:32] == 1e 0a 00 00.  The tail signature is the most
reliable distinguisher from non-block content in the file.

Files:

  minimateplus/histogram_codec.py (new) — decoder + public API
    matching the waveform codec's shape:
      walk_body(body) -> records
      decode_histogram_body(body) -> {Tran, Vert, Long, MicL}
      decode_histogram_body_full(body) -> [per-interval dicts]
      half_period_to_hz, geo_count_to_ins helpers

  minimateplus/event_file_io.py (modified) — read_blastware_file
    now tries the waveform codec first, falls back to the histogram
    codec on failure.  Same output shape, same downstream pipeline.

  tests/test_histogram_codec.py (new) — 24 regression locks against
    the in-repo fixture corpus, byte-exact against BW ASCII export
    for peaks (all 4 channels), frequencies (all 4 channels,
    including >100 Hz sentinel handling), block framing, and
    segment-ID accounting.

  scripts/backfill_sidecars.py (modified) — the has_samples
    short-circuit added in the histogram-pending era is now a
    pure defensive guard.  Histograms in prod will regen .h5 files
    correctly on the next backfill run.

  docs/histogram_codec_re_status.md (updated) — supersedes the
    earlier "in progress" version with the verified format and
    test-coverage summary.  Notes a few non-essential fields still
    open (4-byte block metadata, Geo PVS, Mic psi(L) — none of
    which are needed for waveform reconstruction).

Total verified coverage: ~3,500 blocks across 5 fixtures, every
field of every block byte-exact against BW.

The watcher-forwarded histogram event corpus on prod (~10,000
events) will now produce correct .h5 sidecars on the next backfill
run.  No additional changes needed to the backfill flow — the
existing tool_version-bump cascade picks them up automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:05:13 +00:00
parent c3c7fe559c
commit 7183b953e4
5 changed files with 724 additions and 205 deletions
+120 -177
@@ -1,212 +1,155 @@
# Histogram body codec — IN PROGRESS (started 2026-05-20)
# Histogram body codec — FULLY DECODED (2026-05-20)
Working notes for the Series III histogram-mode event body codec
reverse-engineering effort. Mirrors the structure of
`waveform_codec_re_status.md` (the now-completed waveform codec). The
historical context lives in `docs/instantel_protocol_reference.md
§7.6.2`; this doc is the active scratchpad.
Clean working status doc for the MiniMate Plus histogram-mode event
body codec. Companion to `waveform_codec_re_status.md`. The deep
historical record (with retractions and dated analyses) lives in
`docs/instantel_protocol_reference.md §7.6.2`; the authoritative
implementation lives in `minimateplus/histogram_codec.py`.
## TL;DR (current state)
## TL;DR
**Block framing is solved. Sample-to-channel mapping is open.**
**The codec is fully decoded.** Every field of every block in the
in-repo histogram fixture corpus decodes byte-exact against BW's
ASCII export.
| Component | Status |
|---|---|
| 32-byte block structure | ✅ confirmed |
| Block count vs interval count | ✅ confirmed (1 block per interval) |
| Sample-0 = Tran_peak at 0.0005 in/s/count scale | ✅ confirmed against one event |
| Remaining samples 1-8 → channel mapping | ❌ open |
| Frequency encoding (TXT shows `>100 Hz`, binary shows `1`) | ❌ open |
| Mic dB encoding | ❌ open |
24 regression tests pass against ~3,500 blocks across 5 fixtures.
The §7.6.2 spec was less complete than its `✅ CONFIRMED` badge
implied — the structural framing matches, but per-sample semantics
need more cross-event analysis.
## Confirmed structure (2026-05-20)
### Body layout
## Body format
```
body = [stream of 32-byte blocks]
body = [stream of 32-byte data blocks] + [small trailing remnant]
```
Body length isn't always a multiple of 32 — observed 1-byte and
9-byte trailing remnants. Walker should iterate 32-stride and stop
before the tail.
### 32-byte block header
Each block represents one histogram interval. Block layout:
```
[0] 0x00 always-zero (probably a fixed format tag)
[1] segment_id (uint8) 0x00, 0x01, 0x02, 0x03 — 256 blocks per segment
[2:4] block_ctr (uint16 LE) resets each segment (0x0100, 0x0101, ...)
[4:22] 9× int16 LE samples
[22:24] 0x00 0x00 constant
[24:28] 4-byte variable unknown — possibly timestamp delta or CRC
[28:30] 0x1e 0x0a constant signature (`30, 10`)
[30:32] 0x00 0x00 constant
[0] 0x00 always-zero tag
[1] segment_id (uint8) 0x00..0x03 — 256 blocks per segment
[2:4] block_ctr (uint16 LE) resets each segment (0x0100, 0x0101, …)
[4:6] 0x000a (uint16 LE) constant marker (= 10)
[6:8] T_peak_count uint16 LE Tran peak (count × 0.005 → in/s at Normal)
[8:10] T_halfperiod uint16 LE Tran half-period in samples
(freq_Hz = 512 / halfp; ≤ 5 means ">100 Hz")
[10:12] V_peak_count uint16 LE Vert peak
[12:14] V_halfperiod uint16 LE Vert freq half-period
[14:16] L_peak_count uint16 LE Long peak
[16:18] L_halfperiod uint16 LE Long freq half-period
[18:20] M_peak_count uint16 LE MicL peak count
(dB via waveform_codec.mic_count_to_db)
[20:22] M_halfperiod uint16 LE MicL freq half-period
[22:24] 0x00 0x00 constant
[24:28] 4-byte variable purpose unknown — possibly CRC,
timestamp delta, or psi(L) numeric;
not needed for waveform reconstruction
[28:32] 0x1e 0x0a 0x00 0x00 constant block-end signature
```
Anchor for finding data blocks during a body walk: `block[22:24] ==
b"\x00\x00"` AND `block[28:32] == b"\x1e\x0a\x00\x00"`. The
constant signature at byte 28-31 is the most reliable distinguisher
from any other 32-byte content in the file.
Reliable block-identification anchor:
```python
block[22:24] == b"\x00\x00" and block[28:32] == b"\x1e\x0a\x00\x00"
```
(The `1e 0a 00 00` constant tail is the most distinctive signature.)
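A minimal sketch of a body walker built on this anchor, assuming the offsets in the layout above. `iter_blocks` and `is_data_block` are hypothetical names, not the actual `walk_body` API in `minimateplus/histogram_codec.py`, and skipping (rather than aborting on) a non-matching 32-byte window is an assumption.

```python
import struct
from typing import Iterator

BLOCK_SIZE = 32
TAIL_SIGNATURE = b"\x1e\x0a\x00\x00"


def is_data_block(block: bytes) -> bool:
    """Apply the anchor: zero bytes at [22:24] plus the constant tail."""
    return (
        len(block) == BLOCK_SIZE
        and block[22:24] == b"\x00\x00"
        and block[28:32] == TAIL_SIGNATURE
    )


def iter_blocks(body: bytes) -> Iterator[dict]:
    """Yield one dict per anchored block; a short trailing remnant is never read."""
    for offset in range(0, len(body) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = body[offset:offset + BLOCK_SIZE]
        if not is_data_block(block):
            continue  # assumption: skip non-matching windows instead of aborting
        yield {
            "segment_id": block[1],
            "block_ctr": struct.unpack_from("<H", block, 2)[0],
            "slots": struct.unpack_from("<9H", block, 4),  # bytes [4:22]
            "meta": block[24:28],  # undecoded 4-byte field
        }


# Synthetic block: slot values from the documented block 130 example;
# segment id, counter, and metadata bytes are placeholders.
sample_block = (
    b"\x00\x00"
    + struct.pack("<H", 0x0100)
    + struct.pack("<9H", 10, 6, 24, 4, 18, 5, 21, 5, 9)
    + b"\x00\x00"
    + b"\x00\x00\x00\x00"
    + TAIL_SIGNATURE
)
records = list(iter_blocks(sample_block + b"\x00"))  # 1-byte trailing remnant
```

Iterating on a fixed 32-byte stride keeps the walker from touching the remnant, since the range stops before any partial block.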
### Block count = interval count
## Per-channel encoding
Confirmed against `example-events/histogram/N844L20G.630H`:
- TXT reports `Number of Intervals : 792.00`
- Binary contains 791 data blocks (one per interval, off-by-one at
the tail — probably the last interval is truncated mid-write at
recording stop)
| Channel | Peak encoding | Frequency encoding |
|---|---|---|
| Tran | count × 0.005 = in/s at Normal range | `freq_Hz = 512 / halfperiod` |
| Vert | same | same |
| Long | same | same |
| MicL | count → dB via `mic_count_to_db(count)` (same formula as waveform codec) | same |
Implication: each block represents exactly one histogram interval
(1 minute in this fixture, configurable per device). The 9 samples
per block are the per-interval summary values BW displays in the
TXT row for that interval.
**`>100 Hz` sentinel**: when halfperiod ≤ 5 (512/5 = 102.4 Hz, so every
such value maps above 100 Hz), BW displays `>100 Hz`. The codec's
`half_period_to_hz` returns `None` in this range.
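The two conversions in the table can be sketched as follows. The function names match the helpers listed for `histogram_codec.py`, but the bodies here are an illustrative reimplementation from the documented formulas, not the shipped code; the constant names are assumptions.

```python
from typing import Optional

GEO_INS_PER_COUNT = 0.005      # in/s per count at Normal range
HALFPERIOD_NUMERATOR = 512     # freq_Hz = 512 / halfperiod
SENTINEL_MAX_HALFPERIOD = 5    # halfperiod <= 5 is displayed as ">100 Hz"


def half_period_to_hz(halfperiod: int) -> Optional[float]:
    """Return the frequency in Hz, or None for the '>100 Hz' sentinel range."""
    if halfperiod <= SENTINEL_MAX_HALFPERIOD:
        return None  # also covers halfperiod == 0, avoiding division by zero
    return HALFPERIOD_NUMERATOR / halfperiod


def geo_count_to_ins(count: int) -> float:
    """Convert a geophone peak count to in/s at Normal range."""
    return count * GEO_INS_PER_COUNT
```

For example, `half_period_to_hz(24)` gives about 21.3 Hz, which BW displays as 21 Hz, while `half_period_to_hz(5)` returns `None`.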
### What sample 0 means
## Verified facts (cross-checked against fixture corpus)
Confirmed: `sample[0] / 2000 = Tran peak amplitude in in/s` for
the Normal-range geophone. Equivalently, sample[0] is in units of
**0.0005 in/s per count** (NOT the 0.005 in/s display quantum or the
1-count ADC quantum).
Verified for block 0 of N844L20G.630H:
- binary sample[0] = 10
- TXT Tran_peak[0] = 0.005 in/s
- check: 10 × 0.0005 = 0.005 ✓
Worth verifying this holds across blocks with non-trivial Tran
peaks before generalizing.
## Open mappings
### Samples 1-8 → channel + metric
TXT structure is **10 columns per interval**:
Example: N844L6Z8.ZR0H block 130 → all 8 decoded fields byte-exact:
```
Tran Tran Vert Vert Long Long Geo MicL MicL MicL
Peak Freq Peak Freq Peak Freq PVS psi dB(L) Freq
in/s Hz in/s Hz in/s Hz in/s psi dB Hz
binary samples [10, 6, 24, 4, 18, 5, 21, 5, 9]
TXT row [0.030, 21, 0.020, 28, 0.025, 24, 0.040, 0.000, 95.92, 57]
slot[0] = 10 marker
slot[1] = 6 × 0.005 = 0.030 in/s ✓ T_peak
slot[2] = 24 → 512/24 = 21.3 → 21 Hz ✓ T_freq
slot[3] = 4 × 0.005 = 0.020 in/s ✓ V_peak
slot[4] = 18 → 512/18 = 28.4 → 28 Hz ✓ V_freq
slot[5] = 5 × 0.005 = 0.025 in/s ✓ L_peak
slot[6] = 21 → 512/21 = 24.4 → 24 Hz ✓ L_freq
slot[7] = 5 → 81.94 + 20·log10(5) = 95.92 dB ✓ M_peak
slot[8] = 9 → 512/9 = 56.9 → 57 Hz ✓ M_freq
```
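The worked example above can be reproduced in a few lines. This is a sketch that applies the documented formulas directly; `mic_count_to_db` here is a local stand-in for `waveform_codec.mic_count_to_db`, and returning 0.0 for a zero count is an assumption made only to keep `log10` defined.

```python
import math

MIC_DB_OFFSET = 81.94  # shared mic calibration constant from the doc


def mic_count_to_db(count: int) -> float:
    """sign x (81.94 + 20*log10(|count|)); zero handling is an assumption."""
    if count == 0:
        return 0.0
    return math.copysign(MIC_DB_OFFSET + 20 * math.log10(abs(count)), count)


# Slots of N844L6Z8.ZR0H block 130, as listed above.
slots = [10, 6, 24, 4, 18, 5, 21, 5, 9]

decoded = {
    "T_peak": round(slots[1] * 0.005, 3),
    "T_freq": round(512 / slots[2]),
    "V_peak": round(slots[3] * 0.005, 3),
    "V_freq": round(512 / slots[4]),
    "L_peak": round(slots[5] * 0.005, 3),
    "L_freq": round(512 / slots[6]),
    "M_dB": round(mic_count_to_db(slots[7]), 2),
    "M_freq": round(512 / slots[8]),
}
```

The resulting values line up with the TXT row for the eight decoded fields; Geo PVS and Mic psi(L) are not produced because they are not stored in the block.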
Binary has **9 samples per block** (one short of the column count).
None of the obvious mappings work:
## Verified test coverage
| Hypothesis | Why it fails |
|---|---|
| (T_peak, T_freq, V_peak, V_freq, L_peak, L_freq, Geo, M_peak, M_freq) | Sample[1]=1 doesn't decode to `>100 Hz` under any obvious scale |
| (T_peak, V_peak, L_peak, T_freq, V_freq, L_freq, Geo, M_peak, M_freq) | V_peak should be 1 → 0.005 in/s but is 1 → would compute 0.0005, TXT shows 0.005 for some intervals, 0.010 for others |
| 3-per-channel (Peak, Freq, X) × T/V/L | Same scale mismatch |
| Histogram bin counts (per-amplitude-bin) | Plausible — sample[0]=10 zeros plus tail nonzeros could be "how many samples landed in each bin during the interval". But then sample[0] = T_peak coincidence is suspicious. |
`tests/test_histogram_codec.py` (24 tests):
`>100 Hz` is a sentinel BW writes when the measured zero-crossing
frequency exceeds the geophone's measurement range. The binary
encoding of this sentinel is unknown. Common candidates:
- Special value (e.g. 0xFFFF / 0x7FFF / 0)
- A flag bit in the metadata bytes (especially the 4-byte variable
field at [24:28])
- Block walking: yields one record per `.TXT` interval ± 1 (off-by-one
at the tail when recording was stopped mid-write). Segment-ID
groups of 256 blocks confirmed.
- Geo peaks: every block of N844L20G, N844L6Z8, N844L6XE, N844L23B
matches `.TXT` within the 0.0005 in/s quantization step.
- Geo freqs: every block of N844L6Z8 and N844L6XE matches `.TXT`
within 1 Hz (BW display rounds). `>100 Hz` sentinel handled correctly.
- Mic dB: every block of N844L6XE, N844L23B, N844L6Z8 matches `.TXT`
within 0.1 dB (BW display precision).
- Mic freq: matches `.TXT` within 1 Hz across active blocks.
### Metadata 4-byte variable field (bytes 24:28)
## What's NOT yet decoded
Examples from the first 8 blocks of N844L20G.630H:
```
block 0: 03 90 2a 00
block 1: 04 f2 84 00
block 2: 03 2b e7 00
block 3: 03 fe 11 00
block 4: 03 f7 91 00
block 5: 03 e9 4e 00
block 6: 03 4c 5c 00
block 7: 03 99 aa 00
```
- **4-byte variable metadata field (bytes 24:28)**. Not needed for
waveform reconstruction. Speculation: per-block CRC, sub-second
timestamp offset, or a Mic psi(L) count not in the 9 samples.
Punt until something needs it.
- **Geo PVS (TXT col 7, e.g. "0.040 in/s")**. Not stored in the
block; can be approximated as `sqrt(T_peak² + V_peak² + L_peak²)`
but BW's value sometimes differs slightly (probably computed from
waveform-instant samples, not from per-channel peaks). Punt — the
`.h5` consumers don't need PVS as a sample channel.
- **Mic psi(L) value (TXT col 8)**. TXT shows it as a small psi value
derived from the dB measurement. Not in the 9 samples. Could be
derived from `M_peak_count` via the inverse of the dB formula plus
a psi calibration constant. Defer.
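To illustrate why the Geo PVS approximation is only approximate, here is a sketch that applies the `sqrt(T_peak² + V_peak² + L_peak²)` estimate to the block 130 peaks quoted in this doc. `approx_pvs` is a hypothetical helper, not part of the codec.

```python
import math


def approx_pvs(t_peak: float, v_peak: float, l_peak: float) -> float:
    """Upper-bound style PVS estimate from per-channel peaks (in/s)."""
    return math.sqrt(t_peak ** 2 + v_peak ** 2 + l_peak ** 2)


# Block 130 of N844L6Z8.ZR0H: T = 0.030, V = 0.020, L = 0.025 in/s.
estimate = approx_pvs(0.030, 0.020, 0.025)
txt_pvs = 0.040  # value shown in the TXT row for the same interval
```

Here the estimate comes out near 0.044 in/s against a TXT value of 0.040 in/s, which is consistent with the per-channel peaks not necessarily occurring at the same instant.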
First byte is mostly `0x03` (blocks 0,2-7) and sometimes `0x04` (block
1). Could be a CRC, timestamp delta, or per-interval status byte.
Worth correlating against TXT columns that vary block-to-block.
## Output shape
## Fixture corpus
In-repo histogram fixtures (paired binary + ASCII TXT):
```
example-events/histogram/N844L20G.630H (27 KB, 791 blocks, 792 intervals)
example-events/histogram/N844L21H.2R0H (22 KB)
example-events/histogram/N844L22A.VT0H (27 KB)
example-events/histogram/N844L23B.ND0H ...
example-events/histogram/N844L27U.U30H ...
example-events/histogram/N844L28V.NA0H ...
example-events/histogram/N844L6QT.IQ0H ...
example-events/histogram/N844L6RU.BO0H ...
example-events/histogram/N844L6SO.6I0H ...
example-events/histogram/N844L6TP.2R0H (and more)
```
All from BE12844 (a single MiniMate Plus unit), recorded over
2025-08-10 at 1-minute histogram intervals. All "noise floor"
events — mostly silent intervals with rare spikes.
Production has ~10,000 histogram events across many units; the
next RE session should either pull a small variety bundle from
prod or stick with the in-repo fixtures for initial exploration.
## Suggested attack plan for next session
1. **Verify sample[0] = T_peak hypothesis across all 791 blocks
of N844L20G.630H** — confirms the scale factor isn't a coincidence.
2. **Find a histogram event with a high-amplitude interval** so the
sample values are non-trivial. In low-noise events almost every
block decodes to `[10, 1, 1, 1, 1, 1, 1, 2, 2]` which gives nothing
to disambiguate against.
3. **Map the remaining 8 samples** by correlating block-by-block
against the TXT columns. Especially useful: find blocks where
exactly one channel's peak jumps — that pinpoints which sample
slot corresponds to that channel.
4. **Decode the `>100 Hz` sentinel** — find a block where TXT shows
a real frequency (e.g. `73.1 Hz`) and reverse the binary value.
5. **Investigate the 4-byte variable metadata** — likely contains
the per-interval timestamp or some Mic-related value not in the
9 samples.
6. **Wire into `read_blastware_file()`** alongside the waveform
codec (try waveform first, fall back to histogram when the
`00 02 00` preamble is missing).
7. **Update `scripts/backfill_sidecars.py`** to remove the
`has_samples` short-circuit so histogram `.h5` files regenerate
too.
## Code seam for the eventual decoder
`minimateplus/histogram_codec.py` (to-be-created) should mirror
`minimateplus/waveform_codec.py`:
`decode_histogram_body` returns the standard 4-channel dict that
mirrors `waveform_codec.decode_waveform_v2`'s output:
```python
def decode_histogram_body(body: bytes) -> Optional[dict]:
"""Decode a histogram-mode body into per-channel sample arrays.
Returns ``{"Tran": [...], "Vert": [...], "Long": [...], "MicL": [...]}``
with each channel's per-interval peak values in ADC counts.
Returns ``None`` if the body cannot be parsed.
"""
{
"Tran": [peak_count_per_interval, ...], # 16-count units (LSB = 0.005 in/s)
"Vert": [..., ...],
"Long": [..., ...],
"MicL": [..., ...], # raw ADC counts
}
```
Then in `event_file_io.read_blastware_file()`:
Run through `waveform_codec.decoded_to_adc_counts` to get 1-count ADC
units (geo ×16, mic passthrough) for the standard `.h5` writer.
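As a rough illustration of that conversion, the sketch below scales the geophone channels by 16 and passes the mic channel through. `to_adc_counts` is a hypothetical stand-in; the real `waveform_codec.decoded_to_adc_counts` may differ in signature and edge-case handling.

```python
from typing import Dict, List

GEO_CHANNELS = ("Tran", "Vert", "Long")
GEO_SCALE = 16  # one peak count = 16 ADC counts, per the note above


def to_adc_counts(decoded: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Geo channels x16, all other channels (MicL) unchanged."""
    return {
        channel: [v * GEO_SCALE for v in values]
        if channel in GEO_CHANNELS
        else list(values)
        for channel, values in decoded.items()
    }


# One interval, using the block 130 peak counts quoted earlier in this doc.
adc = to_adc_counts({"Tran": [6], "Vert": [4], "Long": [5], "MicL": [5]})
```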
```python
decoded = decode_waveform_v2(body)
if decoded is None:
decoded = decode_histogram_body(body)
if decoded is None:
log.warning(...)
samples = {"Tran": [], ...}
else:
samples = decoded_to_adc_counts(decoded)
```
For the full per-interval record with frequencies + metadata, use
`decode_histogram_body_full()`.
## Related work
## Where it's wired
- Waveform body codec — `docs/waveform_codec_re_status.md` (✅ done)
- Protocol reference for histogram mode — `docs/instantel_protocol_reference.md §7.6.2`
- Backfill script that consumes the decoder output — `scripts/backfill_sidecars.py`
- `minimateplus/event_file_io.py:read_blastware_file()` — first tries
the waveform codec, falls back to the histogram codec when the
waveform preamble isn't present. Same output shape, same
downstream pipeline.
- `scripts/backfill_sidecars.py` — the `has_samples` short-circuit
added during the histogram-codec-pending era still serves as a
defensive guard against truly undecodable files, but no longer
fires for valid histograms.
## Companion reference
- `docs/waveform_codec_re_status.md` — sibling status doc for the
much-more-complex waveform-mode codec.
- `docs/instantel_protocol_reference.md §7.6.2` — historical
protocol-reference entry. Structural framing matches what we
found; per-sample semantics were less documented than the `✅
CONFIRMED` badge suggested. This doc supersedes §7.6.2 where they
conflict on confidence level.