7183b953e4
The histogram-mode event body is now byte-exact decodable.

Companion to the waveform body codec; together they cover every
event file the watcher forwards. Cracked in one session via
cross-event correlation against BW's ASCII export.

The §7.6.2 spec in instantel_protocol_reference.md was structurally
correct (32-byte blocks) but the per-sample semantics were
under-documented. Cross-checking block 130 of N844L6Z8.ZR0H
against its TXT row revealed the full layout:

  slot[0] = 10 (constant marker)
  slot[1] = T_peak_count (× 0.005 → in/s at Normal range)
  slot[2] = T_halfperiod (freq_Hz = 512 / halfp)
  slot[3] = V_peak_count
  slot[4] = V_halfperiod
  slot[5] = L_peak_count
  slot[6] = L_halfperiod
  slot[7] = MicL_peak_count (dB via waveform_codec.mic_count_to_db)
  slot[8] = MicL_halfperiod

The `>100 Hz` sentinel is halfperiod ≤ 5 (512/5 = 102.4 Hz, so any
half-period of 5 or less maps to 100 Hz or more).
Mic dB uses the SAME formula as the waveform codec (sign × (81.94
+ 20·log10(|count|))); they share the mic ADC calibration constant.

Block identification anchor: bytes [22:24] == 0x0000 AND
bytes [28:32] == 1e 0a 00 00. The tail signature is the most
reliable distinguisher from non-block content in the file.

Files:

  minimateplus/histogram_codec.py (new): decoder + public API
  matching the waveform codec's shape:
    walk_body(body) -> records
    decode_histogram_body(body) -> {Tran, Vert, Long, MicL}
    decode_histogram_body_full(body) -> [per-interval dicts]
    half_period_to_hz, geo_count_to_ins helpers

  minimateplus/event_file_io.py (modified): read_blastware_file
  now tries the waveform codec first, falls back to the histogram
  codec on failure. Same output shape, same downstream pipeline.

  tests/test_histogram_codec.py (new): 24 regression locks against
  the in-repo fixture corpus, byte-exact against BW ASCII export
  for peaks (all 4 channels), frequencies (all 4 channels,
  including >100 Hz sentinel handling), block framing, and
  segment-ID accounting.

  scripts/backfill_sidecars.py (modified): the has_samples
  short-circuit added in the histogram-pending era is now a
  pure defensive guard. Histograms in prod will regen .h5 files
  correctly on the next backfill run.

  docs/histogram_codec_re_status.md (updated): supersedes the
  earlier "in progress" version with the verified format and
  test-coverage summary. Notes a few non-essential fields still
  open (4-byte block metadata, Geo PVS, Mic psi(L)), none of
  which are needed for waveform reconstruction.

Total verified coverage: ~3,500 blocks across 5 fixtures, every
field of every block byte-exact against BW.

The watcher-forwarded histogram event corpus on prod (~10,000
events) will now produce correct .h5 sidecars on the next backfill
run. No additional changes needed to the backfill flow; the
existing tool_version-bump cascade picks them up automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

156 lines
6.2 KiB
Markdown
# Histogram body codec: FULLY DECODED (2026-05-20)

Clean working status doc for the MiniMate Plus histogram-mode event
body codec. Companion to `waveform_codec_re_status.md`. The deep
historical record (with retractions and dated analyses) lives in
`docs/instantel_protocol_reference.md §7.6.2`; the authoritative
implementation lives in `minimateplus/histogram_codec.py`.

## TL;DR

**The codec is fully decoded.** Every peak and frequency field of
every block in the in-repo histogram fixture corpus decodes
byte-exact against BW's ASCII export. The only open items are the
non-essential fields listed under "What's NOT yet decoded".

24 regression tests pass against ~3,500 blocks across 5 fixtures.

## Body format

```
body = [stream of 32-byte data blocks] + [small trailing remnant]
```

Each block represents one histogram interval. Block layout:

```
[0]      0x00                         always-zero tag
[1]      segment_id     (uint8)       0x00..0x03; 256 blocks per segment
[2:4]    block_ctr      (uint16 LE)   resets each segment (0x0100, 0x0101, …)
[4:6]    0x000a         (uint16 LE)   constant marker (= 10)
[6:8]    T_peak_count   uint16 LE     Tran peak (count × 0.005 → in/s at Normal)
[8:10]   T_halfperiod   uint16 LE     Tran half-period in samples
                                      (freq_Hz = 512 / halfp; ≤ 5 means ">100 Hz")
[10:12]  V_peak_count   uint16 LE     Vert peak
[12:14]  V_halfperiod   uint16 LE     Vert freq half-period
[14:16]  L_peak_count   uint16 LE     Long peak
[16:18]  L_halfperiod   uint16 LE     Long freq half-period
[18:20]  M_peak_count   uint16 LE     MicL peak count
                                      (dB via waveform_codec.mic_count_to_db)
[20:22]  M_halfperiod   uint16 LE     MicL freq half-period
[22:24]  0x00 0x00                    constant
[24:28]  4-byte variable              purpose unknown; possibly CRC,
                                      timestamp delta, or psi(L) numeric;
                                      not needed for waveform reconstruction
[28:32]  0x1e 0x0a 0x00 0x00          constant block-end signature
```

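The layout above maps directly onto a fixed-size `struct` format. The following is a minimal sketch, not the code in `minimateplus/histogram_codec.py`: the names `BLOCK_STRUCT`, `HistogramBlock`, and `parse_block` are hypothetical, and the example block is synthetic (slot values borrowed from the block 130 example in this doc; `segment_id`, `block_ctr`, and the four metadata bytes are made up).

```python
import struct
from typing import NamedTuple, Tuple

# Format string derived from the layout table (an assumption, not repo code):
# tag u8, segment_id u8, block_ctr u16, nine u16 slots,
# 2 constant bytes, 4 opaque bytes, 4-byte tail.
# "<" selects little-endian with no padding, so the total is 32 bytes.
BLOCK_STRUCT = struct.Struct("<BBH9H2s4s4s")
TAIL = b"\x1e\x0a\x00\x00"


class HistogramBlock(NamedTuple):
    segment_id: int
    block_ctr: int
    slots: Tuple[int, ...]  # slots[0] is the constant marker 10
    meta: bytes             # bytes [24:28], purpose unknown


def parse_block(block: bytes) -> HistogramBlock:
    """Split one 32-byte block into named fields, checking the constants."""
    if len(block) != BLOCK_STRUCT.size:
        raise ValueError(f"expected {BLOCK_STRUCT.size} bytes, got {len(block)}")
    fields = BLOCK_STRUCT.unpack(block)
    tag, segment_id, block_ctr = fields[:3]
    slots = fields[3:12]
    zeros, meta, tail = fields[12:]
    if tag != 0 or slots[0] != 10 or zeros != b"\x00\x00" or tail != TAIL:
        raise ValueError("constant fields do not match the block layout")
    return HistogramBlock(segment_id, block_ctr, slots, meta)


# Synthetic block: only the nine slot values come from this doc.
example = BLOCK_STRUCT.pack(
    0, 0, 130, 10, 6, 24, 4, 18, 5, 21, 5, 9,
    b"\x00\x00", b"\x00" * 4, TAIL,
)
parsed = parse_block(example)
print(parsed.slots)
```

Keeping the unknown bytes [24:28] as raw `bytes` avoids committing to an interpretation before the field is understood.
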
Reliable block-identification anchor:

```python
block[22:24] == b"\x00\x00" and block[28:32] == b"\x1e\x0a\x00\x00"
```

(The `1e 0a 00 00` constant tail is the most distinctive signature.)

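One way to use the anchor is to walk the body in 32-byte strides and stop at the first chunk that fails the test, treating everything after it as the trailing remnant. This is a sketch under that assumption; the real `walk_body` may resynchronise differently, and `is_histogram_block` and `iter_blocks` are hypothetical names.

```python
from typing import Iterator, Tuple

BLOCK_SIZE = 32
ZERO_PAIR = b"\x00\x00"
TAIL = b"\x1e\x0a\x00\x00"


def is_histogram_block(block: bytes) -> bool:
    """Apply the anchor test to one candidate 32-byte chunk."""
    return (
        len(block) == BLOCK_SIZE
        and block[22:24] == ZERO_PAIR
        and block[28:32] == TAIL
    )


def iter_blocks(body: bytes) -> Iterator[Tuple[int, bytes]]:
    """Yield (offset, block) pairs until a chunk fails the anchor test.

    Assumes blocks start at offset 0 and are contiguous, per the
    body format description; the remainder is left unread.
    """
    for offset in range(0, len(body), BLOCK_SIZE):
        chunk = body[offset:offset + BLOCK_SIZE]
        if not is_histogram_block(chunk):
            break
        yield offset, chunk


# Synthetic body: two blocks that satisfy the anchor, then 5 remnant bytes.
fake_block = bytes(22) + ZERO_PAIR + bytes(4) + TAIL
body = fake_block * 2 + b"\xff" * 5
offsets = [offset for offset, _ in iter_blocks(body)]
print(offsets)
```
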
## Per-channel encoding

| Channel | Peak encoding | Frequency encoding |
|---|---|---|
| Tran | count × 0.005 = in/s at Normal range | `freq_Hz = 512 / halfperiod` |
| Vert | same | same |
| Long | same | same |
| MicL | count → dB via `mic_count_to_db(count)` (same formula as waveform codec) | same |

**`>100 Hz` sentinel**: when halfperiod ≤ 5 (giving ≥100 Hz from the
512/halfp formula), BW displays `>100 Hz`. The codec's
`half_period_to_hz` returns `None` in this range.

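The three conversions are small enough to sketch in full. The functions below are stand-alone illustrations with hypothetical names (`geo_in_per_s`, `halfperiod_hz`, `mic_db`), not the repo's `geo_count_to_ins`, `half_period_to_hz`, or `mic_count_to_db`. The 81.94 dB offset and the sign handling follow the formula quoted for the waveform codec; returning 0.0 for a zero mic count is an assumption made here to avoid `log10(0)`.

```python
import math
from typing import Optional

GEO_IN_S_PER_COUNT = 0.005     # Normal range
FREQ_NUMERATOR = 512.0         # freq_Hz = 512 / halfperiod
SENTINEL_MAX_HALFPERIOD = 5    # halfperiod <= 5 is shown as ">100 Hz"
MIC_DB_OFFSET = 81.94


def geo_in_per_s(count: int) -> float:
    """Geophone peak count to in/s at Normal range."""
    return count * GEO_IN_S_PER_COUNT


def halfperiod_hz(halfperiod: int) -> Optional[float]:
    """Half-period in samples to Hz, or None for the >100 Hz sentinel."""
    if halfperiod <= SENTINEL_MAX_HALFPERIOD:
        return None  # also covers halfperiod == 0, so no division by zero
    return FREQ_NUMERATOR / halfperiod


def mic_db(count: int) -> float:
    """sign * (81.94 + 20*log10(|count|)); 0.0 for count == 0 (assumption)."""
    if count == 0:
        return 0.0
    return math.copysign(1.0, count) * (MIC_DB_OFFSET + 20.0 * math.log10(abs(count)))


print(geo_in_per_s(6), halfperiod_hz(24), halfperiod_hz(5), mic_db(5))
```

Returning `None` for the sentinel, instead of a capped number such as 100.0, keeps "frequency too high to report" distinguishable from a real measurement downstream.
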
## Verified facts (cross-checked against fixture corpus)

Example: N844L6Z8.ZR0H block 130 → all 8 decoded fields byte-exact:

```
binary samples  [10, 6, 24, 4, 18, 5, 21, 5, 9]
TXT row         [0.030, 21, 0.020, 28, 0.025, 24, 0.040, 0.000, 95.92, 57]

slot[0] = 10                                       marker
slot[1] = 6  × 0.005 = 0.030 in/s               ✓  T_peak
slot[2] = 24 → 512/24 = 21.3 → 21 Hz            ✓  T_freq
slot[3] = 4  × 0.005 = 0.020 in/s               ✓  V_peak
slot[4] = 18 → 512/18 = 28.4 → 28 Hz            ✓  V_freq
slot[5] = 5  × 0.005 = 0.025 in/s               ✓  L_peak
slot[6] = 21 → 512/21 = 24.4 → 24 Hz            ✓  L_freq
slot[7] = 5  → 81.94 + 20·log10(5) = 95.92 dB   ✓  M_peak
slot[8] = 9  → 512/9 = 56.9 → 57 Hz             ✓  M_freq
```

## Verified test coverage

`tests/test_histogram_codec.py` (24 tests):

- Block walking: yields one record per `.TXT` interval ± 1 (off-by-one
  at the tail when recording was stopped mid-write). Segment-ID
  groups of 256 blocks confirmed.
- Geo peaks: every block of N844L20G, N844L6Z8, N844L6XE, N844L23B
  matches `.TXT` within 0.0005 in/s, well inside the 0.005 in/s
  count step.
- Geo freqs: every block of N844L6Z8 and N844L6XE matches `.TXT`
  within 1 Hz (BW display rounds). `>100 Hz` sentinel handled correctly.
- Mic dB: every block of N844L6XE, N844L23B, N844L6Z8 matches `.TXT`
  within 0.1 dB (BW display precision).
- Mic freq: matches `.TXT` within 1 Hz across active blocks.

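The segment-ID accounting check can be pictured as a count of blocks per `segment_id`. This sketch assumes segments fill in ascending order and that only the last one may be partial; neither assumption is stated by the tests themselves, and `segments_are_consistent` is a hypothetical helper, not something in `tests/test_histogram_codec.py`.

```python
from collections import Counter
from typing import Iterable

BLOCKS_PER_SEGMENT = 256


def segments_are_consistent(segment_ids: Iterable[int]) -> bool:
    """True if every segment but the last holds exactly 256 blocks.

    Assumption: the final segment may be short because recording
    stopped before it filled.
    """
    counts = Counter(segment_ids)
    if not counts:
        return False
    ordered = sorted(counts)
    full_ok = all(counts[seg] == BLOCKS_PER_SEGMENT for seg in ordered[:-1])
    last_ok = counts[ordered[-1]] <= BLOCKS_PER_SEGMENT
    return full_ok and last_ok


# Synthetic ID streams, not taken from any fixture.
good = [0] * 256 + [1] * 100
bad = [0] * 200 + [1] * 10
print(segments_are_consistent(good), segments_are_consistent(bad))
```
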
## What's NOT yet decoded

- **4-byte variable metadata field (bytes 24:28)**. Not needed for
  waveform reconstruction. Speculation: per-block CRC, sub-second
  timestamp offset, or a Mic psi(L) count not in the 9 samples.
  Punt until something needs it.
- **Geo PVS (TXT col 7, e.g. "0.040 in/s")**. Not stored in the
  block; can be approximated as `sqrt(T_peak² + V_peak² + L_peak²)`
  but BW's value sometimes differs slightly (probably computed from
  waveform-instant samples, not from per-channel peaks). Punt: the
  `.h5` consumers don't need PVS as a sample channel.
- **Mic psi(L) value (TXT col 8)**. TXT shows it as a small psi value
  derived from the dB measurement. Not in the 9 samples. Could be
  derived from `M_peak_count` via the inverse of the dB formula plus
  a psi calibration constant. Defer.

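The PVS approximation mentioned above is easy to try on the block 130 numbers from this doc (T, V, L peak counts of 6, 4, 5). The sketch below uses a hypothetical helper name, `approx_pvs_in_s`, and only illustrates why the per-channel-peak formula is an upper-bound-style estimate, not BW's method.

```python
import math

GEO_IN_S_PER_COUNT = 0.005  # Normal range


def approx_pvs_in_s(t_count: int, v_count: int, l_count: int) -> float:
    """Vector sum of the three per-channel peaks, in in/s.

    The channel peaks need not occur at the same instant, so this
    can exceed a PVS computed from simultaneous samples.
    """
    t, v, l = (c * GEO_IN_S_PER_COUNT for c in (t_count, v_count, l_count))
    return math.sqrt(t * t + v * v + l * l)


pvs = approx_pvs_in_s(6, 4, 5)
print(f"{pvs:.4f} in/s")
```

For block 130 this gives roughly 0.044 in/s against the 0.040 in/s in the TXT row, which is consistent with BW computing PVS from simultaneous samples.
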
## Output shape

`decode_histogram_body` returns the standard 4-channel dict that
mirrors `waveform_codec.decode_waveform_v2`'s output:

```python
{
    "Tran": [peak_count_per_interval, ...],   # 16-count units (LSB = 0.005 in/s)
    "Vert": [..., ...],
    "Long": [..., ...],
    "MicL": [..., ...],                       # raw ADC counts
}
```

Run through `waveform_codec.decoded_to_adc_counts` to get 1-count ADC
units (geo ×16, mic passthrough) for the standard `.h5` writer.

For the full per-interval record with frequencies + metadata, use
`decode_histogram_body_full()`.

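The unit conversion described above (geo ×16, mic passthrough) amounts to a per-channel scale. The function below is a hypothetical stand-in that illustrates the rule; it is not the actual `waveform_codec.decoded_to_adc_counts`, whose signature is not shown in this doc.

```python
from typing import Dict, List

GEO_CHANNELS = ("Tran", "Vert", "Long")
GEO_SCALE = 16  # geo samples are in 16-count units; mic is already 1-count


def to_adc_counts(decoded: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Scale geo channels by 16 and pass every other channel through."""
    out: Dict[str, List[int]] = {}
    for name, samples in decoded.items():
        scale = GEO_SCALE if name in GEO_CHANNELS else 1
        out[name] = [sample * scale for sample in samples]
    return out


# One interval, using the block 130 peak counts quoted in this doc.
decoded = {"Tran": [6], "Vert": [4], "Long": [5], "MicL": [5]}
adc = to_adc_counts(decoded)
print(adc)
```
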
## Where it's wired

- `minimateplus/event_file_io.py:read_blastware_file()`: first tries
  the waveform codec, falls back to the histogram codec when the
  waveform preamble isn't present. Same output shape, same
  downstream pipeline.
- `scripts/backfill_sidecars.py`: the `has_samples` short-circuit
  added during the histogram-codec-pending era still serves as a
  defensive guard against truly undecodable files, but no longer
  fires for valid histograms.

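The try-then-fall-back wiring can be sketched generically. Everything here is hypothetical: `DecodeError`, `decode_with_fallback`, the `b"WAVE"` preamble check, and the two stub decoders stand in for the real codecs, whose error types and preamble bytes are not described in this doc.

```python
from typing import Callable, Dict, List

Channels = Dict[str, List[int]]
Decoder = Callable[[bytes], Channels]


class DecodeError(ValueError):
    """Hypothetical error a codec raises when the body is not its format."""


def decode_with_fallback(body: bytes, primary: Decoder, fallback: Decoder) -> Channels:
    """Try the primary codec; on DecodeError, hand the body to the fallback."""
    try:
        return primary(body)
    except DecodeError:
        return fallback(body)


def stub_waveform(body: bytes) -> Channels:
    # Stand-in preamble check; the real waveform preamble is different.
    if not body.startswith(b"WAVE"):
        raise DecodeError("waveform preamble not present")
    return {"Tran": [1], "Vert": [1], "Long": [1], "MicL": [1]}


def stub_histogram(body: bytes) -> Channels:
    return {"Tran": [6], "Vert": [4], "Long": [5], "MicL": [5]}


result = decode_with_fallback(b"\x00" * 32, stub_waveform, stub_histogram)
print(result["Tran"])
```

Because both decoders return the same channel dict shape, the caller does not need to know which codec handled the file.
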
## Companion reference

- `docs/waveform_codec_re_status.md`: sibling status doc for the
  much-more-complex waveform-mode codec.
- `docs/instantel_protocol_reference.md §7.6.2`: historical
  protocol-reference entry. Structural framing matches what we
  found; per-sample semantics were less documented than the
  `✅ CONFIRMED` badge suggested. This doc supersedes §7.6.2 where
  they conflict on confidence level.