d506ebc103
the BE9558 / BE18003 extension-byte case

The bytes at [7]/[11]/[15]/[19] are an annotation field (purpose still unclear; empirically non-zero on intervals with sub-Hz or unmeasurable freq), NOT the high byte of the peak count. The N844 fixture corpus the original RE was done against had zero values in those bytes for every block, so uint8 and uint16 LE were equivalent there, but on real BE9558 Tran-drift events and BE18003 Histogram+Continuous events the uint16 LE interpretation produced peaks up to 268 in/s and 35× inflated PVS sums.

Cross-correlated against BW's per-interval ASCII export on:

- K558LKZU/LL1P/LL3K → 100% T/V/L/M peak match (1435 blocks each)
- T003LKZR/LL0O/LL1M → 100% T/V/L, 99.3% M (0.05 dB rounding only)
- N599LKZS/LL0L → 100% all channels
- N844 fixture corpus → 100% all channels (unchanged)

Annotations preserved on every record for future RE; the defensive _MAX_PEAK_COUNT bound is no longer needed (uint8 maxes at 1.275 in/s, well below any physical limit). Synthetic regression test added using the verbatim K558LKZU.RE0H interval-12 block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Histogram body codec: FULLY DECODED (2026-05-20)

Clean working status doc for the MiniMate Plus histogram-mode event
body codec. Companion to `waveform_codec_re_status.md`. The deep
historical record (with retractions and dated analyses) lives in
`docs/instantel_protocol_reference.md §7.6.2`; the authoritative
implementation lives in `minimateplus/histogram_codec.py`.
## TL;DR

**The codec is fully decoded.** Every peak and frequency field of every
block in the in-repo histogram fixture corpus matches BW's ASCII export
to within its display precision. The annotation bytes and the 4-byte
field at [24:28] are still unexplained, but neither is needed for peak
amplitudes or waveform reconstruction.

26 regression tests pass against ~3,500 blocks across 5 in-repo
fixtures, plus a synthetic regression block taken from a real
BE9558 prod event to lock in the uint8-peak interpretation.

**Important correction (2026-05-21):** the per-channel peak count
is `uint8` at byte[6]/[10]/[14]/[18], NOT `uint16 LE` at byte[6:8]
etc. The N844 fixture corpus the original RE was done against has
zero values in bytes [7]/[11]/[15]/[19] for every block, so the
two interpretations happened to be equivalent there. Cross-correlating
non-N844 events (BE9558 Tran-drift, BE18003 Histogram+Continuous)
against BW's per-interval ASCII export (4 channels × ~1400 blocks
per event, across multiple events) gives a 100% match only when the
peak is read as uint8. Reading it as uint16 LE produced peaks up to
268 in/s per channel and 35× inflated PVS sums when first deployed to
prod (rolled back, root-caused, and fixed in commit 7183b95+1).
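To see why the two readings diverge, the sketch below decodes one 2-byte channel field both ways. The byte values are made up for illustration and are not taken from a real event; only the 0.005 in/s scale comes from this doc.

```python
import struct

GEO_IN_S_PER_COUNT = 0.005  # Normal-range scale (count × 0.005 = in/s)

# Hypothetical bytes [6:8] of a block: peak count 6, annotation byte 0xD1.
tran_field = bytes([0x06, 0xD1])

# Correct reading: byte[6] alone is the peak count.
peak_uint8 = tran_field[0]

# Old reading: byte[7] is folded in as the high byte of a uint16 LE.
(peak_uint16,) = struct.unpack("<H", tran_field)

print(f"uint8 : {peak_uint8} counts -> {peak_uint8 * GEO_IN_S_PER_COUNT:.3f} in/s")
print(f"uint16: {peak_uint16} counts -> {peak_uint16 * GEO_IN_S_PER_COUNT:.2f} in/s")
```

With an annotation byte of zero the two readings agree, which is why the N844 corpus could not tell them apart.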
## Body format

```
body = [stream of 32-byte data blocks] + [small trailing remnant]
```

Each block represents one histogram interval. Block layout:

```
[0]      0x00                      always-zero tag
[1]      segment_id (uint8)        0x00..0x03; 256 blocks per segment
[2:4]    block_ctr (uint16 LE)     resets each segment (0x0100, 0x0101, …)
[4:6]    0x000a (uint16 LE)        constant marker (= 10)
[6]      T_peak_count uint8        Tran peak (count × 0.005 → in/s at Normal,
                                   max 1.275 in/s; fits in uint8)
[7]      T_annotation uint8        empirically non-zero on intervals with sub-Hz
                                   or unmeasurable freq; meaning not fully RE'd
[8:10]   T_halfperiod uint16 LE    Tran half-period in samples
                                   (freq_Hz = 512 / halfp; ≤ 5 means ">100 Hz")
[10]     V_peak_count uint8        Vert peak
[11]     V_annotation uint8
[12:14]  V_halfperiod uint16 LE    Vert freq half-period
[14]     L_peak_count uint8        Long peak
[15]     L_annotation uint8
[16:18]  L_halfperiod uint16 LE    Long freq half-period
[18]     M_peak_count uint8        MicL peak count
                                   (dB via waveform_codec.mic_count_to_db)
[19]     M_annotation uint8
[20:22]  M_halfperiod uint16 LE    MicL freq half-period
[22:24]  0x00 0x00                 constant
[24:28]  4-byte variable           purpose unknown; possibly CRC,
                                   timestamp delta, or psi(L) numeric;
                                   not needed for waveform reconstruction
[28:32]  0x1e 0x0a 0x00 0x00       constant block-end signature
```

Reliable block-identification anchor:

```python
block[22:24] == b"\x00\x00" and block[28:32] == b"\x1e\x0a\x00\x00"
```

(The `1e 0a 00 00` constant tail is the most distinctive signature.)
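The layout above can be unpacked with the standard `struct` module. The following is a minimal sketch, not the code in `minimateplus/histogram_codec.py`: the names `looks_like_block`, `parse_block` and `iter_blocks` are hypothetical, and it assumes blocks start at offset 0 of the body and sit on 32-byte boundaries. The example block reuses the N844L6Z8.ZR0H block 130 peak and half-period values, but its counter and bytes [24:28] are made up.

```python
import struct
from typing import Iterator

BLOCK_SIZE = 32
BLOCK_TAIL = b"\x1e\x0a\x00\x00"
CHANNELS = ("Tran", "Vert", "Long", "MicL")


def looks_like_block(block: bytes) -> bool:
    """Apply the block-identification anchor from the layout."""
    return (
        len(block) == BLOCK_SIZE
        and block[22:24] == b"\x00\x00"
        and block[28:32] == BLOCK_TAIL
    )


def parse_block(block: bytes) -> dict:
    """Unpack one 32-byte interval block into plain Python values."""
    if not looks_like_block(block):
        raise ValueError("not a histogram interval block")
    block_ctr, marker = struct.unpack_from("<HH", block, 2)
    record = {"segment_id": block[1], "block_ctr": block_ctr, "marker": marker}
    for i, name in enumerate(CHANNELS):
        # Each channel takes 4 bytes from offset 6:
        # peak count (uint8), annotation (uint8), half-period (uint16 LE).
        peak, annotation, halfperiod = struct.unpack_from("<BBH", block, 6 + 4 * i)
        record[name] = {
            "peak_count": peak,
            "annotation": annotation,
            "halfperiod": halfperiod,
        }
    return record


def iter_blocks(body: bytes) -> Iterator[dict]:
    """Yield a record per whole block; a short trailing remnant is skipped."""
    for offset in range(0, len(body) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = body[offset:offset + BLOCK_SIZE]
        if looks_like_block(chunk):
            yield parse_block(chunk)


# Synthetic block: real peak/half-period values, made-up framing bytes.
example = (
    bytes([0x00, 0x00]) + struct.pack("<HH", 0x0182, 10)
    + struct.pack("<BBH", 6, 0, 24)   # Tran
    + struct.pack("<BBH", 4, 0, 18)   # Vert
    + struct.pack("<BBH", 5, 0, 21)   # Long
    + struct.pack("<BBH", 5, 0, 9)    # MicL
    + b"\x00\x00" + b"\x00\x00\x00\x00" + BLOCK_TAIL
)
records = list(iter_blocks(example + b"\x01\x02\x03"))  # 3-byte fake remnant
print(records[0]["Tran"])
```

Reading the peak and annotation as two separate `B` fields, rather than one `H`, is what keeps a non-zero annotation byte out of the peak value.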
## Per-channel encoding

| Channel | Peak encoding | Frequency encoding |
|---|---|---|
| Tran | count × 0.005 = in/s at Normal range | `freq_Hz = 512 / halfperiod` |
| Vert | same | same |
| Long | same | same |
| MicL | count → dB via `mic_count_to_db(count)` (same formula as waveform codec) | same |

**`>100 Hz` sentinel**: when halfperiod ≤ 5 (512/5 = 102.4 Hz, so the
formula would give more than 100 Hz), BW displays `>100 Hz`. The codec's
`half_period_to_hz` returns `None` in this range.
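The conversions in the table can be sketched as three small functions. These are stand-ins written for this doc, not the repo's `half_period_to_hz` or `mic_count_to_db`; the `_sketch` names are hypothetical. The mic formula `81.94 + 20·log10(count)` is the one used in the worked example in this doc, and returning `None` for a mic count of 0 is an assumption made only to avoid `log10(0)`.

```python
import math
from typing import Optional

GEO_IN_S_PER_COUNT = 0.005    # Normal range: count × 0.005 = in/s
HALFPERIOD_SCALE = 512        # freq_Hz = 512 / halfperiod
SENTINEL_MAX_HALFPERIOD = 5   # halfperiod <= 5 is displayed by BW as ">100 Hz"


def geo_count_to_in_s(count: int) -> float:
    """Geophone peak count to in/s at Normal range."""
    return count * GEO_IN_S_PER_COUNT


def half_period_to_hz_sketch(halfperiod: int) -> Optional[float]:
    """Half-period in samples to Hz; None stands for the '>100 Hz' sentinel."""
    if halfperiod <= SENTINEL_MAX_HALFPERIOD:
        return None
    return HALFPERIOD_SCALE / halfperiod


def mic_count_to_db_sketch(count: int) -> Optional[float]:
    """Mic peak count to dB; None for count <= 0 is an assumption."""
    if count <= 0:
        return None
    return 81.94 + 20 * math.log10(count)


print(round(geo_count_to_in_s(6), 3))        # Tran peak in in/s
print(half_period_to_hz_sketch(24))          # Tran frequency in Hz
print(half_period_to_hz_sketch(5))           # sentinel range
print(round(mic_count_to_db_sketch(5), 2))   # mic level in dB
```

Because the sentinel check runs before the division, a half-period of 0 also falls into the `None` branch instead of raising `ZeroDivisionError`.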
## Verified facts (cross-checked against fixture corpus)

Example: N844L6Z8.ZR0H block 130 → all 8 decoded fields match the TXT row.
The nine "binary samples" are block bytes [4:22] read as uint16 LE words.
Every annotation byte is zero in the N844 corpus, so each peak word here
equals its uint8 peak count:

```
binary samples  [10, 6, 24, 4, 18, 5, 21, 5, 9]
TXT row         [0.030, 21, 0.020, 28, 0.025, 24, 0.040, 0.000, 95.92, 57]

slot[0] = 10                                         marker
slot[1] = 6  × 0.005 = 0.030 in/s                 ✓  T_peak
slot[2] = 24 → 512/24 = 21.3 → 21 Hz              ✓  T_freq
slot[3] = 4  × 0.005 = 0.020 in/s                 ✓  V_peak
slot[4] = 18 → 512/18 = 28.4 → 28 Hz              ✓  V_freq
slot[5] = 5  × 0.005 = 0.025 in/s                 ✓  L_peak
slot[6] = 21 → 512/21 = 24.4 → 24 Hz              ✓  L_freq
slot[7] = 5  → 81.94 + 20·log10(5) = 95.92 dB     ✓  M_peak
slot[8] = 9  → 512/9 = 56.9 → 57 Hz               ✓  M_freq
```
## Verified test coverage

`tests/test_histogram_codec.py` (24 tests):

- Block walking: yields one record per `.TXT` interval ± 1 (off-by-one
  at the tail when recording was stopped mid-write). Segment-ID
  groups of 256 blocks confirmed.
- Geo peaks: every block of N844L20G, N844L6Z8, N844L6XE, N844L23B
  matches `.TXT` within the 0.0005 in/s quantization step.
- Geo freqs: every block of N844L6Z8 and N844L6XE matches `.TXT`
  within 1 Hz (BW display rounds). `>100 Hz` sentinel handled correctly.
- Mic dB: every block of N844L6XE, N844L23B, N844L6Z8 matches `.TXT`
  within 0.1 dB (BW display precision).
- Mic freq: matches `.TXT` within 1 Hz across active blocks.
## What's NOT yet decoded

- **Annotation bytes (`block[7]/[11]/[15]/[19]`)**. Empirically
  non-zero on intervals where the per-channel ZC frequency comes
  out as `N/A` or sub-Hz (`<1.0`, `1.X`). Hypothesis tested in the
  RE session: byte != 0 ↔ sub-Hz freq. Only ~50% correlation
  across the K558 corpus, so the relationship is more complex.
  Possibilities: time-of-peak-within-interval, halfp extension for
  very-long-period signals, or a debug/diagnostic field the firmware
  writes opportunistically. Doesn't affect peak amplitudes or
  waveform reconstruction. Captured as `record["annotations"]` for
  future RE.
- **4-byte variable metadata field (bytes 24:28)**. Not needed for
  waveform reconstruction. Speculation: per-block CRC, sub-second
  timestamp offset, or a Mic psi(L) count not in the 9 samples.
  Punt until something needs it.
- **Geo PVS (TXT col 7, e.g. "0.040 in/s")**. Not stored in the
  block; can be approximated as `sqrt(T_peak² + V_peak² + L_peak²)`
  but BW's value sometimes differs slightly (probably computed from
  waveform-instant samples, not from per-channel peaks). Punt: the
  `.h5` consumers don't need PVS as a sample channel.
- **Mic psi(L) value (TXT col 8)**. TXT shows it as a small psi value
  derived from the dB measurement. Not in the 9 samples. Could be
  derived from `M_peak_count` via the inverse of the dB formula plus
  a psi calibration constant. Defer.
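The PVS approximation mentioned in the list can be sketched as below. `approx_pvs_in_s` is a hypothetical helper, not part of the codec, and the inputs are the T/V/L peak counts from the N844L6Z8.ZR0H block 130 example, whose TXT row reports a PVS of 0.040 in/s.

```python
import math

GEO_IN_S_PER_COUNT = 0.005  # Normal range: count × 0.005 = in/s


def approx_pvs_in_s(t_count: int, v_count: int, l_count: int) -> float:
    """Vector sum of the three per-channel peaks, in in/s.

    This is an upper-bound style estimate: the three peaks need not occur
    at the same instant, so it can exceed a PVS computed sample by sample.
    """
    t, v, l = (c * GEO_IN_S_PER_COUNT for c in (t_count, v_count, l_count))
    return math.sqrt(t * t + v * v + l * l)


pvs = approx_pvs_in_s(6, 4, 5)
print(f"approx PVS = {pvs:.3f} in/s (TXT row reports 0.040)")
```

For this block the estimate comes out near 0.044 in/s against BW's 0.040, which illustrates the small mismatch described above.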
## Output shape

`decode_histogram_body` returns the standard 4-channel dict that
mirrors `waveform_codec.decode_waveform_v2`'s output:

```python
{
    "Tran": [peak_count_per_interval, ...],   # 16-count units (LSB = 0.005 in/s)
    "Vert": [..., ...],
    "Long": [..., ...],
    "MicL": [..., ...],                       # raw ADC counts
}
```

Run through `waveform_codec.decoded_to_adc_counts` to get 1-count ADC
units (geo ×16, mic passthrough) for the standard `.h5` writer.

For the full per-interval record with frequencies + metadata, use
`decode_histogram_body_full()`.
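As a rough illustration of the "geo ×16, mic passthrough" scaling, the sketch below applies it to a one-interval dict of the shape shown above. `to_adc_counts_sketch` is a hypothetical stand-in and makes no claim about how `waveform_codec.decoded_to_adc_counts` is actually implemented or what else it may do.

```python
from typing import Dict, List

GEO_CHANNELS = ("Tran", "Vert", "Long")
GEO_SCALE = 16  # geo values are in 16-count units; mic is already raw counts


def to_adc_counts_sketch(decoded: Dict[str, List[int]]) -> Dict[str, List[int]]:
    """Scale geo channels by 16 and pass every other channel through unchanged."""
    out: Dict[str, List[int]] = {}
    for name, counts in decoded.items():
        scale = GEO_SCALE if name in GEO_CHANNELS else 1
        out[name] = [count * scale for count in counts]
    return out


# One interval, using the peak counts from N844L6Z8.ZR0H block 130.
decoded = {"Tran": [6], "Vert": [4], "Long": [5], "MicL": [5]}
adc = to_adc_counts_sketch(decoded)
print(adc)
```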
## Where it's wired

- `minimateplus/event_file_io.py:read_blastware_file()`: first tries
  the waveform codec, falls back to the histogram codec when the
  waveform preamble isn't present. Same output shape, same
  downstream pipeline.
- `scripts/backfill_sidecars.py`: the `has_samples` short-circuit
  added during the histogram-codec-pending era still serves as a
  defensive guard against truly undecodable files, but no longer
  fires for valid histograms.
## Companion reference

- `docs/waveform_codec_re_status.md`: sibling status doc for the
  much-more-complex waveform-mode codec.
- `docs/instantel_protocol_reference.md §7.6.2`: historical
  protocol-reference entry. Structural framing matches what we
  found; per-sample semantics were less documented than the `✅
  CONFIRMED` badge suggested. This doc supersedes §7.6.2 where they
  conflict on confidence level.