seismo-relay/docs/histogram_codec_re_status.md

# Histogram body codec — FULLY DECODED (2026-05-20)

Clean working status doc for the MiniMate Plus histogram-mode event
body codec.  Companion to `waveform_codec_re_status.md`.  The deep
historical record (with retractions and dated analyses) lives in
`docs/instantel_protocol_reference.md §7.6.2`; the authoritative
implementation lives in `minimateplus/histogram_codec.py`.

## TL;DR

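**The codec is fully decoded.**  Every peak and frequency field of
every block in the in-repo histogram fixture corpus decodes byte-exact
against Blastware's (BW's) ASCII export.  The bytes with no confirmed
meaning (the per-channel annotation bytes and the 4-byte field at
[24:28]) are not needed for that result and are listed under
"What's NOT yet decoded".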

26 regression tests pass against ~3,500 blocks across 5 in-repo
fixtures, plus a synthetic regression block taken from a real
BE9558 prod event to lock in the uint8-peak interpretation.

**Important correction (2026-05-21):** the per-channel peak count
is `uint8` at byte[6]/[10]/[14]/[18], NOT `uint16 LE` at byte[6:8]
etc.  The N844 fixture corpus the original RE was done against has
zero values in bytes [7]/[11]/[15]/[19] for every block, so the
two interpretations happened to be equivalent.  Cross-correlating
non-N844 events (BE9558 Tran-drift, BE18003 Histogram+Continuous)
against BW's per-interval ASCII export (4 channels × ~1400 blocks
per event, across multiple events) is 100% byte-exact only when the
peak is read as uint8.  Reading it as uint16 LE produced peaks up to
268 in/s per channel and 35× inflated PVS sums when first deployed to
prod (rolled back, root-caused, and fixed in commit 7183b95+1).
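
A minimal sketch of why the two readings diverge.  The byte values are
invented for illustration and are not taken from a real event; only the
0.005 in/s scale comes from this doc, and `read_both` is a hypothetical
helper name.

```python
import struct

IN_S_PER_COUNT = 0.005  # geo scale at Normal range, per this doc

def read_both(pair: bytes) -> tuple[int, int]:
    """Return (uint8 reading, uint16 LE reading) of a peak/annotation byte pair."""
    as_u8 = pair[0]
    (as_u16,) = struct.unpack("<H", pair)
    return as_u8, as_u16

# Annotation byte zero (as in every N844 block): both readings agree.
print(read_both(bytes([12, 0x00])))   # (12, 12)

# Hypothetical non-zero annotation byte: the uint16 reading is inflated.
u8, u16 = read_both(bytes([12, 0x03]))
print(u8 * IN_S_PER_COUNT, u16 * IN_S_PER_COUNT)  # roughly 0.06 vs 3.9 in/s
```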

## Body format

```
body = [stream of 32-byte data blocks] + [small trailing remnant]
```

Each block represents one histogram interval.  Block layout:

```
[0]    0x00                      always-zero tag
[1]    segment_id (uint8)        0x00..0x03 — 256 blocks per segment
[2:4]  block_ctr (uint16 LE)     resets each segment (0x0100, 0x0101, …)
[4:6]  0x000a (uint16 LE)        constant marker (= 10)
[6]    T_peak_count   uint8      Tran peak (count × 0.005 → in/s at Normal,
                                  max 1.275 in/s — fits in uint8)
[7]    T_annotation   uint8      empirically non-zero on intervals with sub-Hz
                                  or unmeasurable freq; meaning not fully RE'd
[8:10] T_halfperiod   uint16 LE  Tran half-period in samples
                                  (freq_Hz = 512 / halfp; ≤ 5 means ">100 Hz")
[10]   V_peak_count   uint8      Vert peak
[11]   V_annotation   uint8
[12:14] V_halfperiod  uint16 LE  Vert freq half-period
[14]   L_peak_count   uint8      Long peak
[15]   L_annotation   uint8
[16:18] L_halfperiod  uint16 LE  Long freq half-period
[18]   M_peak_count   uint8      MicL peak count
                                  (dB via waveform_codec.mic_count_to_db)
[19]   M_annotation   uint8
[20:22] M_halfperiod  uint16 LE  MicL freq half-period
[22:24] 0x00 0x00                constant
[24:28] 4-byte variable          purpose unknown — possibly CRC,
                                  timestamp delta, or psi(L) numeric;
                                  not needed for waveform reconstruction
[28:32] 0x1e 0x0a 0x00 0x00      constant block-end signature
```
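
The layout maps directly onto a fixed `struct` format.  The sketch below
is one way to unpack it, not the code in `minimateplus/histogram_codec.py`;
`parse_block` and the dict keys are illustrative names.  The example block
reuses the peak and half-period values of N844L6Z8 block 130 from the
"Verified facts" section, while the segment, counter, and [24:28] bytes
are placeholders.

```python
import struct

# One 32-byte block, little-endian, no padding. Field order follows the
# layout above; the dict keys are illustrative, not the real codec's.
BLOCK = struct.Struct("<BBHH" + "BBH" * 4 + "2s4s4s")
CHANNELS = ("Tran", "Vert", "Long", "MicL")

def parse_block(block: bytes) -> dict:
    f = BLOCK.unpack(block)  # raises struct.error unless len(block) == 32
    record = {
        "segment_id": f[1],
        "block_ctr": f[2],
        "marker": f[3],          # expected to be 10
        "unknown_24_28": f[17],  # purpose unknown
    }
    for i, name in enumerate(CHANNELS):
        # Each channel is (peak uint8, annotation uint8, halfperiod uint16 LE).
        peak, annotation, halfperiod = f[4 + 3 * i : 7 + 3 * i]
        record[name] = {
            "peak_count": peak,
            "annotation": annotation,
            "halfperiod": halfperiod,
        }
    return record

# Placeholder header/metadata bytes, block-130 channel values.
example = BLOCK.pack(0, 0, 0, 10,
                     6, 0, 24,  4, 0, 18,  5, 0, 21,  5, 0, 9,
                     b"\x00\x00", b"\x00" * 4, b"\x1e\x0a\x00\x00")
print(parse_block(example)["Tran"])
```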

Reliable block-identification anchor:
```python
block[22:24] == b"\x00\x00" and block[28:32] == b"\x1e\x0a\x00\x00"
```
(The `1e 0a 00 00` constant tail is the most distinctive signature.)
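
A hedged sketch of walking a body with that anchor.  It assumes the body
starts on a block boundary and that blocks are contiguous, which matches
the `body = blocks + remnant` description above but is not confirmed here
for every file; `is_block` and `iter_blocks` are hypothetical names.

```python
from typing import Iterator

BLOCK_SIZE = 32
TAIL = b"\x1e\x0a\x00\x00"

def is_block(chunk: bytes) -> bool:
    """Anchor test from above, plus a length check."""
    return (len(chunk) == BLOCK_SIZE
            and chunk[22:24] == b"\x00\x00"
            and chunk[28:32] == TAIL)

def iter_blocks(body: bytes) -> Iterator[bytes]:
    """Yield 32-byte blocks from offset 0 until the anchor stops matching."""
    for off in range(0, len(body) - BLOCK_SIZE + 1, BLOCK_SIZE):
        chunk = body[off:off + BLOCK_SIZE]
        if not is_block(chunk):
            break  # treat everything from here on as the trailing remnant
        yield chunk

# Three synthetic blocks (all-zero payload) followed by a 3-byte remnant.
fake_block = bytes(28) + TAIL
body = fake_block * 3 + b"\x01\x02\x03"
print(len(list(iter_blocks(body))))  # 3
```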

## Per-channel encoding

| Channel | Peak encoding | Frequency encoding |
|---|---|---|
| Tran | count × 0.005 = in/s at Normal range | `freq_Hz = 512 / halfperiod` |
| Vert | same | same |
| Long | same | same |
| MicL | count → dB via `mic_count_to_db(count)` (same formula as waveform codec) | same |

**`>100 Hz` sentinel**: when halfperiod ≤ 5 (512/5 = 102.4 Hz, so the
512/halfp formula would give more than 100 Hz), BW displays `>100 Hz`.
The codec's `half_period_to_hz` returns `None` in this range.
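
The two conversions can be sketched as below.  These are stand-ins written
from the table and the sentinel rule above, not the real
`half_period_to_hz`; in particular, treating a half-period of 0 as part of
the `>100 Hz` range is an assumption that follows from the `≤ 5` rule and
conveniently avoids a division by zero.

```python
from typing import Optional

IN_S_PER_COUNT = 0.005   # geo peak scale at Normal range
ZC_NUMERATOR = 512       # freq_Hz = 512 / halfperiod

def geo_count_to_in_s(count: int) -> float:
    """Geo peak count -> in/s at Normal range."""
    return count * IN_S_PER_COUNT

def halfperiod_to_hz_sketch(halfperiod: int) -> Optional[float]:
    """Half-period in samples -> Hz, or None for the '>100 Hz' sentinel."""
    if halfperiod <= 5:
        return None
    return ZC_NUMERATOR / halfperiod

print(halfperiod_to_hz_sketch(5))          # None
print(round(halfperiod_to_hz_sketch(24)))  # 21
```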

## Verified facts (cross-checked against fixture corpus)

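Example: N844L6Z8.ZR0H block 130 → all 8 decoded fields byte-exact.
The nine `slot` values below are bytes [4:22] read as uint16 LE words;
the annotation bytes are zero in this block, so each peak word equals
its uint8 peak count: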

```
binary samples [10, 6, 24, 4, 18, 5, 21, 5, 9]
TXT row        [0.030, 21, 0.020, 28, 0.025, 24, 0.040, 0.000, 95.92, 57]

slot[0] = 10                                  marker
slot[1] = 6  × 0.005 = 0.030 in/s         ✓ T_peak
slot[2] = 24 → 512/24 = 21.3 → 21 Hz      ✓ T_freq
slot[3] = 4  × 0.005 = 0.020 in/s         ✓ V_peak
slot[4] = 18 → 512/18 = 28.4 → 28 Hz      ✓ V_freq
slot[5] = 5  × 0.005 = 0.025 in/s         ✓ L_peak
slot[6] = 21 → 512/21 = 24.4 → 24 Hz      ✓ L_freq
slot[7] = 5  → 81.94 + 20·log10(5) = 95.92 dB  ✓ M_peak
slot[8] = 9  → 512/9 = 56.9 → 57 Hz       ✓ M_freq
```
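
The same arithmetic as a runnable check.  `mic_db` is a stand-in that
copies the formula from the `slot[7]` line above; the real
`waveform_codec.mic_count_to_db` may handle edge cases (such as a count
of 0, which would raise here) differently.

```python
import math

samples = [10, 6, 24, 4, 18, 5, 21, 5, 9]  # block 130, from above

def mic_db(count: int) -> float:
    # Formula as written in the worked example; count must be > 0.
    return 81.94 + 20 * math.log10(count)

marker, t_pk, t_hp, v_pk, v_hp, l_pk, l_hp, m_pk, m_hp = samples

# Same column order as the TXT row, minus PVS and psi (not stored in the block).
row = [
    round(t_pk * 0.005, 3), round(512 / t_hp),
    round(v_pk * 0.005, 3), round(512 / v_hp),
    round(l_pk * 0.005, 3), round(512 / l_hp),
    round(mic_db(m_pk), 2), round(512 / m_hp),
]
print(row)
```

The freq values here are nowhere near a `.5` boundary, so Python's
round-half-to-even behaviour does not matter for this block; whether BW
rounds halves the same way is not established in this doc.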

## Verified test coverage

`tests/test_histogram_codec.py` (24 tests):

- Block walking: yields one record per `.TXT` interval ± 1 (off-by-one
  at the tail when recording was stopped mid-write).  Segment-ID
  groups of 256 blocks confirmed.
- Geo peaks: every block of N844L20G, N844L6Z8, N844L6XE, N844L23B
  matches `.TXT` within 0.0005 in/s, a tenth of the 0.005 in/s
  quantization step.
- Geo freqs: every block of N844L6Z8 and N844L6XE matches `.TXT`
  within 1 Hz (BW display rounds).  `>100 Hz` sentinel handled correctly.
- Mic dB: every block of N844L6XE, N844L23B, N844L6Z8 matches `.TXT`
  within 0.1 dB (BW display precision).
- Mic freq: matches `.TXT` within 1 Hz across active blocks.

## What's NOT yet decoded

- **Annotation bytes (`block[7]/[11]/[15]/[19]`)**.  Empirically
  non-zero on intervals where the per-channel ZC frequency comes
  out as `N/A` or sub-Hz (`<1.0`, `1.X`).  Hypothesis tested in the
  RE session: byte != 0 ↔ sub-Hz freq.  Only ~50% correlation
  across the K558 corpus, so the relationship is more complex.
  Possibilities: time-of-peak-within-interval, halfp extension for
  very-long-period signals, or a debug/diagnostic field the firmware
  writes opportunistically.  Doesn't affect peak amplitudes or
  waveform reconstruction.  Captured as `record["annotations"]` for
  future RE.
- **4-byte variable metadata field (bytes 24:28)**.  Not needed for
  waveform reconstruction.  Speculation: per-block CRC, sub-second
  timestamp offset, or a Mic psi(L) count not in the 9 samples.
  Punt until something needs it.
- **Geo PVS (TXT col 7, e.g. "0.040 in/s")**.  Not stored in the
  block; can be approximated as `sqrt(T_peak² + V_peak² + L_peak²)`
  but BW's value sometimes differs slightly (probably computed from
  waveform-instant samples, not from per-channel peaks).  Punt — the
  `.h5` consumers don't need PVS as a sample channel.
- **Mic psi(L) value (TXT col 8)**.  TXT shows it as a small psi value
  derived from the dB measurement.  Not in the 9 samples.  Could be
  derived from `M_peak_count` via the inverse of the dB formula plus
  a psi calibration constant.  Defer.
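
To illustrate the PVS point, here is the vector-sum approximation applied
to the block 130 peaks from the worked example.  `approx_pvs` is a
hypothetical helper, not something the codec exposes.

```python
import math

def approx_pvs(t_peak: float, v_peak: float, l_peak: float) -> float:
    """Vector sum of the three per-channel geo peaks, in in/s."""
    return math.sqrt(t_peak ** 2 + v_peak ** 2 + l_peak ** 2)

# Block 130 peaks: T = 0.030, V = 0.020, L = 0.025 in/s.
pvs = approx_pvs(0.030, 0.020, 0.025)
print(f"{pvs:.4f}")  # 0.0439
```

The TXT row for that block lists 0.040 in/s, so the approximation
overshoots slightly here, consistent with the three channel peaks not
necessarily occurring at the same instant.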

## Output shape

`decode_histogram_body` returns the standard 4-channel dict that
mirrors `waveform_codec.decode_waveform_v2`'s output:

```python
{
    "Tran": [peak_count_per_interval, ...],   # 16-count units (LSB = 0.005 in/s)
    "Vert": [..., ...],
    "Long": [..., ...],
    "MicL": [..., ...],                       # raw ADC counts
}
```

Run through `waveform_codec.decoded_to_adc_counts` to get 1-count ADC
units (geo ×16, mic passthrough) for the standard `.h5` writer.
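
A sketch of what that scaling step amounts to, written only from the
"geo ×16, mic passthrough" description.  It is a stand-in, not the real
`waveform_codec.decoded_to_adc_counts`, whose signature and any extra
handling are not shown in this doc.

```python
GEO_CHANNELS = ("Tran", "Vert", "Long")

def to_adc_counts_sketch(decoded: dict) -> dict:
    """Scale geo channels by 16; pass every other channel through unchanged."""
    out = {}
    for name, values in decoded.items():
        scale = 16 if name in GEO_CHANNELS else 1
        out[name] = [v * scale for v in values]
    return out

# Single-interval input using the block 130 peak counts.
decoded = {"Tran": [6], "Vert": [4], "Long": [5], "MicL": [5]}
print(to_adc_counts_sketch(decoded))
# {'Tran': [96], 'Vert': [64], 'Long': [80], 'MicL': [5]}
```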

For the full per-interval record with frequencies + metadata, use
`decode_histogram_body_full()`.

## Where it's wired

- `minimateplus/event_file_io.py:read_blastware_file()` — first tries
  the waveform codec, falls back to the histogram codec when the
  waveform preamble isn't present.  Same output shape, same
  downstream pipeline.
- `scripts/backfill_sidecars.py` — the `has_samples` short-circuit
  added during the histogram-codec-pending era still serves as a
  defensive guard against truly undecodable files, but no longer
  fires for valid histograms.

## Companion reference

- `docs/waveform_codec_re_status.md` — sibling status doc for the
  much-more-complex waveform-mode codec.
- `docs/instantel_protocol_reference.md §7.6.2` — historical
  protocol-reference entry.  Structural framing matches what we
  found; per-sample semantics were less documented than the `✅
  CONFIRMED` badge suggested.  This doc supersedes §7.6.2 where they
  conflict on confidence level.