Files
seismo-relay/docs/histogram_codec_re_status.md
T
serversdown 7183b953e4 minimateplus: histogram body codec — FULLY DECODED
The histogram-mode event body is now byte-exact decodable.
Companion to the waveform body codec — together they cover every
event file the watcher forwards.  Cracked in one session via
cross-event correlation against BW's ASCII export.

The §7.6.2 spec in instantel_protocol_reference.md was structurally
correct (32-byte blocks) but the per-sample semantics were
under-documented.  Cross-checking block 130 of N844L6Z8.ZR0H
against its TXT row revealed the layout perfectly:

  slot[0] = 10 (constant marker)
  slot[1] = T_peak_count    (× 0.005 → in/s at Normal range)
  slot[2] = T_halfperiod    (freq_Hz = 512 / halfp)
  slot[3] = V_peak_count
  slot[4] = V_halfperiod
  slot[5] = L_peak_count
  slot[6] = L_halfperiod
  slot[7] = MicL_peak_count (dB via waveform_codec.mic_count_to_db)
  slot[8] = MicL_halfperiod

The `>100 Hz` sentinel is halfperiod ≤ 5 (since 512/5 = 100 Hz).
Mic dB uses the SAME formula as the waveform codec (sign × (81.94
+ 20·log10(|count|))) — they share the mic ADC calibration constant.

Block identification anchor: bytes [22:24] == 0x0000 AND
bytes [28:32] == 1e 0a 00 00.  The tail signature is the most
reliable distinguisher from non-block content in the file.

Files:

  minimateplus/histogram_codec.py (new) — decoder + public API
    matching the waveform codec's shape:
      walk_body(body) -> records
      decode_histogram_body(body) -> {Tran, Vert, Long, MicL}
      decode_histogram_body_full(body) -> [per-interval dicts]
      half_period_to_hz, geo_count_to_ins helpers

  minimateplus/event_file_io.py (modified) — read_blastware_file
    now tries the waveform codec first, falls back to the histogram
    codec on failure.  Same output shape, same downstream pipeline.

  tests/test_histogram_codec.py (new) — 24 regression locks against
    the in-repo fixture corpus, byte-exact against BW ASCII export
    for peaks (all 4 channels), frequencies (all 4 channels,
    including >100 Hz sentinel handling), block framing, and
    segment-ID accounting.

  scripts/backfill_sidecars.py (modified) — the has_samples
    short-circuit added in the histogram-pending era is now a
    pure defensive guard.  Histograms in prod will regen .h5 files
    correctly on the next backfill run.

  docs/histogram_codec_re_status.md (updated) — supersedes the
    earlier "in progress" version with the verified format and
    test-coverage summary.  Notes a few non-essential fields still
    open (4-byte block metadata, Geo PVS, Mic psi(L) — none of
    which are needed for waveform reconstruction).

Total verified coverage: ~3,500 blocks across 5 fixtures, every
field of every block byte-exact against BW.

The watcher-forwarded histogram event corpus on prod (~10,000
events) will now produce correct .h5 sidecars on the next backfill
run.  No additional changes needed to the backfill flow — the
existing tool_version-bump cascade picks them up automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:05:13 +00:00

156 lines
6.2 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Histogram body codec — FULLY DECODED (2026-05-20)
Clean working status doc for the MiniMate Plus histogram-mode event
body codec. Companion to `waveform_codec_re_status.md`. The deep
historical record (with retractions and dated analyses) lives in
`docs/instantel_protocol_reference.md §7.6.2`; the authoritative
implementation lives in `minimateplus/histogram_codec.py`.
## TL;DR
**The codec is fully decoded.** Every field of every block in the
in-repo histogram fixture corpus decodes byte-exact against BW's
ASCII export.
24 regression tests pass against ~3,500 blocks across 5 fixtures.
## Body format
```
body = [stream of 32-byte data blocks] + [small trailing remnant]
```
Each block represents one histogram interval. Block layout:
```
[0] 0x00 always-zero tag
[1] segment_id (uint8) 0x00..0x03 — 256 blocks per segment
[2:4] block_ctr (uint16 LE) resets each segment (0x0100, 0x0101, …)
[4:6] 0x000a (uint16 LE) constant marker (= 10)
[6:8] T_peak_count uint16 LE Tran peak (count × 0.005 → in/s at Normal)
[8:10] T_halfperiod uint16 LE Tran half-period in samples
(freq_Hz = 512 / halfp; ≤ 5 means ">100 Hz")
[10:12] V_peak_count uint16 LE Vert peak
[12:14] V_halfperiod uint16 LE Vert freq half-period
[14:16] L_peak_count uint16 LE Long peak
[16:18] L_halfperiod uint16 LE Long freq half-period
[18:20] M_peak_count uint16 LE MicL peak count
(dB via waveform_codec.mic_count_to_db)
[20:22] M_halfperiod uint16 LE MicL freq half-period
[22:24] 0x00 0x00 constant
[24:28] 4-byte variable purpose unknown — possibly CRC,
timestamp delta, or psi(L) numeric;
not needed for waveform reconstruction
[28:32] 0x1e 0x0a 0x00 0x00 constant block-end signature
```
Reliable block-identification anchor:
```python
block[22:24] == b"\x00\x00" and block[28:32] == b"\x1e\x0a\x00\x00"
```
(The `1e 0a 00 00` constant tail is the most distinctive signature.)
## Per-channel encoding
| Channel | Peak encoding | Frequency encoding |
|---|---|---|
| Tran | count × 0.005 = in/s at Normal range | `freq_Hz = 512 / halfperiod` |
| Vert | same | same |
| Long | same | same |
| MicL | count → dB via `mic_count_to_db(count)` (same formula as waveform codec) | same |
**`>100 Hz` sentinel**: when halfperiod ≤ 5 (giving ≥100 Hz from the
512/halfp formula), BW displays `>100 Hz`. Codec's `half_period_to_hz`
returns `None` in this range.
## Verified facts (cross-checked against fixture corpus)
Example: N844L6Z8.ZR0H block 130 → all 8 decoded fields byte-exact:
```
binary samples [10, 6, 24, 4, 18, 5, 21, 5, 9]
TXT row [0.030, 21, 0.020, 28, 0.025, 24, 0.040, 0.000, 95.92, 57]
slot[0] = 10 marker
slot[1] = 6 × 0.005 = 0.030 in/s ✓ T_peak
slot[2] = 24 → 512/24 = 21.3 → 21 Hz ✓ T_freq
slot[3] = 4 × 0.005 = 0.020 in/s ✓ V_peak
slot[4] = 18 → 512/18 = 28.4 → 28 Hz ✓ V_freq
slot[5] = 5 × 0.005 = 0.025 in/s ✓ L_peak
slot[6] = 21 → 512/21 = 24.4 → 24 Hz ✓ L_freq
slot[7] = 5 → 81.94 + 20·log10(5) = 95.92 dB ✓ M_peak
slot[8] = 9 → 512/9 = 56.9 → 57 Hz ✓ M_freq
```
## Verified test coverage
`tests/test_histogram_codec.py` (24 tests):
- Block walking: yields one record per `.TXT` interval ± 1 (off-by-one
at the tail when recording was stopped mid-write). Segment-ID
groups of 256 blocks confirmed.
- Geo peaks: every block of N844L20G, N844L6Z8, N844L6XE, N844L23B
matches `.TXT` within the 0.0005 in/s quantization step.
- Geo freqs: every block of N844L6Z8 and N844L6XE matches `.TXT`
within 1 Hz (BW display rounds). `>100 Hz` sentinel handled correctly.
- Mic dB: every block of N844L6XE, N844L23B, N844L6Z8 matches `.TXT`
within 0.1 dB (BW display precision).
- Mic freq: matches `.TXT` within 1 Hz across active blocks.
## What's NOT yet decoded
- **4-byte variable metadata field (bytes 24:28)**. Not needed for
waveform reconstruction. Speculation: per-block CRC, sub-second
timestamp offset, or a Mic psi(L) count not in the 9 samples.
Punt until something needs it.
- **Geo PVS (TXT col 7, e.g. "0.040 in/s")**. Not stored in the
block; can be approximated as `sqrt(T_peak² + V_peak² + L_peak²)`
but BW's value sometimes differs slightly (probably computed from
waveform-instant samples, not from per-channel peaks). Punt — the
`.h5` consumers don't need PVS as a sample channel.
- **Mic psi(L) value (TXT col 8)**. TXT shows it as a small psi value
derived from the dB measurement. Not in the 9 samples. Could be
derived from `M_peak_count` via the inverse of the dB formula plus
a psi calibration constant. Defer.
## Output shape
`decode_histogram_body` returns the standard 4-channel dict that
mirrors `waveform_codec.decode_waveform_v2`'s output:
```python
{
"Tran": [peak_count_per_interval, ...], # 16-count units (LSB = 0.005 in/s)
"Vert": [..., ...],
"Long": [..., ...],
"MicL": [..., ...], # raw ADC counts
}
```
Run through `waveform_codec.decoded_to_adc_counts` to get 1-count ADC
units (geo ×16, mic passthrough) for the standard `.h5` writer.
For the full per-interval record with frequencies + metadata, use
`decode_histogram_body_full()`.
## Where it's wired
- `minimateplus/event_file_io.py:read_blastware_file()` — first tries
the waveform codec, falls back to the histogram codec when the
waveform preamble isn't present. Same output shape, same
downstream pipeline.
- `scripts/backfill_sidecars.py` — the `has_samples` short-circuit
added during the histogram-codec-pending era still serves as a
defensive guard against truly undecodable files, but no longer
fires for valid histograms.
## Companion reference
- `docs/waveform_codec_re_status.md` — sibling status doc for the
much-more-complex waveform-mode codec.
- `docs/instantel_protocol_reference.md §7.6.2` — historical
protocol-reference entry. Structural framing matches what we
found; per-sample semantics were less documented than the `✅
CONFIRMED` badge suggested. This doc supersedes §7.6.2 where they
conflict on confidence level.