Files
seismo-relay/docs/histogram_codec_re_status.md
T
serversdown 7183b953e4 minimateplus: histogram body codec — FULLY DECODED
The histogram-mode event body is now byte-exact decodable.
Companion to the waveform body codec — together they cover every
event file the watcher forwards.  Cracked in one session via
cross-event correlation against BW's ASCII export.

The §7.6.2 spec in instantel_protocol_reference.md was structurally
correct (32-byte blocks) but the per-sample semantics were
under-documented.  Cross-checking block 130 of N844L6Z8.ZR0H
against its TXT row revealed the layout perfectly:

  slot[0] = 10 (constant marker)
  slot[1] = T_peak_count    (× 0.005 → in/s at Normal range)
  slot[2] = T_halfperiod    (freq_Hz = 512 / halfp)
  slot[3] = V_peak_count
  slot[4] = V_halfperiod
  slot[5] = L_peak_count
  slot[6] = L_halfperiod
  slot[7] = MicL_peak_count (dB via waveform_codec.mic_count_to_db)
  slot[8] = MicL_halfperiod

The `>100 Hz` sentinel is halfperiod ≤ 5 (since 512/5 = 100 Hz).
Mic dB uses the SAME formula as the waveform codec (sign × (81.94
+ 20·log10(|count|))) — they share the mic ADC calibration constant.

Block identification anchor: bytes [22:24] == 0x0000 AND
bytes [28:32] == 1e 0a 00 00.  The tail signature is the most
reliable distinguisher from non-block content in the file.

Files:

  minimateplus/histogram_codec.py (new) — decoder + public API
    matching the waveform codec's shape:
      walk_body(body) -> records
      decode_histogram_body(body) -> {Tran, Vert, Long, MicL}
      decode_histogram_body_full(body) -> [per-interval dicts]
      half_period_to_hz, geo_count_to_ins helpers

  minimateplus/event_file_io.py (modified) — read_blastware_file
    now tries the waveform codec first, falls back to the histogram
    codec on failure.  Same output shape, same downstream pipeline.

  tests/test_histogram_codec.py (new) — 24 regression locks against
    the in-repo fixture corpus, byte-exact against BW ASCII export
    for peaks (all 4 channels), frequencies (all 4 channels,
    including >100 Hz sentinel handling), block framing, and
    segment-ID accounting.

  scripts/backfill_sidecars.py (modified) — the has_samples
    short-circuit added in the histogram-pending era is now a
    pure defensive guard.  Histograms in prod will regen .h5 files
    correctly on the next backfill run.

  docs/histogram_codec_re_status.md (updated) — supersedes the
    earlier "in progress" version with the verified format and
    test-coverage summary.  Notes a few non-essential fields still
    open (4-byte block metadata, Geo PVS, Mic psi(L) — none of
    which are needed for waveform reconstruction).

Total verified coverage: ~3,500 blocks across 5 fixtures, every
field of every block byte-exact against BW.

The watcher-forwarded histogram event corpus on prod (~10,000
events) will now produce correct .h5 sidecars on the next backfill
run.  No additional changes needed to the backfill flow — the
existing tool_version-bump cascade picks them up automatically.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 23:05:13 +00:00

6.2 KiB
Raw Blame History

Histogram body codec — FULLY DECODED (2026-05-20)

Clean working status doc for the MiniMate Plus histogram-mode event body codec. Companion to waveform_codec_re_status.md. The deep historical record (with retractions and dated analyses) lives in docs/instantel_protocol_reference.md §7.6.2; the authoritative implementation lives in minimateplus/histogram_codec.py.

TL;DR

The codec is fully decoded. Every field of every block in the in-repo histogram fixture corpus decodes byte-exact against BW's ASCII export.

24 regression tests pass against ~3,500 blocks across 5 fixtures.

Body format

body = [stream of 32-byte data blocks] + [small trailing remnant]

Each block represents one histogram interval. Block layout:

[0]    0x00                      always-zero tag
[1]    segment_id (uint8)        0x00..0x03 — 256 blocks per segment
[2:4]  block_ctr (uint16 LE)     resets each segment (0x0100, 0x0101, …)
[4:6]  0x000a (uint16 LE)        constant marker (= 10)
[6:8]  T_peak_count   uint16 LE  Tran peak (count × 0.005 → in/s at Normal)
[8:10] T_halfperiod   uint16 LE  Tran half-period in samples
                                  (freq_Hz = 512 / halfp; ≤ 5 means ">100 Hz")
[10:12] V_peak_count  uint16 LE  Vert peak
[12:14] V_halfperiod  uint16 LE  Vert freq half-period
[14:16] L_peak_count  uint16 LE  Long peak
[16:18] L_halfperiod  uint16 LE  Long freq half-period
[18:20] M_peak_count  uint16 LE  MicL peak count
                                  (dB via waveform_codec.mic_count_to_db)
[20:22] M_halfperiod  uint16 LE  MicL freq half-period
[22:24] 0x00 0x00                constant
[24:28] 4-byte variable          purpose unknown — possibly CRC,
                                  timestamp delta, or psi(L) numeric;
                                  not needed for waveform reconstruction
[28:32] 0x1e 0x0a 0x00 0x00      constant block-end signature

Reliable block-identification anchor:

block[22:24] == b"\x00\x00" and block[28:32] == b"\x1e\x0a\x00\x00"

(The 1e 0a 00 00 constant tail is the most distinctive signature.)

Per-channel encoding

Channel Peak encoding Frequency encoding
Tran count × 0.005 = in/s at Normal range freq_Hz = 512 / halfperiod
Vert same same
Long same same
MicL count → dB via mic_count_to_db(count) (same formula as waveform codec) same

>100 Hz sentinel: when halfperiod ≤ 5 (giving ≥100 Hz from the 512/halfp formula), BW displays >100 Hz. Codec's half_period_to_hz returns None in this range.

Verified facts (cross-checked against fixture corpus)

Example: N844L6Z8.ZR0H block 130 → all 8 decoded fields byte-exact:

binary samples [10, 6, 24, 4, 18, 5, 21, 5, 9]
TXT row        [0.030, 21, 0.020, 28, 0.025, 24, 0.040, 0.000, 95.92, 57]

slot[0] = 10                                  marker
slot[1] = 6  × 0.005 = 0.030 in/s         ✓ T_peak
slot[2] = 24 → 512/24 = 21.3 → 21 Hz      ✓ T_freq
slot[3] = 4  × 0.005 = 0.020 in/s         ✓ V_peak
slot[4] = 18 → 512/18 = 28.4 → 28 Hz      ✓ V_freq
slot[5] = 5  × 0.005 = 0.025 in/s         ✓ L_peak
slot[6] = 21 → 512/21 = 24.4 → 24 Hz      ✓ L_freq
slot[7] = 5  → 81.94 + 20·log10(5) = 95.92 dB  ✓ M_peak
slot[8] = 9  → 512/9 = 56.9 → 57 Hz       ✓ M_freq

Verified test coverage

tests/test_histogram_codec.py (24 tests):

  • Block walking: yields one record per .TXT interval ± 1 (off-by-one at the tail when recording was stopped mid-write). Segment-ID groups of 256 blocks confirmed.
  • Geo peaks: every block of N844L20G, N844L6Z8, N844L6XE, N844L23B matches .TXT within the 0.0005 in/s quantization step.
  • Geo freqs: every block of N844L6Z8 and N844L6XE matches .TXT within 1 Hz (BW display rounds). >100 Hz sentinel handled correctly.
  • Mic dB: every block of N844L6XE, N844L23B, N844L6Z8 matches .TXT within 0.1 dB (BW display precision).
  • Mic freq: matches .TXT within 1 Hz across active blocks.

What's NOT yet decoded

  • 4-byte variable metadata field (bytes 24:28). Not needed for waveform reconstruction. Speculation: per-block CRC, sub-second timestamp offset, or a Mic psi(L) count not in the 9 samples. Punt until something needs it.
  • Geo PVS (TXT col 7, e.g. "0.040 in/s"). Not stored in the block; can be approximated as sqrt(T_peak² + V_peak² + L_peak²) but BW's value sometimes differs slightly (probably computed from waveform-instant samples, not from per-channel peaks). Punt — the .h5 consumers don't need PVS as a sample channel.
  • Mic psi(L) value (TXT col 8). TXT shows it as a small psi value derived from the dB measurement. Not in the 9 samples. Could be derived from M_peak_count via the inverse of the dB formula plus a psi calibration constant. Defer.

Output shape

decode_histogram_body returns the standard 4-channel dict that mirrors waveform_codec.decode_waveform_v2's output:

{
    "Tran": [peak_count_per_interval, ...],   # 16-count units (LSB = 0.005 in/s)
    "Vert": [..., ...],
    "Long": [..., ...],
    "MicL": [..., ...],                       # raw ADC counts
}

Run through waveform_codec.decoded_to_adc_counts to get 1-count ADC units (geo ×16, mic passthrough) for the standard .h5 writer.

For the full per-interval record with frequencies + metadata, use decode_histogram_body_full().

Where it's wired

  • minimateplus/event_file_io.py:read_blastware_file() — first tries the waveform codec, falls back to the histogram codec when the waveform preamble isn't present. Same output shape, same downstream pipeline.
  • scripts/backfill_sidecars.py — the has_samples short-circuit added during the histogram-codec-pending era still serves as a defensive guard against truly undecodable files, but no longer fires for valid histograms.

Companion reference

  • docs/waveform_codec_re_status.md — sibling status doc for the much-more-complex waveform-mode codec.
  • docs/instantel_protocol_reference.md §7.6.2 — historical protocol-reference entry. Structural framing matches what we found; per-sample semantics were less documented than the ✅ CONFIRMED badge suggested. This doc supersedes §7.6.2 where they conflict on confidence level.