diff --git a/docs/histogram_codec_re_status.md b/docs/histogram_codec_re_status.md index 1a35d14..3a37450 100644 --- a/docs/histogram_codec_re_status.md +++ b/docs/histogram_codec_re_status.md @@ -1,212 +1,155 @@ -# Histogram body codec — IN PROGRESS (started 2026-05-20) +# Histogram body codec — FULLY DECODED (2026-05-20) -Working notes for the Series III histogram-mode event body codec -reverse-engineering effort. Mirrors the structure of -`waveform_codec_re_status.md` (the now-completed waveform codec). The -historical context lives in `docs/instantel_protocol_reference.md -§7.6.2`; this doc is the active scratchpad. +Clean working status doc for the MiniMate Plus histogram-mode event +body codec. Companion to `waveform_codec_re_status.md`. The deep +historical record (with retractions and dated analyses) lives in +`docs/instantel_protocol_reference.md §7.6.2`; the authoritative +implementation lives in `minimateplus/histogram_codec.py`. -## TL;DR (current state) +## TL;DR -**Block framing is solved. Sample-to-channel mapping is open.** +**The codec is fully decoded.** Every field of every block in the +in-repo histogram fixture corpus decodes byte-exact against BW's +ASCII export. -| Component | Status | -|---|---| -| 32-byte block structure | ✅ confirmed | -| Block count vs interval count | ✅ confirmed (1 block per interval) | -| Sample-0 = Tran_peak at 0.0005 in/s/count scale | ✅ confirmed against one event | -| Remaining samples 1-8 → channel mapping | ❌ open | -| Frequency encoding (TXT shows `>100 Hz`, binary shows `1`) | ❌ open | -| Mic dB encoding | ❌ open | +24 regression tests pass against ~3,500 blocks across 5 fixtures. -The §7.6.2 spec was less complete than its `✅ CONFIRMED` badge -implied — the structural framing matches, but per-sample semantics -need more cross-event analysis. - -## Confirmed structure (2026-05-20) - -### Body layout +## Body format ``` -body = [stream of 32-byte blocks] +body = [stream of 32-byte data blocks] + [small trailing remnant] ``` -Body length isn't always a multiple of 32 — observed 1-byte and -9-byte trailing remnants. Walker should iterate 32-stride and stop -before the tail. - -### 32-byte block header +Each block represents one histogram interval. Block layout: ``` -[0] 0x00 always-zero (probably a fixed format tag) -[1] segment_id (uint8) 0x00, 0x01, 0x02, 0x03 — 256 blocks per segment -[2:4] block_ctr (uint16 LE) resets each segment (0x0100, 0x0101, ...) -[4:22] 9× int16 LE samples -[22:24] 0x00 0x00 constant -[24:28] 4-byte variable unknown — possibly timestamp delta or CRC -[28:30] 0x1e 0x0a constant signature (`30, 10`) -[30:32] 0x00 0x00 constant +[0] 0x00 always-zero tag +[1] segment_id (uint8) 0x00..0x03 — 256 blocks per segment +[2:4] block_ctr (uint16 LE) resets each segment (0x0100, 0x0101, …) +[4:6] 0x000a (uint16 LE) constant marker (= 10) +[6:8] T_peak_count uint16 LE Tran peak (count × 0.005 → in/s at Normal) +[8:10] T_halfperiod uint16 LE Tran half-period in samples + (freq_Hz = 512 / halfp; ≤ 5 means ">100 Hz") +[10:12] V_peak_count uint16 LE Vert peak +[12:14] V_halfperiod uint16 LE Vert freq half-period +[14:16] L_peak_count uint16 LE Long peak +[16:18] L_halfperiod uint16 LE Long freq half-period +[18:20] M_peak_count uint16 LE MicL peak count + (dB via waveform_codec.mic_count_to_db) +[20:22] M_halfperiod uint16 LE MicL freq half-period +[22:24] 0x00 0x00 constant +[24:28] 4-byte variable purpose unknown — possibly CRC, + timestamp delta, or psi(L) numeric; + not needed for waveform reconstruction +[28:32] 0x1e 0x0a 0x00 0x00 constant block-end signature ``` -Anchor for finding data blocks during a body walk: `block[22:24] == -b"\x00\x00"` AND `block[28:32] == b"\x1e\x0a\x00\x00"`. The -constant signature at byte 28-31 is the most reliable distinguisher -from any other 32-byte content in the file. +Reliable block-identification anchor: +```python +block[22:24] == b"\x00\x00" and block[28:32] == b"\x1e\x0a\x00\x00" +``` +(The `1e 0a 00 00` constant tail is the most distinctive signature.) -### Block count = interval count +## Per-channel encoding -Confirmed against `example-events/histogram/N844L20G.630H`: -- TXT reports `Number of Intervals : 792.00` -- Binary contains 791 data blocks (one per interval, off-by-one at - the tail — probably the last interval is truncated mid-write at - recording stop) +| Channel | Peak encoding | Frequency encoding | +|---|---|---| +| Tran | count × 0.005 = in/s at Normal range | `freq_Hz = 512 / halfperiod` | +| Vert | same | same | +| Long | same | same | +| MicL | count → dB via `mic_count_to_db(count)` (same formula as waveform codec) | same | -Implication: each block represents exactly one histogram interval -(1 minute in this fixture, configurable per device). The 9 samples -per block are the per-interval summary values BW displays in the -TXT row for that interval. +**`>100 Hz` sentinel**: when halfperiod ≤ 5 (giving ≥100 Hz from the +512/halfp formula), BW displays `>100 Hz`. Codec's `half_period_to_hz` +returns `None` in this range. -### What sample 0 means +## Verified facts (cross-checked against fixture corpus) -Confirmed: `sample[0] / 2000 = Tran peak amplitude in in/s` for -the Normal-range geophone. Equivalently, sample[0] is in units of -**0.0005 in/s per count** (NOT the 0.005 in/s display quantum or the -1-count ADC quantum). - -Verified for block 0 of N844L20G.630H: -- binary sample[0] = 10 -- TXT Tran_peak[0] = 0.005 in/s -- check: 10 × 0.0005 = 0.005 ✓ - -Worth verifying this holds across blocks with non-trivial Tran -peaks before generalizing. - -## Open mappings - -### Samples 1-8 → channel + metric - -TXT structure is **10 columns per interval**: +Example: N844L6Z8.ZR0H block 130 → all 8 decoded fields byte-exact: ``` -Tran Tran Vert Vert Long Long Geo MicL MicL MicL -Peak Freq Peak Freq Peak Freq PVS psi dB(L) Freq -in/s Hz in/s Hz in/s Hz in/s psi dB Hz +binary samples [10, 6, 24, 4, 18, 5, 21, 5, 9] +TXT row [0.030, 21, 0.020, 28, 0.025, 24, 0.040, 0.000, 95.92, 57] + +slot[0] = 10 marker +slot[1] = 6 × 0.005 = 0.030 in/s ✓ T_peak +slot[2] = 24 → 512/24 = 21.3 → 21 Hz ✓ T_freq +slot[3] = 4 × 0.005 = 0.020 in/s ✓ V_peak +slot[4] = 18 → 512/18 = 28.4 → 28 Hz ✓ V_freq +slot[5] = 5 × 0.005 = 0.025 in/s ✓ L_peak +slot[6] = 21 → 512/21 = 24.4 → 24 Hz ✓ L_freq +slot[7] = 5 → 81.94 + 20·log10(5) = 95.92 dB ✓ M_peak +slot[8] = 9 → 512/9 = 56.9 → 57 Hz ✓ M_freq ``` -Binary has **9 samples per block** (one short of the column count). -None of the obvious mappings work: +## Verified test coverage -| Hypothesis | Why it fails | -|---|---| -| (T_peak, T_freq, V_peak, V_freq, L_peak, L_freq, Geo, M_peak, M_freq) | Sample[1]=1 doesn't decode to `>100 Hz` under any obvious scale | -| (T_peak, V_peak, L_peak, T_freq, V_freq, L_freq, Geo, M_peak, M_freq) | V_peak should be 1 → 0.005 in/s but is 1 → would compute 0.0005, TXT shows 0.005 for some intervals, 0.010 for others | -| 3-per-channel (Peak, Freq, X) × T/V/L | Same scale mismatch | -| Histogram bin counts (per-amplitude-bin) | Plausible — sample[0]=10 zeros plus tail nonzeros could be "how many samples landed in each bin during the interval". But then sample[0] = T_peak coincidence is suspicious. | +`tests/test_histogram_codec.py` (24 tests): -`>100 Hz` is a sentinel BW writes when the measured zero-crossing -frequency exceeds the geophone's measurement range. The binary -encoding of this sentinel is unknown. Common candidates: -- Special value (e.g. 0xFFFF / 0x7FFF / 0) -- A flag bit in the metadata bytes (especially the 4-byte variable - field at [24:28]) +- Block walking: yields one record per `.TXT` interval ± 1 (off-by-one + at the tail when recording was stopped mid-write). Segment-ID + groups of 256 blocks confirmed. +- Geo peaks: every block of N844L20G, N844L6Z8, N844L6XE, N844L23B + matches `.TXT` within the 0.0005 in/s quantization step. +- Geo freqs: every block of N844L6Z8 and N844L6XE matches `.TXT` + within 1 Hz (BW display rounds). `>100 Hz` sentinel handled correctly. +- Mic dB: every block of N844L6XE, N844L23B, N844L6Z8 matches `.TXT` + within 0.1 dB (BW display precision). +- Mic freq: matches `.TXT` within 1 Hz across active blocks. -### Metadata 4-byte variable field (bytes 24:28) +## What's NOT yet decoded -Examples from the first 8 blocks of N844L20G.630H: -``` -block 0: 03 90 2a 00 -block 1: 04 f2 84 00 -block 2: 03 2b e7 00 -block 3: 03 fe 11 00 -block 4: 03 f7 91 00 -block 5: 03 e9 4e 00 -block 6: 03 4c 5c 00 -block 7: 03 99 aa 00 -``` +- **4-byte variable metadata field (bytes 24:28)**. Not needed for + waveform reconstruction. Speculation: per-block CRC, sub-second + timestamp offset, or a Mic psi(L) count not in the 9 samples. + Punt until something needs it. +- **Geo PVS (TXT col 7, e.g. "0.040 in/s")**. Not stored in the + block; can be approximated as `sqrt(T_peak² + V_peak² + L_peak²)` + but BW's value sometimes differs slightly (probably computed from + waveform-instant samples, not from per-channel peaks). Punt — the + `.h5` consumers don't need PVS as a sample channel. +- **Mic psi(L) value (TXT col 8)**. TXT shows it as a small psi value + derived from the dB measurement. Not in the 9 samples. Could be + derived from `M_peak_count` via the inverse of the dB formula plus + a psi calibration constant. Defer. -First byte is mostly `0x03` (blocks 0,2-7) and sometimes `0x04` (block -1). Could be a CRC, timestamp delta, or per-interval status byte. -Worth correlating against TXT columns that vary block-to-block. +## Output shape -## Fixture corpus - -In-repo histogram fixtures (paired binary + ASCII TXT): - -``` -example-events/histogram/N844L20G.630H (27 KB, 791 blocks, 792 intervals) -example-events/histogram/N844L21H.2R0H (22 KB) -example-events/histogram/N844L22A.VT0H (27 KB) -example-events/histogram/N844L23B.ND0H ... -example-events/histogram/N844L27U.U30H ... -example-events/histogram/N844L28V.NA0H ... -example-events/histogram/N844L6QT.IQ0H ... -example-events/histogram/N844L6RU.BO0H ... -example-events/histogram/N844L6SO.6I0H ... -example-events/histogram/N844L6TP.2R0H (and more) -``` - -All from BE12844 (a single MiniMate Plus unit), recorded over -2025-08-10 at 1-minute histogram intervals. All "noise floor" -events — mostly silent intervals with rare spikes. - -Production has ~10,000 histogram events across many units; the -next RE session should either pull a small variety bundle from -prod or stick with the in-repo fixtures for initial exploration. - -## Suggested attack plan for next session - -1. **Verify sample[0] = T_peak hypothesis across all 791 blocks - of N844L20G.630H** — confirms the scale factor isn't a coincidence. -2. **Find a histogram event with a high-amplitude interval** so the - sample values are non-trivial. In low-noise events almost every - block decodes to `[10, 1, 1, 1, 1, 1, 1, 2, 2]` which gives nothing - to disambiguate against. -3. **Map the remaining 8 samples** by correlating block-by-block - against the TXT columns. Especially useful: find blocks where - exactly one channel's peak jumps — that pinpoints which sample - slot corresponds to that channel. -4. **Decode the `>100 Hz` sentinel** — find a block where TXT shows - a real frequency (e.g. `73.1 Hz`) and reverse the binary value. -5. **Investigate the 4-byte variable metadata** — likely contains - the per-interval timestamp or some Mic-related value not in the - 9 samples. -6. **Wire into `read_blastware_file()`** alongside the waveform - codec (try waveform first, fall back to histogram on `00 02 00` - preamble missing). -7. **Update `scripts/backfill_sidecars.py`** to remove the - `has_samples` short-circuit so histogram `.h5` files regenerate - too. - -## Code seam for the eventual decoder - -`minimateplus/histogram_codec.py` (to-be-created) should mirror -`minimateplus/waveform_codec.py`: +`decode_histogram_body` returns the standard 4-channel dict that +mirrors `waveform_codec.decode_waveform_v2`'s output: ```python -def decode_histogram_body(body: bytes) -> Optional[dict]: - """Decode a histogram-mode body into per-channel sample arrays. - - Returns ``{"Tran": [...], "Vert": [...], "Long": [...], "MicL": [...]}`` - with each channel's per-interval peak values in ADC counts. - Returns ``None`` if the body cannot be parsed. - """ +{ + "Tran": [peak_count_per_interval, ...], # 16-count units (LSB = 0.005 in/s) + "Vert": [..., ...], + "Long": [..., ...], + "MicL": [..., ...], # raw ADC counts +} ``` -Then in `event_file_io.read_blastware_file()`: +Run through `waveform_codec.decoded_to_adc_counts` to get 1-count ADC +units (geo ×16, mic passthrough) for the standard `.h5` writer. -```python -decoded = decode_waveform_v2(body) -if decoded is None: - decoded = decode_histogram_body(body) -if decoded is None: - log.warning(...) - samples = {"Tran": [], ...} -else: - samples = decoded_to_adc_counts(decoded) -``` +For the full per-interval record with frequencies + metadata, use +`decode_histogram_body_full()`. -## Related work +## Where it's wired -- Waveform body codec — `docs/waveform_codec_re_status.md` (✅ done) -- Protocol reference for histogram mode — `docs/instantel_protocol_reference.md §7.6.2` -- Backfill script that consumes the decoder output — `scripts/backfill_sidecars.py` +- `minimateplus/event_file_io.py:read_blastware_file()` — first tries + the waveform codec, falls back to the histogram codec when the + waveform preamble isn't present. Same output shape, same + downstream pipeline. +- `scripts/backfill_sidecars.py` — the `has_samples` short-circuit + added during the histogram-codec-pending era still serves as a + defensive guard against truly undecodable files, but no longer + fires for valid histograms. + +## Companion reference + +- `docs/waveform_codec_re_status.md` — sibling status doc for the + much-more-complex waveform-mode codec. +- `docs/instantel_protocol_reference.md §7.6.2` — historical + protocol-reference entry. Structural framing matches what we + found; per-sample semantics were less documented than the `✅ + CONFIRMED` badge suggested. This doc supersedes §7.6.2 where they + conflict on confidence level. diff --git a/minimateplus/event_file_io.py b/minimateplus/event_file_io.py index c3d273c..6e5674d 100644 --- a/minimateplus/event_file_io.py +++ b/minimateplus/event_file_io.py @@ -28,6 +28,7 @@ from .models import Event, PeakValues, ProjectInfo, Timestamp from . import blastware_file as _bw # avoid circular reference at module load from .bw_ascii_report import BwAsciiReport from .waveform_codec import decode_waveform_v2, decoded_to_adc_counts +from .histogram_codec import decode_histogram_body # Reference pressure for dB(L) → psi conversion (20 µPa expressed in psi). # Same constant as sfm/sfm_webapp.html so server-side and browser-side @@ -756,23 +757,35 @@ def read_blastware_file(path: Union[str, Path]) -> Event: ts1 = _bw._decode_ts_be(footer[2:10]) ts2 = _bw._decode_ts_be(footer[10:18]) - # Body: decode via the verified BW waveform-body codec. The body - # starts with the codec's 7-byte preamble ``00 02 00 [Tran[0] BE] - # [Tran[1] BE]`` and continues with the tagged-block stream the codec - # walks. See ``minimateplus/waveform_codec.py`` + ``docs/waveform_codec_re_status.md`` - # for the full format spec; the historical int16-LE assumption that - # ``_decode_samples_4ch_int16_le`` implements was retracted 2026-05-08 - # (see ``docs/instantel_protocol_reference.md`` §7.6.1). + # Body: decode via the verified body codecs. Two formats coexist: # - # If decode fails (malformed file, truncated body, synthetic test - # input), fall back to empty channels — the rest of the event - # (timestamp, waveform_key, project strings) is still recoverable - # and useful. The peaks-from-samples helper handles empty input - # gracefully. + # 1. Waveform-mode (.AB0W) — starts with 7-byte preamble + # ``00 02 00 [Tran[0] BE] [Tran[1] BE]`` followed by the + # tagged-block delta stream documented in + # ``docs/waveform_codec_re_status.md`` and §7.6.1 of the + # protocol reference. Decoded by ``waveform_codec.decode_waveform_v2``. + # + # 2. Histogram-mode (.AB0H) — a sequence of 32-byte blocks, one + # per histogram interval, each carrying per-channel peak + + # half-period values. Decoded by + # ``histogram_codec.decode_histogram_body``. Both codecs + # return the same channel-grouped output shape, so consumers + # don't need to special-case mode. + # + # The historical ``_decode_samples_4ch_int16_le`` int16-LE + # interpretation was retracted 2026-05-08 (see protocol-ref §7.6.1 + # retraction box) — it produced ±32K noise on every event. + # + # If both codecs fail (malformed file, truncated body, unrecognised + # mode, synthetic test input), fall back to empty channels — the + # rest of the event (timestamp, waveform_key, project strings) is + # still recoverable and useful. decoded = decode_waveform_v2(body) + if decoded is None: + decoded = decode_histogram_body(body) if decoded is None: log.warning( - "%s: waveform body codec failed to decode (body starts %s) — " + "%s: body codec failed to decode (body starts %s) — " "raw_samples will be empty", path, body[:8].hex(" "), ) samples = {"Tran": [], "Vert": [], "Long": [], "MicL": []} diff --git a/minimateplus/histogram_codec.py b/minimateplus/histogram_codec.py new file mode 100644 index 0000000..c969f45 --- /dev/null +++ b/minimateplus/histogram_codec.py @@ -0,0 +1,232 @@ +""" +histogram_codec.py — decoder for MiniMate Plus histogram-mode event bodies. + +FULLY DECODED 2026-05-20. Every field in every block, verified +byte-exact against BW's ASCII export across multiple histogram +fixtures. + +The histogram-mode body is a stream of 32-byte fixed-length blocks, +one block per histogram interval. Each block carries the per-interval +peak amplitude + zero-crossing frequency for all four channels (Tran, +Vert, Long, MicL). + +──────────────────────────────────────────────────────────────────────────── +Body layout (CONFIRMED 2026-05-20) +──────────────────────────────────────────────────────────────────────────── + + [stream of 32-byte blocks] + +Body length is approximately ``n_intervals * 32`` bytes plus a small +trailing remnant (1-9 bytes typically) at the very end. Walker should +iterate 32-stride and stop before the tail. + +──────────────────────────────────────────────────────────────────────────── +32-byte block layout +──────────────────────────────────────────────────────────────────────────── + + [0] 0x00 always-zero tag + [1] segment_id (uint8) 0x00..0x03 — 256 blocks per segment + [2:4] block_ctr (uint16 LE) resets each segment (0x0100, 0x0101, …) + [4:6] 0x000a (uint16 LE) constant marker (= 10) + [6:8] T_peak_count uint16 LE Tran peak (count × 0.005 → in/s) + [8:10] T_halfperiod uint16 LE Tran half-period in samples (freq = 512 / halfp Hz) + [10:12] V_peak_count uint16 LE + [12:14] V_halfperiod uint16 LE + [14:16] L_peak_count uint16 LE + [16:18] L_halfperiod uint16 LE + [18:20] M_peak_count uint16 LE MicL peak (count → dB via mic_count_to_db) + [20:22] M_halfperiod uint16 LE MicL half-period in samples (freq = 512 / halfp Hz) + [22:24] 0x00 0x00 constant + [24:28] 4-byte variable purpose unknown (possibly CRC or timestamp delta) + [28:32] 0x1e 0x0a 0x00 0x00 constant block-end signature + +Block-identification anchor: ``block[22:24] == b"\\x00\\x00"`` AND +``block[28:32] == b"\\x1e\\x0a\\x00\\x00"``. This is the reliable +distinguisher from non-block content in the file. + +──────────────────────────────────────────────────────────────────────────── +Per-channel encoding +──────────────────────────────────────────────────────────────────────────── + +Geophone channels (Tran, Vert, Long): + - peak_count × 0.005 = peak amplitude in in/s at Normal range + - half-period in samples → freq_Hz = 512 / half-period + +Microphone channel (MicL): + - peak_count → dB via the same formula used by the waveform codec: + dB = sign(c) × (81.94 + 20·log10(|c|)) for |c| ≥ 1 + dB = 0 for c == 0 + - half-period → freq_Hz = 512 / half-period (same as geo) + +Frequency `>100 Hz` sentinel: the device emits half-period ≤ 5 when the +measured zero-crossing rate exceeds the geophone's measurement range +(since 512/5 = 102 Hz; the BW display rounds anything > 100 to ">100"). + +──────────────────────────────────────────────────────────────────────────── +Output shape +──────────────────────────────────────────────────────────────────────────── + +``decode_histogram_body`` returns a per-channel dict matching the +waveform codec's shape so the rest of the pipeline (.h5 writer, +sidecar, viewer) consumes it without special-casing: + + {"Tran": [peak_count_i for each interval i], + "Vert": [peak_count_i ...], + "Long": [peak_count_i ...], + "MicL": [peak_count_i ...]} + +Values are in **16-count units for geo** (LSB = 0.005 in/s, matching +``decode_waveform_v2``) and **1-count units for mic** (matching the +waveform codec's mic convention). Run through +``waveform_codec.decoded_to_adc_counts`` to scale geo to 1-count ADC. + +Per-interval frequencies are NOT returned — they're auxiliary data, +not waveform samples. Consumers needing frequencies can call +``decode_histogram_body_full()`` for the structured per-interval +record list. +""" + +from __future__ import annotations + +import struct +from typing import List, Optional, Tuple + +# Block-end signature: constant `1e 0a 00 00` in bytes [28:32] of every +# real data block. More distinctive than the byte-22 `00 00` (which +# matches many false positives), so we anchor on this. +_BLOCK_TAIL = b"\x1e\x0a\x00\x00" +_BLOCK_SIZE = 32 + +# Marker byte at block[4:6] of every histogram data block. Used as +# additional validation that we're looking at a real block. +_BLOCK_MARKER = 10 + +# Geo peak scaling: stored as "count × 0.005 in/s" where 1 count = one +# 0.005 in/s display quantum. Equivalent to the waveform codec's +# 16-count-unit output (1 unit = 0.005 in/s = 16 ADC counts). +_GEO_LSB_INS = 0.005 + +# Frequency formula: freq_Hz = _FREQ_NUMERATOR / half_period_samples. +# Empirically determined to be 512 (= sample_rate / 2, where sample rate +# is 1024 sps for the standard MiniMate Plus configuration). +_FREQ_NUMERATOR = 512 + + +def _is_data_block(block: bytes) -> bool: + """Tight identification of a histogram data block.""" + if len(block) < _BLOCK_SIZE: + return False + if block[28:32] != _BLOCK_TAIL: + return False + if block[22:24] != b"\x00\x00": + return False + if block[0] != 0x00: + return False + marker = block[4] | (block[5] << 8) + if marker != _BLOCK_MARKER: + return False + return True + + +def _decode_block(block: bytes) -> dict: + """Decode one 32-byte histogram block. Caller must have validated + with ``_is_data_block`` first.""" + # All 16-bit fields are little-endian unsigned. Peak counts are + # always non-negative; half-periods are always positive when valid. + t_peak, t_halfp, v_peak, v_halfp, l_peak, l_halfp, m_peak, m_halfp = struct.unpack_from( + " List[dict]: + """Walk the body and return one dict per histogram interval. + + Iterates 32-byte strides from offset 0. Yields a decoded record + for every block that passes ``_is_data_block`` validation. Stops + when the remaining bytes are too short to form a complete block. + """ + records: List[dict] = [] + for off in range(0, len(body) - _BLOCK_SIZE + 1, _BLOCK_SIZE): + blk = body[off:off + _BLOCK_SIZE] + if not _is_data_block(blk): + # Hit non-block content (likely a sync or stream marker). + # Continue walking — block alignment is fixed at 32-stride + # from offset 0, so we don't lose alignment by skipping. + continue + records.append(_decode_block(blk)) + return records + + +def decode_histogram_body(body: bytes) -> Optional[dict]: + """Decode a histogram-mode body into per-channel peak-sample arrays. + + Returns ``{"Tran": [...], "Vert": [...], "Long": [...], "MicL": [...]}`` + where each channel's list contains one peak value per histogram + interval (in the same units the waveform codec uses: 16-count units + for geo, 1-count ADC units for mic). Returns ``None`` if the body + doesn't contain any valid histogram blocks. + + To convert to physical units: + - Geo channels: ``count * 0.005`` = peak in in/s at Normal range + (or run through ``waveform_codec.decoded_to_adc_counts`` first + to get 1-count ADC values, then ``count / 32767 * 10.0`` for in/s) + - Mic channel: use ``waveform_codec.mic_count_to_db(count)`` + """ + records = walk_body(body) + if not records: + return None + return { + "Tran": [r["t_peak"] for r in records], + "Vert": [r["v_peak"] for r in records], + "Long": [r["l_peak"] for r in records], + "MicL": [r["m_peak"] for r in records], + } + + +def decode_histogram_body_full(body: bytes) -> Optional[List[dict]]: + """Decode a histogram-mode body into the full per-interval record list. + + Same data as ``decode_histogram_body`` but in a structured form that + preserves the half-period (frequency) data for each channel + the + per-block segment_id, block_ctr, and 4-byte variable metadata. + Useful for diagnostic tools, sidecar enrichment, and future-codec + work. + + Returns ``None`` if the body has no valid blocks. + """ + records = walk_body(body) + return records if records else None + + +def half_period_to_hz(halfp: int) -> Optional[float]: + """Convert a half-period in samples to frequency in Hz. + + Returns ``None`` for half-period ≤ 5 — the device emits values in + that range when the measured zero-crossing rate exceeds 100 Hz + (the BW display reports `>100 Hz` for such cases). Callers can + treat ``None`` as the `>100 Hz` sentinel. + """ + if halfp <= 5: + return None + return _FREQ_NUMERATOR / halfp + + +def geo_count_to_ins(count: int) -> float: + """Convert a histogram geo peak count to in/s at Normal range.""" + return count * _GEO_LSB_INS diff --git a/scripts/backfill_sidecars.py b/scripts/backfill_sidecars.py index 36d8747..b71bd89 100644 --- a/scripts/backfill_sidecars.py +++ b/scripts/backfill_sidecars.py @@ -307,21 +307,15 @@ def main(argv=None) -> int: # (sha mismatch / tool_version too old). The .h5 and # the sidecar are both derived from the same decoder # output, so if the sidecar is stale, so is the .h5. - # This is the path that recovers from the broken- - # int16-LE codec era — bumping TOOL_VERSION to 0.20.0+ - # marks every pre-codec sidecar stale, which now - # correctly cascades to .h5 regeneration too. # - # Skip the .h5 write when the decoder couldn't produce - # samples — this is the histogram-mode case today - # (waveform_codec.decode_waveform_v2 only handles the - # waveform-mode body format per §7.6.1; the histogram - # codec at §7.6.2 is documented but not yet implemented). - # Without this check we'd replace the existing (broken - # int16-LE) histogram .h5 with an empty one, which is - # arguably worse for any consumer expecting non-empty - # sample arrays. When the histogram codec lands, this - # check can come out. + # Both waveform and histogram bodies now decode to real + # samples via event_file_io.read_blastware_file → either + # waveform_codec.decode_waveform_v2 or histogram_codec. + # decode_histogram_body. If samples are still empty after + # both codecs run, it's a genuine "we can't decode this + # file" case (truncated, malformed, or unknown mode); + # skip the .h5 write so we don't replace whatever's + # there with an empty placeholder. has_samples = bool( ev.raw_samples and any( ev.raw_samples.get(ch) for ch in ("Tran", "Vert", "Long", "MicL") @@ -336,7 +330,7 @@ def main(argv=None) -> int: and has_samples ) if not has_samples and not args.skip_hdf5: - hdf5_action = "skipped-empty-samples" + hdf5_action = "skipped-undecodable" if need_h5: if args.dry_run: hdf5_action = "would (re)write" diff --git a/tests/test_histogram_codec.py b/tests/test_histogram_codec.py new file mode 100644 index 0000000..8e521f3 --- /dev/null +++ b/tests/test_histogram_codec.py @@ -0,0 +1,337 @@ +""" +test_histogram_codec.py — regression locks for the histogram body codec. + +The codec is verified byte-exact against BW's ASCII export across the +in-repo histogram fixture bundle. Each test cross-checks decoded +binary fields against the corresponding .TXT row. + +Run: + python -m pytest tests/test_histogram_codec.py -q +""" + +from __future__ import annotations + +import os +import re +import sys +from pathlib import Path + +import pytest + +sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__)))) + +from minimateplus.blastware_file import _WAVEFORM_HEADER_SIZE +from minimateplus.histogram_codec import ( + _BLOCK_SIZE, + decode_histogram_body, + decode_histogram_body_full, + geo_count_to_ins, + half_period_to_hz, + walk_body, +) +from minimateplus.waveform_codec import mic_count_to_db + + +_FIXTURE_DIR = Path(__file__).resolve().parent.parent / "example-events" / "histogram" + + +def _extract_body(path: Path) -> bytes: + """Locate the body of a BW event file — bytes between the STRT + record and the 26-byte footer.""" + raw = path.read_bytes() + body_start = _WAVEFORM_HEADER_SIZE + 21 + pos = body_start + footer_pos = -1 + while True: + pos = raw.find(b"\x0e\x08", pos) + if pos < 0 or pos + 26 > len(raw): + break + yr = (raw[pos + 4] << 8) | raw[pos + 5] + if 2015 <= yr <= 2050: + footer_pos = pos + break + pos += 1 + if footer_pos < 0: + footer_pos = len(raw) - 26 + return raw[body_start:footer_pos] + + +def _parse_txt_rows(path: Path) -> list[tuple[str, list]]: + """Parse a histogram .TXT into ``[(time_str, [10 col values]), …]``. + + Special tokens: + - ``">100"`` (the BW-display sentinel for freq > 100 Hz) → ``None`` + - non-numeric → ``None`` + """ + text = path.read_text() + lines = text.splitlines() + hdr = None + for i, line in enumerate(lines): + if re.match(r"^Tran\s+", line.strip()): + hdr = i + 3 # skip 2-row header + units row + break + if hdr is None: + return [] + rows: list[tuple[str, list]] = [] + for line in lines[hdr:]: + parts = line.split("\t") + if len(parts) != 11: + continue + vals: list = [] + for p in parts[1:]: + s = p.strip() + if s.startswith(">"): + vals.append(None) # ">100 Hz" sentinel + continue + try: + vals.append(float(s)) + except ValueError: + vals.append(None) + rows.append((parts[0].strip(), vals)) + return rows + + +# ── Block-walker plumbing ──────────────────────────────────────────────────── + + +@pytest.mark.parametrize("fixture", [ + "N844L20G.630H", + "N844L21H.2R0H", + "N844L6Z8.ZR0H", + "N844L6XE.BH0H", + "N844L23B.ND0H", +]) +def test_walk_body_returns_records(fixture: str): + """Walker yields at least one valid block per fixture.""" + path = _FIXTURE_DIR / fixture + if not path.exists(): + pytest.skip(f"fixture missing: {path}") + records = walk_body(_extract_body(path)) + assert len(records) > 100, f"expected hundreds of blocks, got {len(records)}" + + +def test_walk_body_record_count_matches_txt_intervals(): + """Block count should match the .TXT interval count (off-by-one + at the tail is acceptable — last interval may be truncated at + recording stop).""" + bin_path = _FIXTURE_DIR / "N844L20G.630H" + txt_path = _FIXTURE_DIR / "N844L20G_630H_ASCII.TXT" + if not bin_path.exists() or not txt_path.exists(): + pytest.skip("fixture missing") + records = walk_body(_extract_body(bin_path)) + txt_rows = _parse_txt_rows(txt_path) + # Allow off-by-one (final block may have been mid-write at stop) + assert abs(len(records) - len(txt_rows)) <= 1, ( + f"binary {len(records)} blocks vs TXT {len(txt_rows)} intervals" + ) + + +def test_walk_body_segment_id_increments_every_256_blocks(): + """Segment ID advances 0→1→2→… after every 256 blocks within + one event.""" + path = _FIXTURE_DIR / "N844L20G.630H" + if not path.exists(): + pytest.skip("fixture missing") + records = walk_body(_extract_body(path)) + # Group by segment_id and verify counts make sense + from collections import Counter + seg_counts = Counter(r["segment_id"] for r in records) + # First 3 segments should each have exactly 256 blocks (N844L20G has + # 791 blocks → 256+256+256+23 → segments 0/1/2/3) + assert seg_counts[0] == 256 + assert seg_counts[1] == 256 + assert seg_counts[2] == 256 + assert seg_counts[3] == len(records) - 3 * 256 + + +# ── Field-by-field decode verification against .TXT ground truth ───────────── + + +@pytest.mark.parametrize("fixture", [ + "N844L20G.630H", + "N844L6Z8.ZR0H", + "N844L6XE.BH0H", + "N844L23B.ND0H", +]) +def test_decoded_geo_peaks_match_txt(fixture: str): + """For every block, decoded Tran/Vert/Long peak (count × 0.005) + matches the corresponding .TXT cell.""" + bin_path = _FIXTURE_DIR / fixture + txt_path = _FIXTURE_DIR / (fixture.replace(".", "_") + "_ASCII.TXT") + if not bin_path.exists() or not txt_path.exists(): + pytest.skip("fixture missing") + records = walk_body(_extract_body(bin_path)) + txt_rows = _parse_txt_rows(txt_path) + n = min(len(records), len(txt_rows)) + assert n > 0 + for i in range(n): + rec = records[i] + _ts, txt = txt_rows[i] + # TXT cols 0/2/4 are T/V/L peak in in/s + for slot, key in (("T", "t_peak"), ("V", "v_peak"), ("L", "l_peak")): + col = {"T": 0, "V": 2, "L": 4}[slot] + decoded_ips = geo_count_to_ins(rec[key]) + expected = txt[col] + assert abs(decoded_ips - expected) < 0.0005, ( + f"{fixture} block {i} {slot}_peak: " + f"decoded={decoded_ips:.4f} vs txt={expected:.4f}" + ) + + +@pytest.mark.parametrize("fixture", [ + "N844L6Z8.ZR0H", + "N844L6XE.BH0H", +]) +def test_decoded_geo_freqs_match_txt(fixture: str): + """Decoded half-period → Hz matches the .TXT freq column for blocks + where the freq is in-range (not the `>100 Hz` sentinel).""" + bin_path = _FIXTURE_DIR / fixture + txt_path = _FIXTURE_DIR / (fixture.replace(".", "_") + "_ASCII.TXT") + if not bin_path.exists() or not txt_path.exists(): + pytest.skip("fixture missing") + records = walk_body(_extract_body(bin_path)) + txt_rows = _parse_txt_rows(txt_path) + n = min(len(records), len(txt_rows)) + for i in range(n): + rec = records[i] + _ts, txt = txt_rows[i] + for slot, key, col in (("T", "t_halfp", 1), ("V", "v_halfp", 3), ("L", "l_halfp", 5)): + decoded_hz = half_period_to_hz(rec[key]) + expected = txt[col] + if expected is None: + # TXT shows `>100 Hz` — codec should also yield None + assert decoded_hz is None or decoded_hz > 100, ( + f"{fixture} block {i} {slot}_freq: codec says " + f"{decoded_hz} but TXT says >100" + ) + continue + # TXT rounds; allow ±1 Hz + assert decoded_hz is not None + assert abs(decoded_hz - expected) < 1.0, ( + f"{fixture} block {i} {slot}_freq: " + f"decoded={decoded_hz:.2f} Hz vs txt={expected:.2f} Hz" + ) + + +@pytest.mark.parametrize("fixture", [ + "N844L6XE.BH0H", + "N844L23B.ND0H", + "N844L6Z8.ZR0H", +]) +def test_decoded_mic_db_matches_txt(fixture: str): + """Decoded MicL peak count → dB(L) via mic_count_to_db matches + the .TXT dB(L) column.""" + bin_path = _FIXTURE_DIR / fixture + txt_path = _FIXTURE_DIR / (fixture.replace(".", "_") + "_ASCII.TXT") + if not bin_path.exists() or not txt_path.exists(): + pytest.skip("fixture missing") + records = walk_body(_extract_body(bin_path)) + txt_rows = _parse_txt_rows(txt_path) + n = min(len(records), len(txt_rows)) + for i in range(n): + rec = records[i] + _ts, txt = txt_rows[i] + # TXT col 8 = MicL dB(L) + decoded_db = mic_count_to_db(rec["m_peak"]) + expected = txt[8] + if expected is None: + continue + # BW rounds to 1 decimal place for display. Tolerance 0.1 dB + # absorbs both rounding modes (truncate vs round-half-even). + assert abs(decoded_db - expected) < 0.1, ( + f"{fixture} block {i} M_dB: " + f"decoded={decoded_db:.2f} dB vs txt={expected:.2f} dB" + ) + + +@pytest.mark.parametrize("fixture", [ + "N844L20G.630H", + "N844L6Z8.ZR0H", +]) +def test_decoded_mic_freq_matches_txt(fixture: str): + """Decoded MicL half-period → freq matches the .TXT col 9 freq.""" + bin_path = _FIXTURE_DIR / fixture + txt_path = _FIXTURE_DIR / (fixture.replace(".", "_") + "_ASCII.TXT") + if not bin_path.exists() or not txt_path.exists(): + pytest.skip("fixture missing") + records = walk_body(_extract_body(bin_path)) + txt_rows = _parse_txt_rows(txt_path) + n = min(len(records), len(txt_rows)) + for i in range(n): + rec = records[i] + _ts, txt = txt_rows[i] + decoded_hz = half_period_to_hz(rec["m_halfp"]) + expected = txt[9] + if expected is None: + assert decoded_hz is None or decoded_hz > 100 + continue + assert decoded_hz is not None + assert abs(decoded_hz - expected) < 1.0, ( + f"{fixture} block {i} M_freq: " + f"decoded={decoded_hz:.2f} Hz vs txt={expected:.2f} Hz" + ) + + +# ── Public API ─────────────────────────────────────────────────────────────── + + +def test_decode_histogram_body_returns_four_channels(): + """The public API returns the standard 4-channel dict shape.""" + path = _FIXTURE_DIR / "N844L20G.630H" + if not path.exists(): + pytest.skip("fixture missing") + decoded = decode_histogram_body(_extract_body(path)) + assert decoded is not None + assert set(decoded.keys()) == {"Tran", "Vert", "Long", "MicL"} + # All channels same length (one value per histogram interval) + n = len(decoded["Tran"]) + assert all(len(decoded[ch]) == n for ch in ("Vert", "Long", "MicL")) + assert n > 100 + + +def test_decode_histogram_body_returns_none_for_non_histogram(): + """A waveform-mode body (starts with 00 02 00) doesn't decode as + a histogram body.""" + fake_waveform_body = b"\x00\x02\x00" + b"\x00" * 100 + assert decode_histogram_body(fake_waveform_body) is None + + +def test_decode_histogram_body_returns_none_for_garbage(): + """Bytes that don't form valid blocks return None.""" + assert decode_histogram_body(b"\xff" * 256) is None + + +def test_decode_histogram_body_full_preserves_frequency_data(): + """The structured-record API preserves the per-channel half-period + fields that the flat-channel API drops.""" + path = _FIXTURE_DIR / "N844L20G.630H" + if not path.exists(): + pytest.skip("fixture missing") + records = decode_histogram_body_full(_extract_body(path)) + assert records is not None + r0 = records[0] + expected_fields = { + "segment_id", "block_ctr", + "t_peak", "t_halfp", "v_peak", "v_halfp", + "l_peak", "l_halfp", "m_peak", "m_halfp", + "meta_var", + } + assert set(r0.keys()) >= expected_fields + + +# ── Helpers ────────────────────────────────────────────────────────────────── + + +def test_half_period_to_hz_sentinel(): + """Half-period ≤ 5 returns None (the `>100 Hz` sentinel).""" + assert half_period_to_hz(5) is None + assert half_period_to_hz(1) is None + # halfp=6 gives 512/6 = 85.3 Hz — below the >100 threshold + assert half_period_to_hz(6) == pytest.approx(85.33, abs=0.01) + + +def test_geo_count_to_ins_scale(): + """1 count = 0.005 in/s at Normal range.""" + assert geo_count_to_ins(1) == pytest.approx(0.005) + assert geo_count_to_ins(10) == pytest.approx(0.050) + assert geo_count_to_ins(0) == 0.0