histogram_codec: peak count is uint8 (not uint16 LE); properly cracks the BE9558 / BE18003 extension-byte case

The bytes at [7]/[11]/[15]/[19] are an annotation field (purpose still unclear; empirically non-zero on intervals with sub-Hz or unmeasurable freq), NOT the high byte of the peak count. The N844 fixture corpus the original RE was done against had zero values in those bytes for every block, so uint8 and uint16 LE were equivalent there. On real BE9558 Tran-drift events and BE18003 Histogram+Continuous events, however, the uint16 LE interpretation produced peaks up to 268 in/s and 35× inflated PVS sums.

Cross-correlated against BW's per-interval ASCII export on:
- K558LKZU/LL1P/LL3K → 100% T/V/L/M peak match (1435 blocks each)
- T003LKZR/LL0O/LL1M → 100% T/V/L, 99.3% M (0.05 dB rounding only)
- N599LKZS/LL0L → 100% all channels
- N844 fixture corpus → 100% all channels (unchanged)

Annotations preserved on every record for future RE; the defensive _MAX_PEAK_COUNT bound is no longer needed (uint8 maxes at 1.275 in/s, well below any physical limit). Synthetic regression test added using the verbatim K558LKZU.RE0H interval-12 block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-21 06:05:19 +00:00
parent e949232875
commit d506ebc103
4 changed files with 138 additions and 51 deletions
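
The width issue described in the commit message can be illustrated with a small sketch. This is not the committed code: the helper names and the 32-byte block are synthetic, and only the 0.005 in/s-per-count scale and the byte offsets are taken from the layout table in the diff. The byte pair 0x05 / 0x79 is chosen so that the uint16 LE reading comes out as 30981, one of the bogus values quoted in the removed _MAX_PEAK_COUNT comment.

```python
import struct

# Scale from the layout table: one count is 0.005 in/s.
IN_PER_S_PER_COUNT = 0.005


def peak_as_uint16_le(block: bytes, offset: int) -> int:
    """Old reading: peak spans two bytes, little-endian (hypothetical helper)."""
    return struct.unpack_from("<H", block, offset)[0]


def peak_as_uint8(block: bytes, offset: int) -> int:
    """New reading: peak is the single byte at `offset` (hypothetical helper)."""
    return block[offset]


# Synthetic block: Tran peak byte at [6], annotation byte at [7].
raw = bytearray(32)
raw[6] = 0x05   # peak count
raw[7] = 0x79   # non-zero annotation byte
block_annotated = bytes(raw)

wide = peak_as_uint16_le(block_annotated, 6)
narrow = peak_as_uint8(block_annotated, 6)
print(f"uint16 LE: {wide} counts, uint8: {narrow} counts")
print(f"uint16 LE peak: {wide * IN_PER_S_PER_COUNT:.3f} in/s")
print(f"uint8 peak:     {narrow * IN_PER_S_PER_COUNT:.3f} in/s")

# With a zero annotation byte the two readings coincide, which is why
# a corpus containing only such blocks cannot tell them apart.
raw[7] = 0x00
block_plain = bytes(raw)
wide_plain = peak_as_uint16_le(block_plain, 6)
narrow_plain = peak_as_uint8(block_plain, 6)
print(f"zero annotation: uint16 LE {wide_plain}, uint8 {narrow_plain}")

# Upper bound of the uint8 reading at the 0.005 in/s scale.
max_uint8_peak = 255 * IN_PER_S_PER_COUNT
```

Any non-zero byte at [7] adds a multiple of 256 counts (1.28 in/s) to the two-byte reading, so a single annotated block is enough to distort a summed quantity such as PVS.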
@@ -28,18 +28,32 @@ iterate 32-stride and stop before the tail.
    [1]    segment_id  (uint8)       0x00..0x03 — 256 blocks per segment
    [2:4]  block_ctr  (uint16 LE)    resets each segment (0x0100, 0x0101, …)
    [4:6]  0x000a (uint16 LE)        constant marker (= 10)
-    [6:8]  T_peak_count   uint16 LE  Tran peak (count × 0.005 → in/s)
+    [6]    T_peak_count   uint8      Tran peak (count × 0.005 → in/s, max 1.275 in/s)
+    [7]    T_annotation   uint8      empirically non-zero on intervals with sub-Hz
+                                     or unmeasurable Tran freq; meaning not fully RE'd
    [8:10] T_halfperiod   uint16 LE  Tran half-period in samples (freq = 512 / halfp Hz)
-    [10:12] V_peak_count  uint16 LE
+    [10]   V_peak_count   uint8
+    [11]   V_annotation   uint8
    [12:14] V_halfperiod  uint16 LE
-    [14:16] L_peak_count  uint16 LE
+    [14]   L_peak_count   uint8
+    [15]   L_annotation   uint8
    [16:18] L_halfperiod  uint16 LE
-    [18:20] M_peak_count  uint16 LE  MicL peak (count → dB via mic_count_to_db)
+    [18]   M_peak_count   uint8      MicL peak (count → dB via mic_count_to_db)
+    [19]   M_annotation   uint8
    [20:22] M_halfperiod  uint16 LE  MicL half-period in samples (freq = 512 / halfp Hz)
    [22:24] 0x00 0x00                constant
    [24:28] 4-byte variable          purpose unknown (possibly CRC or timestamp delta)
    [28:32] 0x1e 0x0a 0x00 0x00      constant block-end signature

+NOTE on peak-count width: an earlier interpretation treated the peak
+fields as uint16 LE spanning [6:8] / [10:12] / [14:16] / [18:20].
+That happened to be byte-exact against the N844 fixture corpus only
+because every annotation byte in those fixtures was zero, making
+``uint16 LE == uint8``.  Cross-correlating BE9558 (K558) Tran-drift
+and BE18003 (T003) Histogram+Continuous events against the BW ASCII
+export proved peak is uint8 alone — see test_histogram_codec.py
+and docs/histogram_codec_re_status.md.
+
 Block-identification anchor: ``block[22:24] == b"\\x00\\x00"`` AND
 ``block[28:32] == b"\\x1e\\x0a\\x00\\x00"``.  This is the reliable
 distinguisher from non-block content in the file.
@@ -101,30 +115,6 @@ _BLOCK_SIZE = 32
 # additional validation that we're looking at a real block.
 _BLOCK_MARKER = 10

-# Maximum plausible peak-count value.  The geophone tops out at 10 in/s
-# at Normal range = 2000 counts at the 0.005 in/s per count scale.
-# Sensitive range (1.25 in/s FS) tops at ~250.  Mic peak counts have
-# been observed up to ~400 (≈ 100 dB(L)) and per the protocol doc can
-# reach ~813 (140 dB(L)).  2200 covers Normal full-scale plus ~10%
-# headroom for quantization edge cases while keeping every physically
-# implausible value out of the PVS computation.
-#
-# Some prod blocks have been observed with peak-count fields whose
-# HIGH byte is non-zero (block[7] != 0 etc.) — observed across BE9558
-# and BE18003 units in Histogram-mode events.  Reading these as
-# uint16 LE produces values like 30981 / 41733 / 62469, which scale
-# to physically impossible peaks (150+ in/s).  Best guess: an
-# undocumented "time-of-peak-within-interval" extension byte the
-# device writes in some sub-mode (possibly Histogram+Continuous).
-# Until reverse-engineered, blocks exceeding this bound are skipped
-# rather than propagating bogus values into PVS computations.
-#
-# Earlier we tried 4096 — that allowed peak counts up to 4096 × 0.005
-# = 20.48 in/s per channel, which produced 35× inflated PVS sums when
-# the extension-byte blocks slipped through.  See feat/wire-histogram-codec
-# branch history for the rollback.
-_MAX_PEAK_COUNT = 2200
-
 # Geo peak scaling: stored as "count × 0.005 in/s" where 1 count = one
 # 0.005 in/s display quantum.  Equivalent to the waveform codec's
 # 16-count-unit output (1 unit = 0.005 in/s = 16 ADC counts).
@@ -156,23 +146,36 @@ def _decode_block(block: bytes) -> Optional[dict]:
    """Decode one 32-byte histogram block.  Caller must have validated
    with ``_is_data_block`` first.

-    Returns ``None`` if any peak field exceeds ``_MAX_PEAK_COUNT`` —
-    those blocks contain an undocumented extension byte format whose
-    naive uint16 LE interpretation gives physically impossible peaks.
-    Skipping the block is safer than propagating bogus values into
-    PVS computations downstream.
+    Returns a record with per-channel peak counts (uint8) and
+    half-periods (uint16 LE).
    """
-    # All 16-bit fields are little-endian unsigned.  Peak counts are
-    # always non-negative; half-periods are always positive when valid.
-    t_peak, t_halfp, v_peak, v_halfp, l_peak, l_halfp, m_peak, m_halfp = struct.unpack_from(
-        "<HHHHHHHH", block, 6
-    )
-    if (t_peak > _MAX_PEAK_COUNT or v_peak > _MAX_PEAK_COUNT
-            or l_peak > _MAX_PEAK_COUNT or m_peak > _MAX_PEAK_COUNT):
-        return None
+    # Peak counts are uint8 at bytes [6] / [10] / [14] / [18].  The
+    # adjacent bytes [7] / [11] / [15] / [19] hold an annotation field
+    # whose meaning isn't fully understood (empirically non-zero in
+    # intervals with sub-Hz or unmeasurable geo frequencies, mostly
+    # zero otherwise — see test fixtures from BE9558/BE18003 corpora).
+    # Crucially, those annotation bytes are NOT the high byte of the
+    # peak count: cross-correlating against BW's per-interval ASCII
+    # export proves the peak is uint8 alone.
+    #
+    # Reading the peak as uint16 LE (the original interpretation) was
+    # accidentally correct only because every block in the N844 fixture
+    # corpus had a zero annotation byte; non-N844 events with non-zero
+    # annotation bytes decoded to physically impossible peaks (e.g.
+    # 268 in/s per channel) and produced 35× inflated PVS sums when
+    # first run against prod data.  See histogram_codec_re_status.md.
+    t_peak = block[6]
+    v_peak = block[10]
+    l_peak = block[14]
+    m_peak = block[18]
+    t_halfp = block[8]  | (block[9]  << 8)
+    v_halfp = block[12] | (block[13] << 8)
+    l_halfp = block[16] | (block[17] << 8)
+    m_halfp = block[20] | (block[21] << 8)
    segment_id = block[1]
    block_ctr  = block[2] | (block[3] << 8)
    var_meta   = bytes(block[24:28])
+    annotations = (block[7], block[11], block[15], block[19])
    return {
        "segment_id":  segment_id,
        "block_ctr":   block_ctr,
@@ -185,6 +188,7 @@ def _decode_block(block: bytes) -> Optional[dict]:
        "m_peak":      m_peak,
        "m_halfp":     m_halfp,
        "meta_var":    var_meta,
+        "annotations": annotations,
    }


@@ -192,10 +196,15 @@ def walk_body(body: bytes) -> List[dict]:
    """Walk the body and return one dict per histogram interval.

    Iterates 32-byte strides from offset 0.  Yields a decoded record
-    for every block that passes ``_is_data_block`` validation AND has
-    plausible peak values (``_decode_block`` returns None for blocks
-    with out-of-bound peaks).  Stops when the remaining bytes are too
-    short to form a complete block.
+    for every block that passes ``_is_data_block`` validation.  Stops
+    when the remaining bytes are too short to form a complete block.
+
+    In Histogram+Continuous mode the body interleaves data blocks with
+    other 32-byte content (likely continuous-mode waveform blocks) that
+    fail the data-block validation; the walker naturally skips them
+    without losing 32-byte alignment.  Use ``block_ctr`` from each
+    returned record to map back to the original interval index — the
+    record list is sparse when other block types are interleaved.
    """
    records: List[dict] = []
    for off in range(0, len(body) - _BLOCK_SIZE + 1, _BLOCK_SIZE):