codec-re: solve waveform body block framing; per-byte sample mapping still open

Decoded the structural framing of the Blastware waveform body — the bytes
between the 21-byte STRT record and the 26-byte file footer.  The body is
a sequence of tagged variable-length blocks, NOT raw int16 LE.  Five tag
types (10/20/00/30/40 NN) and their lengths are now confirmed against the
4-event May 2026 fixture bundle.  Body splits cleanly into ~16 segments
(for a 1280-sample event) separated by 40 02 segment headers carrying a
monotonically incrementing uint32 LE counter at bytes [8:12].

What's done:
- minimateplus/waveform_codec.py — block walker, segment splitter, segment
  header parser.  decode_waveform_v2 is a stub returning None until the
  byte-to-sample mapping is solved; client.py is unchanged.
- tests/test_waveform_codec.py — 31 tests covering block detection, lengths,
  contiguous-walk, segment splitting, segment-header parsing, and counter
  monotonicity.  All pass.
- tests/fixtures/decode-re-5-8-26/ — bundled fixtures (4 events, BW binary
  + Blastware ASCII export each).
- docs/instantel_protocol_reference.md §7.6.1 — replaced retraction box
  with the verified structural decoding plus an explicit list of what's
  still open.

What's still open: the per-byte mapping inside 10 NN / 20 NN blocks.  96
channel-permutation × nibble-order × sign-convention combinations were
brute-force tested; none match BW's ASCII export to within ±1 ADC count.
The codec is more elaborate than uniform 4-bit deltas — likely a hybrid
variable-bit-width scheme with segment-anchor resync points.  Next
recommended step: capture an event with a known calibration tone to pin
down magnitude scaling.

Walker also bails out partway through event-b (open issue documented in
both the module and the protocol reference).
This commit is contained in:
Claude
2026-05-08 20:44:37 +00:00
committed by serversdown
parent 7bd0f8badf
commit d3f77d1d96
29 changed files with 10102 additions and 105 deletions
+138 -105
View File
@@ -860,127 +860,160 @@ MicL: 39 64 1D AA = 0.0000875 psi
---
#### 7.6.1 Blast / Waveform mode — ❌ NOT VERIFIED (retracted 2026-05-08)
#### 7.6.1 Blast / Waveform mode — 🟡 STRUCTURAL FRAMING DECODED (2026-05-08)
> ## ⚠️ RETRACTION (2026-05-08)
> **Status (2026-05-08):** Block-level framing is solved and verified
> against the 4-event May 8 2026 bundle (3 sec / 2 sec / 1 sec / 1 sec
> events captured live from BE11529). The per-byte mapping from block
> data to ADC samples is **still open** — the previous int16 LE claim
> is REFUTED (see history below).
>
> The "4-channel interleaved s16 LE, 8 bytes per sample-set" claim
> below was **never actually validated**. It got into this document
> because the decoder built around that assumption produced full-scale
> ±32K counts on every channel of the 4-2-26 capture, and the
> ±32K-shaped output was misread as "the signal must have saturated."
>
> Cross-checking the BW-reported peaks proves the opposite:
>
> | Channel | BW PPV (in/s) | Expected ADC counts at 10 in/s FS |
> |---|---|---|
> | Tran | 0.420 | **1,376** |
> | Vert | 3.870 | **12,686** |
> | Long | 0.495 | **1,622** |
>
> None of these are anywhere near ±32K saturation. No event in the
> project's archive (across all captures from 1-2-26 onward) has
> ever come close to saturation either. Yet the decoder has
> consistently produced ±32K-shaped noise on every event. The right
> conclusion is that the byte-to-sample interpretation has been wrong
> the whole time, NOT that every event happened to saturate.
>
> What's actually known about the body bytes:
>
> - The byte distribution is heavily skewed (24% `0x00`, 10.5% `0x10`,
> plus high frequencies of `0x01 / 0x04 / 0x0F / 0xF0 / 0xF1`). Lots
> of `10 XX` pairs. Reading them as LE int16 produces uniform ±32K
> noise — the signature of mis-aligned or encoded data.
> - The CHANGELOG note for v0.14.2 calls the body a "delta-encoded
> ADC stream" — that hint plus the byte distribution points toward
> a delta encoding with `0x10` as an escape marker, but no decoder
> has been worked out yet.
> - The histogram-mode codec in §7.6.2 IS verified and decoded
> correctly (different format: 32-byte blocks with 9× int16 LE
> samples + metadata). The same firmware emits both formats, so
> §7.6.2 may share encoding primitives with the waveform codec
> and is worth using as a structural hint when reverse-engineering.
>
> **Treat the spec below as a starting hypothesis to disprove, not
> ground truth.** The frame-layout pieces (STRT location, preamble,
> chunk header) appear correct; the per-byte sample interpretation
> is the open question.
> The earlier "4-channel interleaved s16 LE, 8 bytes per sample-set"
> claim was never validated and was wrong. No event in the project's
> archive ever came close to ADC saturation, yet the int16 LE decoder
> consistently produced full-scale ±32K noise — that was the signature
> of mis-aligned encoded data, not signal saturation.
4-channel interleaved signed 16-bit little-endian, 8 bytes per sample-set:
##### Body file layout
A Blastware waveform-file body (the variable-length section between
the 21-byte STRT record and the 26-byte file footer) is composed of
**tagged variable-length blocks**, NOT raw int16 samples.
```
[T_lo T_hi V_lo V_hi L_lo L_hi M_lo M_hi] × N sample-sets
[preamble: 7 or 9 bytes]
[stream of tagged blocks]
[trailer: per-channel summary blocks]
```
- **T** = Transverse (Tran), **V** = Vertical (Vert), **L** = Longitudinal (Long), **M** = Microphone
- Channel order follows the Blastware convention: Tran is always first (ch[0]).
- Encoding: signed int16 little-endian. Full scale = ±32768 counts.
- Sample rate: set by compliance config (typical: 1024 Hz for blast monitoring).
- Each A5 frame chunk carries a different number of waveform bytes. Frame sizes
are NOT multiples of 8, so naive concatenation scrambles channel assignments at
frame boundaries. **Always track cumulative byte offset mod 8 to correct alignment.**
**Preamble:** starts with the 4-byte magic ``00 02 00 00``. Single-shot
events have a 7-byte preamble; continuous events have a 9-byte preamble
(the 4 events in the May 8 2026 bundle split 2/2 between the two
lengths). Bytes [4:9] of the preamble appear to encode initial
per-channel state but the layout has not been pinned down — for some
events byte [4] equals truth Tran[0] in 16-count units (0.005 in/s
LSB), but other channel-byte assignments don't fit consistently.
**A5[0] frame layout:**
##### Block tags (CONFIRMED 2026-05-08)
Every block starts with a 2-byte tag. Five tag types are confirmed:
| Tag (hex) | Block type | On-wire length |
|-----------|-------------------------------------|-----------------------|
| ``10 NN`` | Small-delta data block | NN/2 + 2 bytes |
| ``20 NN`` | Literal data block (int8-shaped) | NN + 2 bytes |
| ``00 NN`` | 2-byte marker between data blocks | 2 bytes |
| ``30 NN`` | Trailer summary block | NN × 4 bytes |
| ``40 02`` | Segment header | 20 bytes (fixed) |
NN is always a multiple of 4. ``10 NN`` and ``20 NN`` data blocks
alternate with ``00 NN`` markers — every ``10/20 NN`` block is
followed by a ``00 NN`` marker before the next data block.
##### Segments
The body is divided into ~16 SEGMENTS for a 1280-sample event (= 1
segment per ~80 sample-sets), separated by ``40 02`` segment headers.
A 3328-sample event has ~42 segments.
The 18-byte ``40 02`` payload structure (CONFIRMED across all 4
fixtures by inspecting the increment of bytes [8:12]):
| Offset | Length | Field |
|--------|--------|--------------------------------------------------|
| 0 | 4 | Anchor / channel state (open — see below) |
| 4 | 4 | Variable field (open) |
| 8 | 4 | uint32 LE counter — increments by 1 per segment |
| 12 | 4 | Fixed pattern ``02 00 00 01`` |
| 16 | 2 | Variable tail |
The counter at bytes [8:12] starts in the 0x40s for a freshly-erased
device and increments cleanly — useful as a structural sanity check.
Examples from event-c (1 sec single-shot):
```
db[7:]: [11-byte header] [21-byte STRT record] [6-byte preamble] [waveform ...]
STRT: offset 11 in db[7:]
+0..3 b'STRT' magic
+8..9 uint16 BE total_samples (full-record expected sample-set count)
+16..17 uint16 BE pretrig_samples (pre-trigger window, in sample-sets)
+18 uint8 rectime_seconds
preamble: +19..20 0x00 0x00 null padding
+21..24 0xFF × 4 synchronisation sentinel
Waveform: starts at strt_pos + 27 within db[7:]
Segment header 1 (offset 235):
40 02 | 00 00 00 00 | 0a 4b 01 1e | 47 00 00 00 | 02 00 00 01 | 00 01
^counter=0x47
Segment header 2 (offset 523):
40 02 | ff fe ff fe | 13 f5 01 06 | 48 00 00 00 | 02 00 00 01 | 00 02
^counter=0x48 (+1)
```
**A5[1..N] frame layout (non-metadata frames):**
##### Trailer
```
db[7:]: [8-byte per-frame header] [waveform ...]
Header: [counter LE uint16, 0x00 × 6] — frame sequence counter (0, 8, 12, 16, 20, …×0x400)
Waveform: starts at byte 8 of db[7:]
```
The trailer (after the last segment's data) is a sequence of 32-byte
``30 08`` blocks plus a final ``30 04`` / ``20 04`` / ``40 02`` summary
ending in the constant 2-byte tail ``00 1A``. These contain
per-channel statistics (peak times, peak values, mean offsets — bytes
in the form ``f3/f4/f5`` near ``20 10`` markers strongly resemble
int8 channel-bias values around -12). Detailed decoding of the
trailer is outside the path needed for sample reconstruction.
**Special frames:**
##### What's still open
| Frame index | Contents |
- **The byte → sample mapping inside ``10 NN`` and ``20 NN`` blocks.**
Tested hypotheses that did not match BW's ASCII export to within ±1
ADC count:
1. ``10 NN`` data = 4-bit signed nibble deltas, channel-interleaved,
all 24 channel permutations × 2 nibble orders × 2 sign conventions
× 2 init-from-header settings (= 96 combinations). All produce
values that diverge from truth after the first ~7 sample-sets.
2. ``20 NN`` data = int8 absolute or delta samples for one channel.
Magnitudes in observed blocks (peak ±34 in event-c at offset 351)
do not match any channel's PPV at any plausible ADC quantization
(1-count, 4-count, 8-count, 16-count).
3. ``00 NN`` marker = "skip N sample-sets with zero deltas". Sums
of NN/4 across markers do not consistently match the 80
sample-sets-per-segment count.
The codec is more elaborate than uniform 4-bit deltas. A hybrid
variable-bit-width scheme (4-bit deltas in ``10 NN``, 8-bit deltas
or absolutes in ``20 NN``, segment-header anchors after each
``40 02``) is the most plausible remaining hypothesis.
- **The role of byte [4:9] of the preamble.** Byte 4 == Tran[0]
truth value (in 16-count units) for events a/b/d, but doesn't
fit consistently for event-c. Bytes [5:9] don't match a simple
per-channel encoding.
- **Walker correctness past offset ~427 in event-b.** The walker
bails out partway through event-b — there is at least one block
whose length doesn't fit the lengths confirmed for the other
three events. Likely a ``20 NN`` with NN > 0xFC (currently
rejected by the walker), or a different length formula in some
context.
##### Recommended next step
A capture with a known external waveform (calibration tone of known
frequency and amplitude) would unlock the magnitude scaling and
disambiguate which channel a ``20 NN`` block belongs to. Multiple
captures of the same signal at different ``geo_range`` settings
(Normal 10 in/s vs Sensitive 1.25 in/s) would also pin down whether
sample values are scaled at the codec layer or only at the BW
display layer.
##### Reference module
``minimateplus/waveform_codec.py`` implements the verified block
walker (:func:`walk_body`, :func:`split_segments`,
:func:`parse_segment_header`). ``decode_waveform_v2`` is a stub that
returns ``None`` until a verified per-byte sample decoder is wired
up; production code (``minimateplus/client.py``) continues to use
the legacy int16 LE decoder, which produces wrong samples but stable
output shape — keep the ``.h5`` sidecars marked as
"sample-codec unverified" until the byte-to-sample mapping lands.
##### History (do not re-derive)
| Date | Note |
|---|---|
| A5[0] | Probe response: STRT record + first waveform chunk |
| A5[7] | Event-time metadata strings only (no waveform data) |
| A5[9] | Terminator frame (page_key=0x0000) — ignored |
| A5[1..6,8] | Waveform chunks |
**Confirmed from 4-2-26 blast capture (total_samples=9306, pretrig=298, rate=1024 Hz):**
```
Frame Waveform bytes Cumulative Align(mod 8)
A5[0] 933B 933B 0
A5[1] 963B 1896B 5
A5[2] 946B 2842B 0
A5[3] 960B 3802B 2
A5[4] 952B 4754B 2
A5[5] 946B 5700B 2
A5[6] 941B 6641B 4
A5[8] 992B 7633B 1
Total: 7633B → 954 naive sample-sets, 948 alignment-corrected
```
Only 948 of 9306 sample-sets captured (10%) — `stop_after_metadata=True` terminated
download after A5[7] was received.
**Channel identification note:** Channel ordering [Tran, Vert, Long, Mic] = [ch0, ch1, ch2, ch3]
is the Blastware convention. This ordering has not been independently verified end-to-end,
since no decoder yet produces samples that match BW's own rendering of the same event (see
the retraction at the top of §7.6.1). Once the body codec is decoded, the per-channel PPV
values from the 0C record (Tran=0.420, Vert=3.870, Long=0.495 in/s for the 4-2-26 capture)
provide the cross-check that pins down channel order.
> **Historical note:** earlier revisions of this section claimed the 4-2-26 blast had
> "saturated all four channels to ~3200032617 counts," citing that as evidence the s16 LE
> interpretation was correct. That claim was wrong — the ±32K values were the broken
> decoder's output, not the actual signal amplitude (which the 0C peaks above show was
> nowhere near saturation). Retracted 2026-05-08.
| 2026-05-08 | Block tagging confirmed against the 4-event May 2026 bundle. All bodies parse cleanly through `walk_body` for events a/c/d. Event-b walks partway and stops at offset 427 (open issue). |
| 2026-05-08 | Earlier "4-channel interleaved s16 LE" claim formally retracted — never validated, produced full-scale ±32K noise on every event because the bytes are encoded, not raw samples. |
| 2026-04-02 | "Frame 7 metadata", "Frame 9 terminator", and `0x0400`-step chunk-counter claims documented as-was; later proved to be artifacts of an over-reading 5A walk (now superseded by §7.8.57.8.7). |
---