Files
seismo-relay/docs/waveform_codec_re_status.md
T
Claude ce5dc640ba codec-re: quiet bundle decodes FULLY (17k samples, zero errors)
User asked the right question: do events without 30 NN blocks decode
fully?  Answer: YES.

  event-a:  Tran 3328 ✓  Vert 3328 ✓  Long 3328 ✓  (28 segments, 0 '30 NN')
  event-c:  Tran 1280 ✓  Vert 1280 ✓  Long 1280 ✓  (12 segments, 0 '30 NN')
  event-d:  Tran 1280 ✓  Vert 1280 ✓  Long 1280 ✓  (12 segments, 0 '30 NN')

17,664 ADC samples decoded byte-exact against BW's ASCII export.
Zero divergences across event-a, event-c, event-d.

This means the codec is FULLY SOLVED for any event without 30 NN
blocks.  The remaining gap is the 30 NN block format only — used for
high-amplitude regions where deltas exceed int8 range.  For quiet
events (or quiet stretches of loud events), the decoder is complete.

9 new regression tests bring the total to 55, all passing.

Files: tests/test_waveform_codec.py + docs/waveform_codec_re_status.md
+ new analysis/verify_quiet_bundle.py.
2026-05-20 17:28:54 +00:00

225 lines
9.7 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Waveform body codec — current working status (2026-05-11, late)
This is the **clean working note** for the body-codec reverse-engineering
effort. It supersedes scattered claims elsewhere when they conflict.
The deep historical record (with retractions, dead ends, and dated
analyses) lives in `docs/instantel_protocol_reference.md §7.6.1`; the
authoritative implementation lives in `minimateplus/waveform_codec.py`.
## TL;DR
The Blastware waveform-file body is a **tagged variable-length block
stream**, NOT raw int16 LE samples. Block framing is solved. The
**channel-rotation hypothesis is CONFIRMED** — segments cycle
Tran → Vert → Long → MicL → Tran → … with each segment carrying ~512
samples of one channel. Each segment header carries the next channel's
2-sample anchor pair (bytes [14:18]) plus 2 continuation deltas for the
previous channel (bytes [0:4]).
**What decodes byte-exact today (verified against BW ASCII export):**
**Quiet events with zero `30 NN` blocks — decode FULLY across all channels:**
| Event | Channel | Samples verified | `30 NN` blocks |
|---|---|---|---|
| **event-a** (5-8-26) | Tran / Vert / Long | **3328 each × 3 = 9984** | 0 |
| **event-c** (5-8-26) | Tran / Vert / Long | **1280 each × 3 = 3840** | 0 |
| **event-d** (5-8-26) | Tran / Vert / Long | **1280 each × 3 = 3840** | 0 |
That's **17,664 ADC samples decoded byte-exact, zero errors**.
**Loud events with `30 NN` blocks — decode up to the first `30 NN`:**
| Event | Channel | Samples verified |
|---|---|---|
| V70 (Mic-heavy) | Tran / Vert / Long | 512 each (1 segment) |
| JQ0 (Vert-heavy) | Tran | 512 |
| JQ0 | Vert | 258 |
| SP0 (loud all) | Long | **1536 (all 3 L segments)** |
| SP0 | Tran | 1350 (diverges at first `30 NN`) |
| SP0 | Vert | 650 (diverges at first `30 NN`) |
**What's still open — ONLY the `30 NN` block format.** These blocks
appear in high-amplitude regions (deltas exceeding what int8 can
express). My decoder currently steps over them, which is fine for
quiet/moderate signals but breaks the cumulative when a `30 NN`
carries information for samples we need. **Quiet events without
`30 NN` decode 100% correctly across all channels.** Cracking
`30 NN` is the last piece.
**Production code in `minimateplus/client.py:_decode_a5_waveform` still
uses the broken legacy int16 LE decoder.** Sample arrays it writes to
the `.h5` sidecars are wrong and must be treated as "unverified" by all
downstream consumers. The BW binary write path (`blastware_file.py`)
is unaffected — it's pure passthrough and remains byte-perfect.
## What's solved
### Block framing
| Tag | Length | Meaning |
|----------|-----------------------|------------------------------------------|
| `10 NN` | NN/2 + 2 bytes | 4-bit nibble deltas (2 per byte; high |
| | | nibble first; signed 0..7 / 8..F = -8..-1)|
| `20 NN` | NN + 2 bytes | int8 signed deltas (1 per byte) |
| `00 NN` | 2 bytes | RLE: append NN copies of current value |
| `30 NN` | NN*2 in data section, | Unknown content. Only in loud-from- |
| | NN*4 in trailer | start events. |
| `40 02` | 20 bytes (fixed) | Segment header |
NN is always a multiple of 4.
Implementation: `walk_body()` in `minimateplus/waveform_codec.py`.
### 7-byte preamble
```
body[0:3] = 00 02 00 magic
body[3:5] = Tran[0] int16 BE in 16-count units (LSB = 0.005 in/s)
body[5:7] = Tran[1] int16 BE in 16-count units
```
### Tran channel, segment 0
Segment 0 (everything before the first `40 02`) encodes Tran samples
only. Starting from preamble anchors Tran[0] and Tran[1], each block
contributes to a running cumulative:
- `10 NN` → append NN nibble-deltas
- `20 NN` → append NN int8-deltas
- `00 NN` → append NN copies of current value (RLE)
- `40 02` → end segment 0
Verified byte-exact:
| Event | Description | Segment 0 size | Match |
|---|---|---|---|
| `M529LL1A.SP0` | Loud, 0.25 s pretrig | 510 | 510/510 ✓ |
| `M529LL1A.SV0` | Loud from sample 0 | 58 | 58/58 ✓ (stops at first `30 NN`) |
| `M529LL1A.SS0` | Loud from sample 0 | 42 | 42/42 ✓ (stops at first `30 04`) |
| `M529LL1L.JQ0` | Vert-heavy | 510 | 510/510 ✓ |
| `M529LL1L.V70` | Mic-heavy (140 dB) | 510 | 510/510 ✓ |
Implementation: `decode_tran_initial()`.
### Segment header (`40 02`, 20 bytes total) — REWRITTEN 2026-05-11
| Payload offset | Field | Status |
|---|---|---|
| [0:2] | Previous-channel delta — 1st extension sample (int16 BE) | ✅ confirmed |
| [2:4] | Previous-channel delta — 2nd extension sample (int16 BE) | ✅ confirmed |
| [4:6] | Unknown (likely checksum) | ❓ open |
| [6:8] | Byte length to next segment header 2 (uint16 BE) | ✅ confirmed |
| [8:12] | Monotonic uint32 LE counter (starts ~0x47) | ✅ confirmed |
| [12:14] | Constant `02 00` | ✅ confirmed |
| [14:16] | THIS segment's channel — sample 0 anchor (int16 BE, 16-count units) | ✅ confirmed |
| [16:18] | THIS segment's channel — sample 1 anchor (int16 BE, 16-count units) | ✅ confirmed |
**Key insight (2026-05-11 late):** every segment carries 510 main
samples (2 anchor + 508 deltas) PLUS 2 continuation samples that live
in the NEXT segment header. So each channel-segment effectively spans
512 sample-sets. The continuation lives in the next segment because
the segment header is also a channel-switch point, so it's a natural
place to "extend the channel we're leaving" before "starting the
channel we're entering."
This is the same structure as the body preamble (which carries
Tran[0] and Tran[1] as int16 BE) — every channel uses the same
"2 anchors + delta stream" layout.
## Channel rotation — VERIFIED 2026-05-11
```
(initial body) → Tran samples 0..509 (preamble + delta blocks)
segment 0 hdr ext+anchor → Vert samples 0..511 ← anchor in hdr [14:18]
segment 1 hdr ext+anchor → Long samples 0..511
segment 2 hdr ext+anchor → Mic samples 0..511
segment 3 hdr ext+anchor → Tran samples 510..1021 (continuation)
segment 4 hdr ext+anchor → Vert samples 512..1023
segment 5 hdr ext+anchor → Long samples 512..1023
segment 6 hdr ext+anchor → Mic samples 512..1023
segment 7 hdr ext+anchor → Tran samples 1022..1533
...
```
Implementation: `decode_waveform_v2()` returns
`{"Tran": [...], "Vert": [...], "Long": [...], "MicL": [...]}` with
each channel's samples in 16-count units. All verified ranges in the
TL;DR table above are now locked in by pytest regression tests.
## What's still open
1. **`30 NN` block content.** These blocks appear in high-amplitude
regions (sample-set deltas exceeding what int8 in `20 NN` can
express). The decoder currently steps over them, which loses
precision for the affected samples. Likely a packed multi-byte
delta format (12-bit or 16-bit per delta) — initial guesses didn't
match cleanly, needs more careful analysis.
2. **MicL decoding.** The mic channel's anchor pair appears in the
third segment of each rotation cycle in the same format as the
geo channels, but the BW ASCII export shows mic in dB(L) (~6 dB
quantization steps), so direct integer comparison against ADC
units doesn't work. Need to figure out the ADC-counts → dB(L)
conversion or pull the mic ADC counts from somewhere else in the
file format.
3. **Walker fix for event-b.** The original quiet bundle's event-b
still bails out partway through. Lower priority since the other
7 events walk cleanly.
## Next experiment — crack the `30 NN` block
The scoring analyzer in `scratch/next_experiment_skeleton.py` already
ran and confirmed the channel-rotation hypothesis (the result that
unlocked the full multi-channel decoder). The next open piece is the
`30 NN` block format.
Approach:
1. Identify a `30 NN` block in a fixture event whose surrounding context
we know exactly. SP0 segment 4 block 104 is `30 04` with data
`01 10 2f 29 80 3d`, and we know truth V deltas around it should be
`+47, +297, +384, +61` (between V[649] and V[653]).
2. Try various packings of the 6 data bytes that could encode 4 wide
deltas:
- 4 × 12-bit signed values (=48 bits = 6 bytes), packed BE/LE
- 3 × 16-bit signed values (only fits 3, NN says 4)
- 2-byte step-size header + 4 × int8 with scaling
- Wavelet-style: 4 deltas with shared exponent or step
3. Initial brute-force found `+47` and `+61` in positions 1 and 3 of
a 12-bit BE packing, but `+297` and `+384` didn't fit cleanly.
Worth re-trying with more permutations.
Once cracked, the `30 NN` decoder slots into `decode_waveform_v2` and
the multi-channel decode extends past the high-amplitude regions.
## Test fixtures
Committed under `tests/fixtures/`:
- `decode-re-5-8-26/event-a..event-d/`: original quiet bundle (4 events,
PPV < 1 in/s). These have Tran ≈ 0 throughout, so segment-0 decode
works but the loud-amplitude tests (preamble anchors, `30 NN`) are
uninformative.
- `5-11-26/M529LL1A.{SP0,SS0,SV0}`: loud bundle (PPV 6-7 in/s on all
channels). These cracked the Tran codec.
- `5-11-26/M529LL1L.{JQ0,V70}`: targeted captures. JQ0 is Vert-heavy,
V70 is Mic-heavy (140 dB). These cracked the `00 NN` RLE rule.
Each fixture has a `.TXT` Blastware ASCII export as ground truth.
## Tests
`tests/test_waveform_codec.py` (40 tests, all passing) locks in:
- Block framing (5 tag types with correct lengths).
- Walker contiguity (no gaps or overlaps).
- Segment header parsing (counter monotonicity, fixed-pattern check).
- `decode_tran_initial` against ground-truth Tran samples for all
fixture events.
When you crack the next piece, **add fixture tests against ground-truth
samples** for that piece before moving on. Don't let unverified code
ship without a regression lock-in.