# seismo-relay/minimateplus/waveform_codec.py

"""
waveform_codec.py — block-walker and partial decoder for the MiniMate Plus
waveform-file body.

PARTIAL REVERSE-ENGINEERING — last updated 2026-05-11.

The Blastware waveform-file body — the bytes between the 21-byte STRT
record and the 26-byte file footer — is NOT raw int16 LE samples (the
historical assumption that produced full-scale ±32K noise on every
event).  It is a tagged variable-length block stream with a custom
delta + RLE codec.

Current status:

- Block framing: ✅ solved (block types and lengths all confirmed)
- Tran channel, segment 0: ✅ solved (decode_tran_initial returns
  byte-exact values vs BW's ASCII export, across 5 of 5 loud-bundle
  events; first ~510 samples per event)
- Multi-segment Tran continuation: ❌ open (every hypothesis breaks
  at the segment-1 boundary around sample 512)
- Vert / Long / Mic channel decoders: ❌ open
- 30 NN block content: ❌ open (only appears in loud-from-start events)

Production code in client.py still uses the broken int16 LE decoder.
``decode_waveform_v2`` here returns ``None`` as a placeholder.  Callers
that need sample arrays should treat the legacy decoder's output as
"unverified" — the BW binary write path is the only sample-bearing
output that is currently trustworthy.

────────────────────────────────────────────────────────────────────────────
Body layout (CONFIRMED 2026-05-11 against 8 fixture events)
────────────────────────────────────────────────────────────────────────────

    [7-byte preamble] [stream of tagged blocks] [trailer]

The preamble is always exactly 7 bytes:

    body[0:3]  = 00 02 00              magic
    body[3:5]  = Tran[0]   int16 BE    in 16-count units (LSB = 0.005 in/s)
    body[5:7]  = Tran[1]   int16 BE    in 16-count units

(Earlier drafts of this module described a "7-or-9-byte preamble";
that was wrong — single-shot and continuous events both use 7 bytes.
The "extra 2 bytes" on continuous events were the first ``00 NN`` RLE
marker, not part of the preamble.)
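
As a minimal sketch of the preamble layout above (the helper name
``parse_preamble`` and the sample bytes are hypothetical and not part of
this module), the two anchors can be read like this:

```python
def parse_preamble(body: bytes):
    # Hypothetical helper: split the 7-byte preamble into its two Tran anchors.
    magic = bytes.fromhex("000200")
    if len(body) < 7 or body[0:3] != magic:
        return None
    tran0 = int.from_bytes(body[3:5], "big", signed=True)
    tran1 = int.from_bytes(body[5:7], "big", signed=True)
    return tran0, tran1


# Synthetic bytes (not from a real event): Tran[0] = 5, Tran[1] = -2.
example = bytes.fromhex("000200" "0005" "fffe")
print(parse_preamble(example))  # (5, -2)
```

Both values are still in 16-count units; no scaling is applied here.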

Block types and lengths (all confirmed):

| Tag      | Length                | Meaning                                |
|----------|-----------------------|----------------------------------------|
| ``10 NN``| NN/2 + 2 bytes        | Signed 4-bit deltas, 2 per byte, high  |
|          |                       | nibble first (0..7 = 0..+7,            |
|          |                       | 8..F = -8..-1)                         |
| ``20 NN``| NN + 2 bytes          | int8 signed deltas (1 per byte)        |
| ``00 NN``| 2 bytes               | RLE: append NN copies of current value |
| ``30 NN``| NN*2 bytes (data),    | Unknown content.  Only in loud events. |
|          | NN*4 bytes (trailer)  |                                        |
| ``40 02``| 20 bytes (fixed)      | Segment header                         |

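All lengths include the 2-byte tag.  For the variable-length tags (``00``,
``10``, ``20``, ``30``) NN is always a multiple of 4; ``40 02`` is a fixed tag.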
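
The length rules in the table can be sketched as a single lookup.  This is
an illustration only: ``block_length`` and its ``in_trailer`` flag are
hypothetical (the real walker has no such flag and guesses the ``30 NN``
variant from what follows the block).

```python
def block_length(tag_hi: int, nn: int, in_trailer: bool = False):
    # Hypothetical helper: total on-the-wire length (tag included), per the table.
    if tag_hi == 0x40 and nn == 0x02:
        return 20                      # fixed-size segment header
    if nn % 4 != 0:
        return None                    # NN must be a multiple of 4
    if tag_hi == 0x10:
        return nn // 2 + 2             # two 4-bit deltas per payload byte
    if tag_hi == 0x20:
        return nn + 2                  # one int8 delta per payload byte
    if tag_hi == 0x00:
        return 2                       # RLE marker carries no payload
    if tag_hi == 0x30:
        return nn * 4 if in_trailer else nn * 2
    return None                        # unrecognized tag


print(block_length(0x10, 8))         # 6
print(block_length(0x30, 8, True))   # 32
```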

────────────────────────────────────────────────────────────────────────────
Tran channel, segment 0 (CONFIRMED 2026-05-11)
────────────────────────────────────────────────────────────────────────────

Segment 0 — everything before the first ``40 02`` segment header — encodes
Tran samples only.  Starting from preamble anchors Tran[0] and Tran[1],
each subsequent block contributes to the running Tran value:

    10 NN  →  append NN deltas (4-bit signed nibbles)
    20 NN  →  append NN deltas (int8 signed bytes)
    00 NN  →  append NN copies of the current value (RLE zeros)
    40 02  →  segment 0 ends; multi-segment continuation is open

This decodes the first 482–510 samples of Tran for each event with zero
errors against BW's ASCII export.  The exact segment-0 sample count
varies per event (it's bounded by a fixed device-flash byte budget, not
a fixed sample count — quiet events fit more samples because zero
deltas pack into ``00 NN`` markers compactly).

Implementation: :func:`decode_tran_initial`.
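
A self-contained restatement of the segment-0 rules, run on a synthetic
body, may make the block semantics easier to check by hand.  The function
name ``decode_segment0`` and the input bytes are hypothetical; the sketch
assumes a well-formed body and stops at anything it does not recognize
(including ``30 NN``, which the real walker steps over).

```python
def s4(n):
    # 4-bit two's complement: 0..7 stay as-is, 8..F map to -8..-1.
    return n if n < 8 else n - 16


def i8(b):
    # Reinterpret an unsigned byte as int8.
    return b if b < 128 else b - 256


def decode_segment0(body: bytes):
    out = [int.from_bytes(body[3:5], "big", signed=True),
           int.from_bytes(body[5:7], "big", signed=True)]
    cur = out[-1]
    i = 7
    while i + 1 < len(body):
        tag, nn = body[i], body[i + 1]
        if tag == 0x10:
            for byte in body[i + 2 : i + 2 + nn // 2]:
                for nib in (byte >> 4, byte & 0xF):   # high nibble first
                    cur += s4(nib)
                    out.append(cur)
            i += 2 + nn // 2
        elif tag == 0x20:
            for byte in body[i + 2 : i + 2 + nn]:
                cur += i8(byte)
                out.append(cur)
            i += 2 + nn
        elif tag == 0x00:
            out.extend([cur] * nn)                    # RLE: value unchanged
            i += 2
        else:
            break                                     # 40 02 or unknown tag
    return out


body = bytes.fromhex(
    "000200" "0005" "0007"   # preamble: Tran[0] = 5, Tran[1] = 7
    "1004" "1f20"            # nibble deltas +1, -1, +2, 0
    "0004"                   # RLE: four copies of the current value
    "2004" "fe030080"        # int8 deltas -2, +3, 0, -128
    "4002"                   # start of a segment header: decoding stops
)
print(decode_segment0(body))
# [5, 7, 8, 7, 9, 9, 9, 9, 9, 9, 7, 10, 10, -118]
```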

────────────────────────────────────────────────────────────────────────────
Segment header (40 02, 20 bytes total)
────────────────────────────────────────────────────────────────────────────

The 18-byte payload of the ``40 02`` block:

| Offset    | Field                                       | Status      |
|-----------|---------------------------------------------|-------------|
| [0:2]     | T_delta at first sample of new segment      | ✅ confirmed|
|           | (int16 BE, in 16-count units)               |             |
| [2:4]     | Likely T_delta at sample seg_start+1        | 🟡 likely   |
| [4:6]     | Unknown (varies; possibly checksum)         | ❓ open     |
| [6:8]     | Byte length to next segment header − 2      | ✅ confirmed|
|           | (uint16 BE; useful for walker pre-scan)     |             |
| [8:12]    | Monotonic uint32 LE counter                 | ✅ confirmed|
|           | (starts ~0x47, increments by 1 per segment) |             |
| [12:14]   | Constant ``02 00``                          | ✅ confirmed|
| [14:18]   | Unknown 4-byte field                        | ❓ open     |
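
Read field by field, the table corresponds to something like the sketch
below.  The function name and dictionary keys are illustrative labels
(they are not the keys returned by :func:`parse_segment_header`), and the
payload bytes are synthetic.

```python
def parse_header_payload(p: bytes):
    # Hypothetical helper: split the 18-byte payload per the table above.
    if len(p) != 18:
        return None
    return {
        "t_delta_first": int.from_bytes(p[0:2], "big", signed=True),   # confirmed
        "t_delta_second": int.from_bytes(p[2:4], "big", signed=True),  # likely
        "unknown_4_6": p[4:6],                                         # open
        "bytes_to_next_minus_2": int.from_bytes(p[6:8], "big"),        # confirmed
        "counter": int.from_bytes(p[8:12], "little"),                  # confirmed
        "const_12_14": p[12:14],                                       # confirmed
        "unknown_14_18": p[14:18],                                     # open
    }


# Synthetic payload, not taken from a real event.
payload = bytes.fromhex("fffd" "0001" "abcd" "0100" "47000000" "0200" "00000000")
fields = parse_header_payload(payload)
print(fields["t_delta_first"], fields["bytes_to_next_minus_2"], fields["counter"])
# -3 256 71
```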

────────────────────────────────────────────────────────────────────────────
What breaks the multi-segment decoder (the main open question)
────────────────────────────────────────────────────────────────────────────

After segment 0 ends and the segment header T_delta is consumed,
applying segment 1's blocks as Tran continuation produces values that
diverge from the BW ASCII export at around sample 512.  The block
structure inside segment 1 is IDENTICAL to segment 0 (same alternating
10 NN / 00 NN pattern), and the delta budget matches the segment size
exactly (V70 segment 1 has 264 nibble-deltas + 244 RLE zeros = 508 = the
segment's sample count).  But the running sum of those deltas does not
reproduce the exported Tran values.
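
The "delta budget" check amounts to summing NN over the sample-bearing
blocks of a segment.  A minimal sketch, assuming blocks are reduced to
hypothetical ``(tag_hi, nn)`` pairs and using a synthetic block list (not
the real V70 segment 1):

```python
def samples_in_segment(blocks):
    # blocks: iterable of (tag_hi, nn) pairs.  Only tags that contribute
    # samples are counted; 30 NN and 40 02 are ignored.
    total = 0
    for tag_hi, nn in blocks:
        if tag_hi in (0x10, 0x20, 0x00):
            total += nn        # NN deltas, or NN repeated values for 00 NN
    return total


blocks = [(0x10, 8), (0x00, 12), (0x10, 4), (0x00, 0xFC), (0x40, 0x02)]
print(samples_in_segment(blocks))   # 8 + 12 + 4 + 252 = 276
```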

The strongest unverified hypothesis is that segments rotate channels:

    segment 0  →  Tran samples 0..509
    segment 1  →  Vert samples 0..507
    segment 2  →  Long samples 0..507
    segment 3  →  Mic  samples 0..507
    segment 4  →  Tran samples 510..N (continuation)
    ...

This is consistent with the segment-1 block deltas summing to nearly zero
in V70 (where all 4 channels are near zero) and with the per-segment delta
budget matching the segment size for a single channel.  It is NOT yet
verified because the per-segment channel anchor has not been located in
the segment header; bytes [4:6] and [14:18] of the header are still open
and probably encode V/L/M anchors.

See ``docs/waveform_codec_re_status.md`` for the current working notes
and the suggested next experiment ("segment-channel scoring analyzer").
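
If the rotation hypothesis were true, regrouping per-segment sample lists
by channel would look roughly like the sketch below.  Everything here is
an assumption: the channel order is the UNVERIFIED one listed above, and
``channel_for_segment`` / ``group_by_channel`` are hypothetical names that
do not exist in this module.

```python
CHANNEL_ORDER = ("Tran", "Vert", "Long", "Mic")   # unverified rotation order


def channel_for_segment(index: int) -> str:
    # Segment 0 -> Tran, 1 -> Vert, 2 -> Long, 3 -> Mic, 4 -> Tran, ...
    return CHANNEL_ORDER[index % len(CHANNEL_ORDER)]


def group_by_channel(segment_samples):
    # segment_samples: list of per-segment sample lists, in stream order.
    grouped = {name: [] for name in CHANNEL_ORDER}
    for index, samples in enumerate(segment_samples):
        grouped[channel_for_segment(index)].extend(samples)
    return grouped


# Synthetic placeholder values, one short list per segment.
demo = group_by_channel([[1, 2], [10], [20], [30], [3, 4]])
print(demo["Tran"])   # [1, 2, 3, 4]
```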
"""

from __future__ import annotations

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class WaveformBlock:
    """One tagged block parsed out of a Blastware waveform-file body."""
    offset: int      # byte offset into body
    tag_hi: int      # first tag byte (0x10 / 0x20 / 0x00 / 0x30 / 0x40)
    tag_lo: int      # second tag byte (NN)
    data: bytes      # block payload (excludes the 2-byte tag)
    length: int      # total block length on the wire (includes the tag)

    @property
    def kind(self) -> str:
        return f"{self.tag_hi:02x} {self.tag_lo:02x}"


def find_data_start(body: bytes) -> int:
    """Auto-detect the offset of the first data block.

    The body starts with a 7-byte preamble (magic ``00 02 00`` + two int16 BE
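    Tran anchors).  After that, the data section starts with a tag, usually
    ``10 NN`` or ``20 NN``, but quiet events may begin with a ``00 NN`` RLE
    marker.  Returns the offset of the first recognized tag, or -1 if none
    is found.  The fallback scan (used only when offset 7 does not hold a
    valid tag) looks for ``10 NN`` / ``20 NN`` in the first 20 bytes.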
    """
    # Try fixed offset 7 first (canonical preamble length).
    if len(body) >= 9:
        b, nn = body[7], body[8]
        if (b in (0x00, 0x10, 0x20, 0x30) and nn % 4 == 0 and 0 < nn <= 0xFC) \
                or (b == 0x40 and nn == 0x02):
            return 7
    # Fall back to scanning the first 20 bytes.
    for i in range(min(20, len(body) - 1)):
        b = body[i]
        nn = body[i + 1]
        if b in (0x10, 0x20) and nn % 4 == 0 and 0 < nn <= 0xFC:
            return i
    return -1


def walk_body(body: bytes, start: Optional[int] = None) -> List[WaveformBlock]:
    """Walk the tagged-block sequence starting at *start* (auto-detected by default).

    Stops when an unrecognized tag is encountered, when a block would run
    past the end of the body, or when the end of the body is reached.
    Returned blocks are in stream order.
    """
    if start is None:
        start = find_data_start(body)
        if start < 0:
            return []

    blocks: List[WaveformBlock] = []
    i = start
    while i + 1 < len(body):
        t0 = body[i]
        t1 = body[i + 1]
        if t0 == 0x10 and t1 % 4 == 0 and 0 < t1 <= 0xFC:
            length = t1 // 2 + 2
        elif t0 == 0x20 and t1 % 4 == 0 and 0 < t1 <= 0xFC:
            length = t1 + 2
        elif t0 == 0x00 and t1 % 4 == 0:
            length = 2
        elif t0 == 0x30 and t1 % 4 == 0 and 0 < t1 <= 0x10:
            # Data-section ``30 NN`` blocks have length NN*2 (= 8 for NN=4,
            # confirmed in M529LL1A.SS0 at body offset 29).  Trailer-section
            # ``30 NN`` blocks have length NN*4 (= 32 for NN=8, confirmed in
            # event-d trailer at body offset 3941).  We pick NN*2 if it lands
            # on a recognized tag, otherwise fall through to NN*4.
            cand2 = t1 * 2
            cand4 = t1 * 4
            if (i + cand2 < len(body) - 1
                    and body[i + cand2] in (0x10, 0x20, 0x00, 0x30, 0x40)):
                length = cand2
            else:
                length = cand4
        elif t0 == 0x40 and t1 == 0x02:
            length = 20
        else:
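            # Unknown tag; stop.  The stop offset is not returned directly;
            # callers can derive it as blocks[-1].offset + blocks[-1].length
            # (or *start* if no block was parsed).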
            break

        if i + length > len(body):
            break

        data = bytes(body[i + 2 : i + length])
        blocks.append(WaveformBlock(offset=i, tag_hi=t0, tag_lo=t1, data=data, length=length))
        i += length

    return blocks


def split_segments(blocks: List[WaveformBlock]) -> List[List[WaveformBlock]]:
    """Group consecutive blocks into segments separated by ``40 02`` headers.

    The first segment is whatever runs before the first ``40 02`` header
    (segment 0: the data blocks that follow the 7-byte body preamble).
    Each subsequent segment starts with its ``40 02`` block, followed by
    its own data blocks up to the next ``40 02``.
    """
    segments: List[List[WaveformBlock]] = []
    current: List[WaveformBlock] = []
    for b in blocks:
        if b.tag_hi == 0x40 and b.tag_lo == 0x02:
            if current:
                segments.append(current)
            current = [b]
        else:
            current.append(b)
    if current:
        segments.append(current)
    return segments


def parse_segment_header(block: WaveformBlock) -> Optional[dict]:
    """Decode the 18-byte payload of a ``40 02`` segment header.

    Returns a dict with the labelled fields, or None if *block* is not
    a ``40 02`` header.
    """
    if not (block.tag_hi == 0x40 and block.tag_lo == 0x02):
        return None
    if len(block.data) < 18:
        return None
    p = block.data
    counter = int.from_bytes(p[8:12], "little", signed=False)
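    # Key names predate the field map in the module docstring; the comments
    # below give the current reading of each slice.
    return {
        # [0:2] T_delta at first sample of the segment (confirmed);
        # [2:4] likely T_delta at the following sample.
        "anchor_bytes": p[0:4],
        # [4:6] unknown; [6:8] uint16 BE byte length to next header, minus 2.
        "field2": p[4:8],
        # uint32 LE; increments by 1 per segment.
        "counter": counter,
        # [12:14] is the confirmed constant 02 00; [14:16] belongs to the
        # still-open [14:18] field (earlier notes recorded 00 01 here).
        "fixed_pattern": p[12:16],
        # Last 2 bytes of the open [14:18] field.
        "tail": p[16:18],
    }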


def _s4(n: int) -> int:
    """Sign-extend a 4-bit value to signed int (0..7 → 0..7; 8..F → -8..-1)."""
    return n if n < 8 else n - 16


def _i8(b: int) -> int:
    """Reinterpret an unsigned byte as signed int8."""
    return b if b < 128 else b - 256


def decode_tran_initial(body: bytes) -> Optional[List[int]]:
    """
    Decode the initial Tran-channel samples — VERIFIED 2026-05-11.

    Returns Tran samples in **16-count units** (LSB = 0.005 in/s at Normal
    range — the same quantization BW uses for its ASCII export).  Returns
    ``None`` if the body cannot be parsed.

    The decoded list extends from sample 0 through the end of segment 0
    (= just before the first ``40 02`` segment header; 482–510 samples
    for the events tested).  Multi-segment decoding requires continuing
    past the segment header.  That is reserved for a future
    ``decode_tran_full``, which is not implemented in this module yet
    because the per-segment rules are still open.

    Codec for segment 0 (CONFIRMED 2026-05-11 against 7 fixture events):

    - Body bytes [0:3] are the magic ``00 02 00``.
    - Body bytes [3:5] = ``Tran[0]`` as int16 BE in 16-count units.
    - Body bytes [5:7] = ``Tran[1]`` as int16 BE in 16-count units.
    - Data blocks (``10 NN`` or ``20 NN``) carry Tran deltas starting
      at sample 2:

      * ``10 NN``: NN nibbles = NN/2 bytes; each nibble is a 4-bit
        signed delta (0..7 → 0..+7; 8..F → -8..-1).  High nibble of
        each byte comes first.
      * ``20 NN``: NN int8 signed deltas (one delta per byte).

    - ``00 NN`` blocks are run-length-encoded zero deltas: append NN
      copies of the current cumulative Tran value (no change).

    - ``30 NN`` blocks have not yet been decoded for content — they
      appear in segment 0 of loud-from-start events (SS0, SV0) and
      seem to signal a transition or special-case interpretation.
      The walker steps over them but their data is ignored.

    The walk stops at the first ``40 02`` segment header.
    """
    if len(body) < 7 or body[0:3] != b"\x00\x02\x00":
        return None
    t0 = int.from_bytes(body[3:5], "big", signed=True)
    t1 = int.from_bytes(body[5:7], "big", signed=True)

    start = find_data_start(body)
    if start < 0:
        return [t0, t1]

    out = [t0, t1]
    cur = t1
    for blk in walk_body(body, start):
        if blk.tag_hi == 0x40:
            # Segment boundary: stop.  Multi-segment decoding is not
            # implemented yet (planned as decode_tran_full).
            break
        if blk.tag_hi == 0x10:
            for byte in blk.data:
                for nib in ((byte >> 4) & 0xF, byte & 0xF):
                    cur += _s4(nib)
                    out.append(cur)
        elif blk.tag_hi == 0x20:
            for byte in blk.data:
                cur += _i8(byte)
                out.append(cur)
        elif blk.tag_hi == 0x00:
            # RLE zero deltas: append NN copies of current Tran value.
            for _ in range(blk.tag_lo):
                out.append(cur)
        # 30 NN: unknown content; skip.
    return out


def decode_waveform_v2(body: bytes) -> Optional[dict]:
    """
    Decode the body into per-channel sample arrays.

    Returns ``None`` because the full multi-channel decoder is not yet
    wired up.  Tran is partially solved — see :func:`decode_tran_initial`
    for the initial portion (verified against ground-truth BW exports).

    Status (2026-05-11):
    - Tran, segment 0: correctly decoded by ``decode_tran_initial`` (the
      first 482–510 samples, depending on event).
    - Subsequent Tran samples + all Vert / Long / Mic samples: open.
      The leading hypothesis is that the segments after the first
      ``40 02`` header rotate through the channels, but this is
      unverified; see the module docstring.  ``30 NN`` content is also
      still open.
    """
    return None