seismo-relay/docs/histogram_codec_re_status.md
serversdown c3c7fe559c docs: histogram body codec RE — starting-point status doc
Captures everything learned in the 2026-05-20 session before scope
forced a pause:

  - Block framing is solved: 32-byte blocks, one per histogram
    interval, signature byte pattern `[22:24]=0x0000` +
    `[28:32]=0x1e 0x0a 0x00 0x00` reliably identifies data blocks.
  - Block count = interval count (791 blocks in N844L20G.630H for
    a TXT-reported 792 intervals).
  - Sample[0] = Tran peak in 0.0005 in/s/count units (verified on
    one event — needs cross-event confirmation).
  - Samples 1-8 → channel/metric mapping is still open.  None of
    the obvious layouts (peak-then-freq alternating, all-peaks-
    then-all-freqs, per-channel 3-tuples) match the TXT values
    across multiple blocks.  Likely needs a higher-activity
    fixture (current N844 corpus is all noise-floor data) to
    disambiguate.
  - `>100 Hz` sentinel encoding in the binary is unknown.
  - 4-byte variable metadata field at block[24:28] needs
    correlation work against TXT columns.

Doc mirrors the structure of docs/waveform_codec_re_status.md so
a future RE session has a familiar entry point.  Includes the
suggested attack plan + the code seam where the eventual decoder
will land (minimateplus/histogram_codec.py).

The §7.6.2 spec in instantel_protocol_reference.md is structurally
correct but doesn't pin down per-sample semantics — this doc
supersedes it where they conflict on confidence level.

No code shipped on this branch.  When the codec is cracked, the
plan is to land minimateplus/histogram_codec.py + wire into
event_file_io.read_blastware_file() + remove the has_samples
short-circuit from scripts/backfill_sidecars.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 21:13:26 +00:00

# Histogram body codec — IN PROGRESS (started 2026-05-20)
Working notes for the Series III histogram-mode event body codec
reverse-engineering effort. Mirrors the structure of
`waveform_codec_re_status.md` (the now-completed waveform codec). The
historical context lives in `docs/instantel_protocol_reference.md
§7.6.2`; this doc is the active scratchpad.
## TL;DR (current state)
**Block framing is solved. Sample-to-channel mapping is open.**
| Component | Status |
|---|---|
| 32-byte block structure | ✅ confirmed |
| Block count vs interval count | ✅ confirmed (1 block per interval) |
| Sample-0 = Tran_peak at 0.0005 in/s/count scale | ✅ confirmed on one block of one event; needs cross-block verification |
| Remaining samples 1-8 → channel mapping | ❌ open |
| Frequency encoding (TXT shows `>100 Hz`, binary shows `1`) | ❌ open |
| Mic dB encoding | ❌ open |
The §7.6.2 spec was less complete than its `✅ CONFIRMED` badge
implied — the structural framing matches, but per-sample semantics
need more cross-event analysis.
## Confirmed structure (2026-05-20)
### Body layout
```
body = [stream of 32-byte blocks]
```
Body length isn't always a multiple of 32: 1-byte and 9-byte trailing
remnants have been observed. A walker should step through the body in
32-byte strides and stop before the trailing partial block.
### 32-byte block header
```
[0]      0x00                     always-zero (probably a fixed format tag)
[1]      segment_id (uint8)       0x00, 0x01, 0x02, 0x03; 256 blocks per segment
[2:4]    block_ctr (uint16 LE)    resets each segment (0x0100, 0x0101, ...)
[4:22]   9× int16 LE samples
[22:24]  0x00 0x00                constant
[24:28]  4-byte variable          unknown; possibly timestamp delta or CRC
[28:30]  0x1e 0x0a                constant signature (`30, 10`)
[30:32]  0x00 0x00                constant
```
Anchor for finding data blocks during a body walk: `block[22:24] ==
b"\x00\x00"` AND `block[28:32] == b"\x1e\x0a\x00\x00"`. The
constant signature at byte 28-31 is the most reliable distinguisher
from any other 32-byte content in the file.
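
The framing above can be sketched as a small block walker. This is a minimal sketch, not the eventual decoder: `HistBlock` and `iter_blocks` are hypothetical names, and it assumes blocks are 32-byte aligned from the start of the body, which the notes above suggest but do not state outright.

```python
import struct
from typing import Iterator, NamedTuple, Tuple

BLOCK_SIZE = 32
SIGNATURE = b"\x1e\x0a\x00\x00"  # block[28:32]


class HistBlock(NamedTuple):
    """Hypothetical container for one parsed 32-byte block."""
    segment_id: int
    block_ctr: int
    samples: Tuple[int, ...]  # 9 signed 16-bit values
    meta: bytes               # the unexplained 4 bytes at [24:28]


def iter_blocks(body: bytes) -> Iterator[HistBlock]:
    """Yield every 32-byte block that carries the data-block anchor."""
    # Ignore the trailing remnant (1 or 9 bytes have been observed).
    usable = len(body) - len(body) % BLOCK_SIZE
    for off in range(0, usable, BLOCK_SIZE):
        block = body[off:off + BLOCK_SIZE]
        # Anchor: zero pad at [22:24] plus the constant signature at [28:32].
        if block[22:24] != b"\x00\x00" or block[28:32] != SIGNATURE:
            continue
        (block_ctr,) = struct.unpack_from("<H", block, 2)
        samples = struct.unpack_from("<9h", block, 4)
        yield HistBlock(block[1], block_ctr, samples, block[24:28])
```

Counting the yielded blocks for a fixture and comparing against the TXT `Number of Intervals` line is a quick sanity check on the anchor.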
### Block count = interval count
Confirmed against `example-events/histogram/N844L20G.630H`:
- TXT reports `Number of Intervals : 792.00`
- Binary contains 791 data blocks (one per interval, off-by-one at
the tail — probably the last interval is truncated mid-write at
recording stop)
Implication: each block represents exactly one histogram interval
(1 minute in this fixture, configurable per device). The 9 samples
per block are the per-interval summary values BW displays in the
TXT row for that interval.
### What sample 0 means
Confirmed: `sample[0] / 2000 = Tran peak amplitude in in/s` for
the Normal-range geophone. Equivalently, sample[0] is in units of
**0.0005 in/s per count** (NOT the 0.005 in/s display quantum or the
1-count ADC quantum).
Verified for block 0 of N844L20G.630H:
- binary sample[0] = 10
- TXT Tran_peak[0] = 0.005 in/s
- check: 10 × 0.0005 = 0.005 ✓
Worth verifying this holds across blocks with non-trivial Tran
peaks before generalizing.
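
A cross-block check of this scale could look like the sketch below. `sample0_to_in_s` and `scale_mismatches` are hypothetical helpers, and the half-count tolerance is an assumption; TXT rounding to the 0.005 in/s display quantum may call for a looser one.

```python
from typing import List, Sequence, Tuple

COUNTS_PER_IN_S = 2000  # i.e. 0.0005 in/s per count


def sample0_to_in_s(sample0: int) -> float:
    """Convert a raw sample[0] value to a Tran peak in in/s."""
    return sample0 / COUNTS_PER_IN_S


def scale_mismatches(
    sample0s: Sequence[int],
    txt_peaks: Sequence[float],
    tol: float = 0.00025,  # assumed: half of one 0.0005 in/s count
) -> List[Tuple[int, int, float]]:
    """Return (block index, raw sample, TXT peak) where the scale fails."""
    return [
        (i, raw, txt)
        for i, (raw, txt) in enumerate(zip(sample0s, txt_peaks))
        if abs(sample0_to_in_s(raw) - txt) > tol
    ]
```

An empty result over all blocks of a fixture would support the hypothesis; any returned rows point at the blocks worth inspecting by hand.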
## Open mappings
### Samples 1-8 → channel + metric
TXT structure is **10 columns per interval**:
```
Tran   Tran   Vert   Vert   Long   Long   Geo    MicL   MicL    MicL
Peak   Freq   Peak   Freq   Peak   Freq   PVS    psi    dB(L)   Freq
in/s   Hz     in/s   Hz     in/s   Hz     in/s   psi    dB      Hz
```
Binary has **9 samples per block** (one short of the column count).
None of the obvious mappings work:
| Hypothesis | Why it fails |
|---|---|
| (T_peak, T_freq, V_peak, V_freq, L_peak, L_freq, Geo, M_peak, M_freq) | Sample[1]=1 doesn't decode to `>100 Hz` under any obvious scale |
| (T_peak, V_peak, L_peak, T_freq, V_freq, L_freq, Geo, M_peak, M_freq) | Scale mismatch: sample[1]=1 would compute to V_peak = 0.0005 in/s at the sample-0 scale, but TXT shows 0.005 in/s for some intervals and 0.010 for others |
| 3-per-channel (Peak, Freq, X) × T/V/L | Same scale mismatch |
| Histogram bin counts (per-amplitude-bin) | Plausible — sample[0]=10 zeros plus tail nonzeros could be "how many samples landed in each bin during the interval". But then sample[0] = T_peak coincidence is suspicious. |
`>100 Hz` is a sentinel BW writes when the measured zero-crossing
frequency exceeds the geophone's measurement range. The binary
encoding of this sentinel is unknown. Common candidates:
- Special value (e.g. 0xFFFF / 0x7FFF / 0)
- A flag bit in the metadata bytes (especially the 4-byte variable
field at [24:28])
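
The special-value candidate is cheap to test first. Because the samples are read as signed 16-bit integers, `0xFFFF` shows up as `-1` and `0x7FFF` as `32767`. The sketch below is a hypothetical helper that flags those values per slot; the candidate list is only the one given above, not a known encoding.

```python
from typing import List, Sequence, Tuple

# Candidate sentinel values as they appear after int16 decoding.
SENTINEL_CANDIDATES = {-1: "0xFFFF", 32767: "0x7FFF", 0: "0x0000"}


def sentinel_hits(samples: Sequence[int]) -> List[Tuple[int, str]]:
    """Return (slot index, candidate name) for each slot holding a candidate."""
    return [
        (slot, SENTINEL_CANDIDATES[value])
        for slot, value in enumerate(samples)
        if value in SENTINEL_CANDIDATES
    ]
```

If intervals that TXT reports as `>100 Hz` never produce a hit in any slot, that would shift attention to the flag-bit candidate.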
### Metadata 4-byte variable field (bytes 24:28)
Examples from the first 8 blocks of N844L20G.630H:
```
block 0: 03 90 2a 00
block 1: 04 f2 84 00
block 2: 03 2b e7 00
block 3: 03 fe 11 00
block 4: 03 f7 91 00
block 5: 03 e9 4e 00
block 6: 03 4c 5c 00
block 7: 03 99 aa 00
```
First byte is mostly `0x03` (blocks 0,2-7) and sometimes `0x04` (block
1). Could be a CRC, timestamp delta, or per-interval status byte.
Worth correlating against TXT columns that vary block-to-block.
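
One way to start on this field is to split it into sub-fields and tally each one. The sketch below uses the eight examples above; the 1 + 2 + 1 byte split in the hypothetical `split_meta` helper is only one guess at the layout, not an established one.

```python
import struct
from collections import Counter

# The [24:28] bytes of the first 8 blocks of N844L20G.630H, as listed above.
FIRST_BLOCKS_META = [
    bytes.fromhex(h)
    for h in ("03902a00", "04f28400", "032be700", "03fe1100",
              "03f79100", "03e94e00", "034c5c00", "0399aa00")
]


def split_meta(meta: bytes) -> dict:
    """Split the 4-byte field under an assumed 1 + 2 + 1 byte layout."""
    (u32_le,) = struct.unpack("<I", meta)
    (mid_u16_le,) = struct.unpack_from("<H", meta, 1)
    return {
        "first": meta[0],
        "mid_u16_le": mid_u16_le,
        "last": meta[3],
        "u32_le": u32_le,
    }


first_bytes = Counter(m[0] for m in FIRST_BLOCKS_META)
last_bytes = Counter(m[3] for m in FIRST_BLOCKS_META)
```

In these eight examples the last byte is `0x00` every time, so the varying part may be only three bytes wide. That is too small a sample to rely on.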
## Fixture corpus
In-repo histogram fixtures (paired binary + ASCII TXT):
```
example-events/histogram/N844L20G.630H (27 KB, 791 blocks, 792 intervals)
example-events/histogram/N844L21H.2R0H (22 KB)
example-events/histogram/N844L22A.VT0H (27 KB)
example-events/histogram/N844L23B.ND0H ...
example-events/histogram/N844L27U.U30H ...
example-events/histogram/N844L28V.NA0H ...
example-events/histogram/N844L6QT.IQ0H ...
example-events/histogram/N844L6RU.BO0H ...
example-events/histogram/N844L6SO.6I0H ...
example-events/histogram/N844L6TP.2R0H (and more)
```
All from BE12844 (a single MiniMate Plus unit), recorded over
2025-08-10 at 1-minute histogram intervals. All "noise floor"
events — mostly silent intervals with rare spikes.
Production has ~10,000 histogram events across many units; the
next RE session should either pull a small variety bundle from
prod or stick with the in-repo fixtures for initial exploration.
## Suggested attack plan for next session
1. **Verify sample[0] = T_peak hypothesis across all 791 blocks
of N844L20G.630H** — confirms the scale factor isn't a coincidence.
2. **Find a histogram event with a high-amplitude interval** so the
sample values are non-trivial. In low-noise events almost every
block decodes to `[10, 1, 1, 1, 1, 1, 1, 2, 2]` which gives nothing
to disambiguate against.
3. **Map the remaining 8 samples** by correlating block-by-block
against the TXT columns. Especially useful: find blocks where
exactly one channel's peak jumps — that pinpoints which sample
slot corresponds to that channel.
4. **Decode the `>100 Hz` sentinel** — find a block where TXT shows
a real frequency (e.g. `73.1 Hz`) and reverse the binary value.
5. **Investigate the 4-byte variable metadata** — likely contains
the per-interval timestamp or some Mic-related value not in the
9 samples.
6. **Wire into `read_blastware_file()`** alongside the waveform
codec (try waveform first; fall back to histogram when the
`00 02 00` preamble is missing).
7. **Update `scripts/backfill_sidecars.py`** to remove the
`has_samples` short-circuit so histogram `.h5` files regenerate
too.
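
Step 3 can be sketched as a voting pass over consecutive intervals. `isolated_change_votes` is a hypothetical helper: it finds transitions where exactly one TXT column changes and records which binary sample slots changed in the same transition. It assumes block `i` lines up with TXT row `i`, and the caller may want to pass only the peak columns, since a peak jump will usually move the PVS column as well.

```python
from collections import Counter, defaultdict
from typing import Dict, Sequence


def isolated_change_votes(
    samples: Sequence[Sequence[int]],
    txt_rows: Sequence[Sequence[str]],
) -> Dict[int, Counter]:
    """Map TXT column index -> Counter of sample slots that changed with it."""
    votes: Dict[int, Counter] = defaultdict(Counter)
    for i in range(1, min(len(samples), len(txt_rows))):
        changed_cols = [
            col for col, (prev, cur) in enumerate(zip(txt_rows[i - 1], txt_rows[i]))
            if prev != cur
        ]
        if len(changed_cols) != 1:
            continue  # only unambiguous transitions get a vote
        for slot, (prev, cur) in enumerate(zip(samples[i - 1], samples[i])):
            if prev != cur:
                votes[changed_cols[0]][slot] += 1
    return dict(votes)
```

A column whose votes concentrate on a single slot is a strong mapping candidate; votes spread across several slots suggest the value is derived or packed.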
## Code seam for the eventual decoder
`minimateplus/histogram_codec.py` (to-be-created) should mirror
`minimateplus/waveform_codec.py`:
```python
from typing import Optional


def decode_histogram_body(body: bytes) -> Optional[dict]:
    """Decode a histogram-mode body into per-channel sample arrays.

    Returns ``{"Tran": [...], "Vert": [...], "Long": [...], "MicL": [...]}``
    with each channel's per-interval peak values in ADC counts.
    Returns ``None`` if the body cannot be parsed.
    """
```
Then in `event_file_io.read_blastware_file()`:
```python
decoded = decode_waveform_v2(body)
if decoded is None:
    decoded = decode_histogram_body(body)
if decoded is None:
    log.warning(...)
    samples = {"Tran": [], ...}
else:
    samples = decoded_to_adc_counts(decoded)
```
## Related work
- Waveform body codec — `docs/waveform_codec_re_status.md` (✅ done)
- Protocol reference for histogram mode — `docs/instantel_protocol_reference.md §7.6.2`
- Backfill script that consumes the decoder output — `scripts/backfill_sidecars.py`