4 Commits

Author SHA1 Message Date
serversdown c3c7fe559c docs: histogram body codec RE — starting-point status doc
Captures everything learned in the 2026-05-20 session before scope
forced a pause:

  - Block framing is solved: 32-byte blocks, one per histogram
    interval, signature byte pattern `[22:24]=0x0000` +
    `[28:32]=0x1e 0x0a 0x00 0x00` reliably identifies data blocks.
  - Block count = interval count (791 blocks in N844L20G.630H for
    a TXT-reported 792 intervals).
  - Sample[0] = Tran peak in 0.0005 in/s/count units (verified on
    one event — needs cross-event confirmation).
  - Samples 1-8 → channel/metric mapping is still open.  None of
    the obvious layouts (peak-then-freq alternating, all-peaks-
    then-all-freqs, per-channel 3-tuples) match the TXT values
    across multiple blocks.  Likely needs a higher-activity
    fixture (current N844 corpus is all noise-floor data) to
    disambiguate.
  - `>100 Hz` sentinel encoding in the binary is unknown.
  - 4-byte variable metadata field at block[24:28] needs
    correlation work against TXT columns.

Doc mirrors the structure of docs/waveform_codec_re_status.md so
a future RE session has a familiar entry point.  Includes the
suggested attack plan + the code seam where the eventual decoder
will land (minimateplus/histogram_codec.py).

The §7.6.2 spec in instantel_protocol_reference.md is structurally
correct but doesn't pin down per-sample semantics — this doc
supersedes it where they conflict on confidence level.

No code shipped on this branch.  When the codec is cracked, the
plan is to land minimateplus/histogram_codec.py + wire into
event_file_io.read_blastware_file() + remove the has_samples
short-circuit from scripts/backfill_sidecars.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 21:13:26 +00:00
serversdown fa9d3cdef2 read_blastware_file: leave peak_values=None when samples can't be decoded
Fixes a data-loss bug discovered while dry-running the backfill against
the prod store.

Symptom: every histogram event in the store has its body decoded by
read_blastware_file → codec returns None → samples = empty dict →
``ev.peak_values = _peaks_from_samples(empty)`` returns
``PeakValues(0, 0, 0, 0, 0)`` (NOT None).  The backfill script's
existing "seed from DB row when peak_values is None" branch then
correctly *skips* the seeding, and the all-zeros PeakValues flows into
``db.insert_events()``'s UPSERT path, OVERWRITING the existing good DB
peak values for that event (which were populated from the paired BW
ASCII report at ingest).

Net effect: running the backfill on prod would have wiped the PPV /
mic / vector-sum columns for ~10,000 histogram events.

Fix: only compute peaks-from-samples when there are actually samples.
For events the codec couldn't decode (histogram-mode bodies, until
the §7.6.2 histogram codec is wired in), leave peak_values=None as
the "we don't know" signal.  Downstream consumers:

  - backfill_sidecars.py — its existing ``if ev.peak_values is None:``
    branch (line 243) seeds from the DB row, preserving the real
    BW-report peaks across the regen.
  - WaveformStore.save_imported_bw — apply_report_to_event overlays
    peaks from the paired BW ASCII report when one was uploaded.
    Histogram imports without a paired report end up with NULL peaks
    in the DB, which is correct (better than zeros — clearly says
    "no peak data available" rather than "peaks are exactly zero").

Updated the existing synthetic-event round-trip test to expect
peak_values=None for the no-real-body case, which is the truth now.

The 7 fixture-corpus regression tests for real BW waveforms continue
to pass — those have decodable samples, so peak_values is still
populated from the codec output as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 20:30:53 +00:00
serversdown c4648c1959 scripts/backfill_sidecars: skip .h5 write when decoder returned no samples
Discovered while dry-running the backfill on the prod store: ~10,000
of ~10,059 events are histogram-mode (filename extension `*H`), and
the waveform-body codec wired in via the previous commit doesn't
handle histogram-mode bodies — only the waveform-mode codec at
§7.6.1 is implemented; the histogram-mode codec at §7.6.2 of the
protocol reference is documented but no Python implementation
exists yet.

Without this guard, every histogram event's .h5 file would be
*replaced* with an empty one — strictly worse than today's
broken-int16-LE .h5 because any downstream viewer expecting
non-empty sample arrays would now error out instead of just
rendering wrong values.

Fix: after the decoder runs, check whether any channel has samples.
If not, skip the .h5 write entirely.  The sidecar still regenerates
(refreshing the tool_version stamp and any peaks/project info from
the DB row), but the existing .h5 is left untouched.

This is a *temporary* gate.  When the histogram codec lands (next
branch: `feat/wire-histogram-codec`), the has_samples check can be
removed and the backfill will then correctly regenerate all .h5
files, histogram and waveform alike.

Observed effect (dry-run on prod store, 10,059 events):
  - waveform events (~5%): "[DRY ] would write … + .h5 (would (re)write)"
  - histogram events (~95%): "[DRY ] would write … + .h5 (skipped-empty-samples)"
  - sidecar tool_version bump succeeds for both

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 20:16:31 +00:00
serversdown 0e89125495 docker: fix dockerfile to include scripts and micromate folders
2026-05-20 19:58:54 +00:00
5 changed files with 259 additions and 8 deletions
+2
@@ -8,8 +8,10 @@ RUN apt-get update && \
 COPY pyproject.toml requirements.txt ./
 COPY minimateplus ./minimateplus
+COPY micromate ./micromate
 COPY sfm ./sfm
 COPY bridges ./bridges
+COPY scripts ./scripts
 RUN pip install --no-cache-dir -e .
+212
@@ -0,0 +1,212 @@
# Histogram body codec — IN PROGRESS (started 2026-05-20)
Working notes for the Series III histogram-mode event body codec
reverse-engineering effort. Mirrors the structure of
`waveform_codec_re_status.md` (the now-completed waveform codec). The
historical context lives in `docs/instantel_protocol_reference.md
§7.6.2`; this doc is the active scratchpad.
## TL;DR (current state)
**Block framing is solved. Sample-to-channel mapping is open.**
| Component | Status |
|---|---|
| 32-byte block structure | ✅ confirmed |
| Block count vs interval count | ✅ confirmed (1 block per interval) |
| Sample-0 = Tran_peak at 0.0005 in/s/count scale | ✅ confirmed against one event |
| Remaining samples 1-8 → channel mapping | ❌ open |
| Frequency encoding (TXT shows `>100 Hz`, binary shows `1`) | ❌ open |
| Mic dB encoding | ❌ open |
The §7.6.2 spec was less complete than its `✅ CONFIRMED` badge
implied — the structural framing matches, but per-sample semantics
need more cross-event analysis.
## Confirmed structure (2026-05-20)
### Body layout
```
body = [stream of 32-byte blocks]
```
Body length isn't always a multiple of 32: 1-byte and 9-byte
trailing remnants have been observed. The walker should iterate in
32-byte strides and stop before the tail.
### 32-byte block header
```
[0]      0x00                    always-zero (probably a fixed format tag)
[1]      segment_id (uint8)      0x00, 0x01, 0x02, 0x03 — 256 blocks per segment
[2:4]    block_ctr (uint16 LE)   resets each segment (0x0100, 0x0101, ...)
[4:22]   9× int16 LE samples
[22:24]  0x00 0x00               constant
[24:28]  4-byte variable         unknown — possibly timestamp delta or CRC
[28:30]  0x1e 0x0a               constant signature (`30, 10`)
[30:32]  0x00 0x00               constant
```
Anchor for finding data blocks during a body walk: `block[22:24] ==
b"\x00\x00"` AND `block[28:32] == b"\x1e\x0a\x00\x00"`. The
constant signature at bytes 28-31 is the most reliable way to
distinguish data blocks from any other 32-byte content in the file.
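A minimal sketch of such a body walk, assuming the blocks start at offset 0 of the body and sit on a 32-byte stride (not yet confirmed for every fixture). `HistBlock` and `iter_histogram_blocks` are hypothetical names used for illustration, not existing code in `minimateplus`:

```python
import struct
from typing import Iterator, NamedTuple, Tuple

BLOCK_SIZE = 32
SIGNATURE = b"\x1e\x0a\x00\x00"  # constant bytes [28:32] of a data block


class HistBlock(NamedTuple):
    segment_id: int            # byte [1]
    block_ctr: int             # bytes [2:4], uint16 LE
    samples: Tuple[int, ...]   # bytes [4:22], 9x int16 LE
    meta: bytes                # bytes [24:28], meaning still unknown


def iter_histogram_blocks(body: bytes) -> Iterator[HistBlock]:
    """Yield every 32-byte block that matches the signature anchor."""
    # The range stops before any trailing remnant shorter than one block.
    for off in range(0, len(body) - BLOCK_SIZE + 1, BLOCK_SIZE):
        block = body[off:off + BLOCK_SIZE]
        if block[22:24] != b"\x00\x00" or block[28:32] != SIGNATURE:
            continue  # not a data block
        (block_ctr,) = struct.unpack_from("<H", block, 2)
        samples = struct.unpack_from("<9h", block, 4)
        yield HistBlock(block[1], block_ctr, samples, bytes(block[24:28]))
```

Non-matching blocks are skipped rather than treated as errors, so the same walker can be pointed at a body that contains other 32-byte content.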
### Block count = interval count
Confirmed against `example-events/histogram/N844L20G.630H`:
- TXT reports `Number of Intervals : 792.00`
- Binary contains 791 data blocks (one per interval, off-by-one at
the tail — probably the last interval is truncated mid-write at
recording stop)
Implication: each block represents exactly one histogram interval
(1 minute in this fixture, configurable per device). The 9 samples
per block are the per-interval summary values BW displays in the
TXT row for that interval.
### What sample 0 means
Confirmed: `sample[0] / 2000 = Tran peak amplitude in in/s` for
the Normal-range geophone. Equivalently, sample[0] is in units of
**0.0005 in/s per count** (NOT the 0.005 in/s display quantum or the
1-count ADC quantum).
Verified for block 0 of N844L20G.630H:
- binary sample[0] = 10
- TXT Tran_peak[0] = 0.005 in/s
- check: 10 × 0.0005 = 0.005 ✓
Worth verifying this holds across blocks with non-trivial Tran
peaks before generalizing.
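As a worked check of the scale factor, here is a small sketch of the conversion. The helper name `tran_peak_in_s` is hypothetical, and the factor itself has only been verified on one block:

```python
TRAN_COUNTS_PER_IN_S = 2000  # equivalent to 0.0005 in/s per count


def tran_peak_in_s(sample0: int) -> float:
    """Convert sample[0] (raw counts) to a Tran peak in in/s."""
    return sample0 / TRAN_COUNTS_PER_IN_S


# Block 0 of N844L20G.630H: binary sample[0] = 10, TXT Tran peak = 0.005 in/s
print(tran_peak_in_s(10))  # 0.005
```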
## Open mappings
### Samples 1-8 → channel + metric
TXT structure is **10 columns per interval**:
```
Tran  Tran  Vert  Vert  Long  Long  Geo   MicL  MicL   MicL
Peak  Freq  Peak  Freq  Peak  Freq  PVS   psi   dB(L)  Freq
in/s  Hz    in/s  Hz    in/s  Hz    in/s  psi   dB     Hz
```
Binary has **9 samples per block** (one short of the column count).
None of the obvious mappings work:
| Hypothesis | Why it fails |
|---|---|
| (T_peak, T_freq, V_peak, V_freq, L_peak, L_freq, Geo, M_peak, M_freq) | Sample[1]=1 doesn't decode to `>100 Hz` under any obvious scale |
| (T_peak, V_peak, L_peak, T_freq, V_freq, L_freq, Geo, M_peak, M_freq) | Sample[1]=1 would decode to V_peak = 0.0005 in/s at the sample-0 scale, but TXT shows 0.005 for some intervals and 0.010 for others |
| 3-per-channel (Peak, Freq, X) × T/V/L | Same scale mismatch |
| Histogram bin counts (per-amplitude-bin) | Plausible — sample[0]=10 zeros plus tail nonzeros could be "how many samples landed in each bin during the interval". But then sample[0] = T_peak coincidence is suspicious. |
`>100 Hz` is a sentinel BW writes when the measured zero-crossing
frequency exceeds the geophone's measurement range. The binary
encoding of this sentinel is unknown. Common candidates:
- Special value (e.g. 0xFFFF / 0x7FFF / 0)
- A flag bit in the metadata bytes (especially the 4-byte variable
field at [24:28])
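One way to narrow down the special-value candidate is to tally which raw values each sample slot takes. The sketch below is only a survey aid: `slot_value_counts` is a hypothetical helper, and it assumes the 9 samples of each block are already available as tuples.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def slot_value_counts(blocks: Iterable[Sequence[int]],
                      n_slots: int = 9) -> List[Counter]:
    """Count how often each raw value appears in each sample slot."""
    counts: List[Counter] = [Counter() for _ in range(n_slots)]
    for samples in blocks:
        for slot, value in enumerate(samples[:n_slots]):
            counts[slot][value] += 1
    return counts
```

Running it once over the intervals where TXT prints `>100 Hz` and once over the rest would show whether any slot holds a single value in the first group that never appears in the second.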
### Metadata 4-byte variable field (bytes 24:28)
Examples from the first 8 blocks of N844L20G.630H:
```
block 0: 03 90 2a 00
block 1: 04 f2 84 00
block 2: 03 2b e7 00
block 3: 03 fe 11 00
block 4: 03 f7 91 00
block 5: 03 e9 4e 00
block 6: 03 4c 5c 00
block 7: 03 99 aa 00
```
First byte is mostly `0x03` (blocks 0, 2-7) and sometimes `0x04` (block
1); the last byte is `0x00` in all eight examples. Could be a CRC,
timestamp delta, or per-interval status byte. Worth correlating
against TXT columns that vary block-to-block.
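To make that correlation easier, the sketch below prints a few candidate readings of the field for the eight example blocks. The splits chosen here (whole-field uint32, lead byte plus a 16-bit middle) are guesses for exploration, not known structure, and `candidate_views` is a hypothetical helper:

```python
import struct

# The eight example values listed above (bytes [24:28] of blocks 0-7).
META_EXAMPLES = [
    "03 90 2a 00", "04 f2 84 00", "03 2b e7 00", "03 fe 11 00",
    "03 f7 91 00", "03 e9 4e 00", "03 4c 5c 00", "03 99 aa 00",
]


def candidate_views(meta: bytes) -> dict:
    """Return several speculative interpretations of the 4-byte field."""
    return {
        "u32_le": struct.unpack("<I", meta)[0],
        "lead_byte": meta[0],
        "mid_u16_le": struct.unpack_from("<H", meta, 1)[0],
        "mid_u16_be": struct.unpack_from(">H", meta, 1)[0],
        "last_byte": meta[3],
    }


for idx, text in enumerate(META_EXAMPLES):
    print(idx, candidate_views(bytes.fromhex(text)))
```

Lining these columns up next to the TXT rows should show quickly whether any reading tracks a per-interval value.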
## Fixture corpus
In-repo histogram fixtures (paired binary + ASCII TXT):
```
example-events/histogram/N844L20G.630H (27 KB, 791 blocks, 792 intervals)
example-events/histogram/N844L21H.2R0H (22 KB)
example-events/histogram/N844L22A.VT0H (27 KB)
example-events/histogram/N844L23B.ND0H ...
example-events/histogram/N844L27U.U30H ...
example-events/histogram/N844L28V.NA0H ...
example-events/histogram/N844L6QT.IQ0H ...
example-events/histogram/N844L6RU.BO0H ...
example-events/histogram/N844L6SO.6I0H ...
example-events/histogram/N844L6TP.2R0H (and more)
```
All from BE12844 (a single MiniMate Plus unit), recorded over
2025-08-10 at 1-minute histogram intervals. All "noise floor"
events — mostly silent intervals with rare spikes.
Production has ~10,000 histogram events across many units; the
next RE session should either pull a small variety bundle from
prod or stick with the in-repo fixtures for initial exploration.
## Suggested attack plan for next session
1. **Verify sample[0] = T_peak hypothesis across all 791 blocks
of N844L20G.630H** — confirms the scale factor isn't a coincidence.
2. **Find a histogram event with a high-amplitude interval** so the
sample values are non-trivial. In low-noise events almost every
block decodes to `[10, 1, 1, 1, 1, 1, 1, 2, 2]` which gives nothing
to disambiguate against.
3. **Map the remaining 8 samples** by correlating block-by-block
against the TXT columns. Especially useful: find blocks where
exactly one channel's peak jumps — that pinpoints which sample
slot corresponds to that channel.
4. **Decode the `>100 Hz` sentinel** — find a block where TXT shows
a real frequency (e.g. `73.1 Hz`) and reverse the binary value.
5. **Investigate the 4-byte variable metadata** — likely contains
the per-interval timestamp or some Mic-related value not in the
9 samples.
6. **Wire into `read_blastware_file()`** alongside the waveform
codec (try waveform first; fall back to histogram when the
`00 02 00` preamble is missing).
7. **Update `scripts/backfill_sidecars.py`** to remove the
`has_samples` short-circuit so histogram `.h5` files regenerate
too.
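Step 1 of the plan can be sketched as a block-by-block comparison. This assumes the TXT Tran peaks have already been parsed into a list of floats, and that half the 0.005 in/s display quantum is a reasonable tolerance; both are assumptions, and `find_scale_mismatches` is a hypothetical helper:

```python
from typing import List, Sequence, Tuple


def find_scale_mismatches(
    sample0s: Sequence[int],
    txt_peaks: Sequence[float],
    counts_per_in_s: int = 2000,   # 0.0005 in/s per count
    tol: float = 0.0025,           # assumed: half the TXT display quantum
) -> List[Tuple[int, int, float]]:
    """Return (index, raw sample[0], TXT peak) for every disagreeing block."""
    mismatches = []
    # zip() stops at the shorter input, so 791 blocks vs 792 TXT rows is fine.
    for idx, (raw, expected) in enumerate(zip(sample0s, txt_peaks)):
        if abs(raw / counts_per_in_s - expected) > tol:
            mismatches.append((idx, raw, expected))
    return mismatches
```

An empty result across all blocks of a fixture would support the scale factor. A cluster of mismatches at higher amplitudes would suggest a different scale or a different sample slot.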
## Code seam for the eventual decoder
`minimateplus/histogram_codec.py` (to-be-created) should mirror
`minimateplus/waveform_codec.py`:
```python
from typing import Optional


def decode_histogram_body(body: bytes) -> Optional[dict]:
    """Decode a histogram-mode body into per-channel sample arrays.

    Returns ``{"Tran": [...], "Vert": [...], "Long": [...], "MicL": [...]}``
    with each channel's per-interval peak values in ADC counts.
    Returns ``None`` if the body cannot be parsed.
    """
```
Then in `event_file_io.read_blastware_file()`:
```python
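# Try the waveform codec first, then fall back to the histogram codec.
decoded = decode_waveform_v2(body)
if decoded is None:
    decoded = decode_histogram_body(body)
if decoded is None:
    log.warning(...)
    # Neither codec could parse the body: keep every channel empty.
    samples = {ch: [] for ch in ("Tran", "Vert", "Long", "MicL")}
else:
    samples = decoded_to_adc_counts(decoded)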
```
## Related work
- Waveform body codec — `docs/waveform_codec_re_status.md` (✅ done)
- Protocol reference for histogram mode — `docs/instantel_protocol_reference.md §7.6.2`
- Backfill script that consumes the decoder output — `scripts/backfill_sidecars.py`
+12 -1
@@ -811,7 +811,18 @@ def read_blastware_file(path: Union[str, Path]) -> Event:
         project=project, client=client, operator=user, sensor_location=seisloc,
     )
     ev.raw_samples = samples
-    ev.peak_values = _peaks_from_samples(samples)
+    # Only compute peaks from samples when we actually have samples.
+    # For events the codec couldn't decode (histogram-mode bodies, until
+    # the §7.6.2 histogram codec is wired in), samples is an empty dict
+    # and ``_peaks_from_samples`` would return PeakValues(0, 0, 0, 0, 0).
+    # That would then OVERWRITE existing good DB peak values (e.g. from
+    # paired BW ASCII reports) during the backfill UPSERT path.
+    # Leaving peak_values=None signals "we don't know" to downstream
+    # consumers; the backfill script seeds from the DB row when it sees
+    # None, and ``apply_report_to_event`` overlays from a paired ASCII
+    # report when one is supplied.
+    has_samples = any(samples.get(ch) for ch in ("Tran", "Vert", "Long", "MicL"))
+    ev.peak_values = _peaks_from_samples(samples) if has_samples else None
     ev._a5_frames = None  # not recoverable from BW file
     return ev
+22 -2
@@ -311,12 +311,32 @@ def main(argv=None) -> int:
 # int16-LE codec era — bumping TOOL_VERSION to 0.20.0+
 # marks every pre-codec sidecar stale, which now
 # correctly cascades to .h5 regeneration too.
+#
+# Skip the .h5 write when the decoder couldn't produce
+# samples — this is the histogram-mode case today
+# (waveform_codec.decode_waveform_v2 only handles the
+# waveform-mode body format per §7.6.1; the histogram
+# codec at §7.6.2 is documented but not yet implemented).
+# Without this check we'd replace the existing (broken
+# int16-LE) histogram .h5 with an empty one, which is
+# arguably worse for any consumer expecting non-empty
+# sample arrays. When the histogram codec lands, this
+# check can come out.
+has_samples = bool(
+    ev.raw_samples and any(
+        ev.raw_samples.get(ch) for ch in ("Tran", "Vert", "Long", "MicL")
+    )
+)
 hdf5_path = store.hdf5_path_for(serial, path.name)
 hdf5_filename = hdf5_path.name if hdf5_path.exists() else None
 hdf5_action = "kept"
-need_h5 = not args.skip_hdf5 and (
-    args.force or not hdf5_path.exists() or sidecar_stale
+need_h5 = (
+    not args.skip_hdf5
+    and (args.force or not hdf5_path.exists() or sidecar_stale)
+    and has_samples
 )
+if not has_samples and not args.skip_hdf5:
+    hdf5_action = "skipped-empty-samples"
 if need_h5:
     if args.dry_run:
         hdf5_action = "would (re)write"
+9 -3
@@ -289,9 +289,15 @@ def test_read_blastware_file_round_trip(tmp_path: Path):
     assert parsed.timestamp.second == ev.timestamp.second
     # No A5 source recoverable.
     assert parsed._a5_frames is None
-    # Peaks computed from samples (synthetic = zero samples → zero peaks).
-    assert parsed.peak_values is not None
-    assert parsed.peak_values.peak_vector_sum == 0.0
+    # The synthetic event has no real waveform body, so the codec can't
+    # decode samples → read_blastware_file leaves peak_values=None
+    # (the "we don't know" signal) rather than fabricating all-zero
+    # peaks that would otherwise overwrite real DB values via UPSERT.
+    assert parsed.peak_values is None
+    assert parsed.raw_samples is not None
+    # Empty channels — codec returned None for the malformed synthetic body.
+    for ch in ("Tran", "Vert", "Long", "MicL"):
+        assert parsed.raw_samples[ch] == []
 _BW_CODEC_FIXTURES = [