feat(import): v0.16.0 - Fully implemented series 3 BW-ACH pipeline, stabilized. #19

Merged
serversdown merged 9 commits from ach-report-ingestion into main 2026-05-11 15:55:24 -04:00

This branch has a fully working BW-ACH to SFM ingestion pipeline. The series3-watcher pushes events and their corresponding ASCII files to SFM. SFM then writes them to the SQLite DB and generates all the sidecar files needed for event standardization.

serversdown added 9 commits 2026-05-11 15:55:10 -04:00

Blastware's ACH writes a per-event ASCII report (.TXT) alongside each
event binary, containing the rich derived per-channel fields BW
computes (PPV, ZC Freq, Time of Peak, Peak Acceleration, Peak
Displacement, Peak Vector Sum + time, sensor self-check Pass/Fail,
monitor-log timestamps).  None of this lives in the BW binary itself.

When the watcher daemon forwards both files to /db/import/blastware_file
in one multipart POST, we now:

  - Pair binaries with their .TXT partners by filename match
  - Parse the report into a structured BwAsciiReport
  - Land the rich fields in a new top-level `bw_report` block of the
    sidecar JSON
  - Overlay the report's peaks/project_info/timestamp/sample_rate/
    record_time/total_samples/pretrig_samples onto the canonical
    sidecar fields (the report values are device-authoritative; the
    BW-binary STRT-derived values had bugs like reading the 0x46
    record-type marker as rectime)
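
A minimal sketch of the overlay step, assuming the sidecar and the parsed report are plain dicts. The function name `overlay_report`, the `OVERLAY_FIELDS` tuple and the dict shapes are illustrative only and are not taken from the SFM code:

```python
# Hypothetical sketch: field names follow the list above, dict shapes are assumed.
OVERLAY_FIELDS = (
    "peaks", "project_info", "timestamp", "sample_rate",
    "record_time", "total_samples", "pretrig_samples",
)


def overlay_report(sidecar: dict, report: dict) -> dict:
    """Return a copy of the sidecar with the report's values layered on top."""
    merged = dict(sidecar)
    merged["bw_report"] = dict(report)  # keep the full report in its own block
    for field in OVERLAY_FIELDS:
        value = report.get(field)
        if value is not None:  # only override when the report carries the field
            merged[field] = value
    return merged
```

Fields the report does not carry keep their binary-derived value, so an event imported without a report still produces a complete sidecar.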

This unblocks the monthly-summary review workflow — events become
sortable/filterable by peak, location, project, etc. — without
depending on the still-undecoded waveform body codec.

Blastware writes the operator-supplied fields with different label
spellings across firmware versions and recording modes — most
notably "Seis. Location" on histogram exports vs "Seis Loc:" on
waveform exports.  Previous parser only matched the latter, so
every histogram event silently lost its sensor_location field.

Replace the four hardcoded `key.rstrip(":") == "X"` branches with
a single `_OPERATOR_LABEL_MAP` dispatch table keyed by normalised
label (lowercase, trailing colon/period stripped, internal
whitespace collapsed).  Adds these variants on day 1:

  project:         "Project:" / "Project"
  client:          "Client:"  / "Client"
  operator:        "User Name:" / "User Name"
  sensor_location: "Seis Loc:" / "Seis. Location" / "Seis Location"
                 / "Sensor Location" / "Seis Loc"

To absorb future BW label drift, add a one-line dict entry — no
new elif branch.

14 new tests cover:
  - Each label variant routes to the correct field (parametrised)
  - Case-insensitive matching ("seis loc" / "SEIS LOC" / "SeIs LoC")
  - Whitespace-collapse ("Seis  Loc" with double-space)
  - End-to-end parse of a real histogram fixture from
    example-events/histogram/ — sensor_location ('Loc #1 - 2652 Hepner...')
    populates correctly even though the file uses "Seis. Location"

Total bw_ascii_report tests: 19 → 33.  Full SFM suite still green
(69 passed, 44 skipped — pre-existing skips for h5py-dep tests).

Pairs with series3-watcher v1.5.0 (which fixes the filename pairing
so histograms actually reach this parser in the first place).

The four operator-supplied note fields in BW's Compliance Setup →
Notes tab (Project / Client / User Name / Seis Loc) have
USER-EDITABLE LABELS — an operator can rename them in BW's UI to
"Building:", "Site Address:", "Inspector:", or anything else, and
the ASCII export writes those literal labels verbatim.  The
previous label-normalisation map approach (just added in commit
6a7e8c6) was fragile: it could only match label spellings we'd
enumerated in advance.  An operator using "Site:" instead of
"Seis Loc:" would have their sensor location silently dropped.

What IS reliable: BW always writes the 4 user-notes lines
contiguously, in the same order, between the "Units :" line and
the "Geo Range :" line of the export.  So parse them by POSITION:

  position 1 → project
  position 2 → client
  position 3 → operator
  position 4 → sensor_location

The original labels BW wrote are preserved in a new
`BwAsciiReport.user_note_labels` dict (canonical slot → literal
label string) so terra-view can render them as the operator named
them.

Removes the `_OPERATOR_LABEL_MAP` / `_normalise_label_for_lookup`
helpers and the elif-by-normalised-label branch in `parse_report`.
Replaces with a small state machine that flips on the "Units" line
and flips off on the "Geo Range" line.
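
The positional state machine could look roughly like the sketch below. It assumes every export line has the shape `label : value`; the function name `parse_user_notes` and the `SLOTS` tuple are hypothetical, and the real `parse_report` may differ in detail (for example, this sketch stores the label without its colon):

```python
SLOTS = ("project", "client", "operator", "sensor_location")


def parse_user_notes(lines):
    """Assign the lines between 'Units' and 'Geo Range' to slots by position."""
    values = dict.fromkeys(SLOTS)  # slot -> value, None when the line is missing
    labels = {}                    # slot -> label text as the operator named it
    in_block = False
    position = 0
    for raw in lines:
        label, sep, value = raw.partition(":")
        key = label.strip().lower()
        if key == "units":         # flip on: user notes start on the next line
            in_block, position = True, 0
            continue
        if key == "geo range":     # flip off: user notes are over
            in_block = False
            continue
        if not in_block or not sep or position >= len(SLOTS):
            continue               # outside the block, malformed, or a 5th line
        slot = SLOTS[position]
        values[slot] = value.strip()
        labels[slot] = label.strip()
        position += 1
    return values, labels
```

Because the slot is chosen by position, a label such as "Site:" in the fourth line still populates `sensor_location`.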

Tests:
  - Default-label fixtures (waveform + histogram) still populate
    correctly, with operator's labels captured.
  - Synthetic custom-labelled exports ("Building:" / "Site Address:" /
    etc.) populate the right slots by position.
  - Histogram-specific "Seis. Location:" works.
  - Lines outside the Units→Geo Range range are ignored even if
    they look like user notes (defensive against malformed exports).
  - Partial blocks (fewer than 4 lines) leave later slots None.
  - Extra lines beyond 4 are dropped (5th slot doesn't exist).

26 tests in test_bw_ascii_report.py (was 33; net drop reflects
parametrised label tests collapsed into 6 focused position tests).
Full SFM suite: 62 passed, 44 skipped.

Pairs with series3-watcher v1.5.0 which fixes the filename pairing
so the report reaches this parser in the first place.

The /db/import/blastware_file endpoint was bucketing every
forwarded event into serial='UNKNOWN' in the DB.  WaveformStore
correctly decoded the serial from the BW filename and saved
files to <store>/<serial>/<filename> (e.g.
.../BE17353/S353L5KC.DR0H.h5), but the endpoint code called
db.insert_events(serial=_serial_from_event(ev)) — and
_serial_from_event was a stub that always returned None,
falling back to "UNKNOWN".

Effect on the user's prod server: 3,039 events forwarded across
24 distinct units, ALL inserted under serial='UNKNOWN'.  The
on-disk waveform store + sidecars + HDF5s were fine, but the
SFM webapp's /db/units only showed the two original manually-
uploaded serials because every forwarded row had its serial
column zeroed to UNKNOWN.

Fix:
  - WaveformStore.save_imported_bw() now surfaces the decoded
    serial on the returned `rec` dict (rec["serial"]).
  - The import endpoint uses rec["serial"] as the authoritative
    fallback when the operator hasn't supplied a serial_hint query
    parameter.  Order of precedence:
      query string `serial` → rec["serial"] → _serial_from_event(ev) → "UNKNOWN"
  - Response payload now includes `serial` per file so the watcher
    log lines (or any future caller) can see which unit each event
    was attributed to.
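
The precedence chain above can be sketched as a small helper. `resolve_serial` is a hypothetical name, and the sketch assumes `rec` is a dict and that a missing or empty value should fall through to the next source:

```python
def resolve_serial(query_serial, rec, event_serial):
    """Return the first non-empty serial in precedence order, else 'UNKNOWN'."""
    for candidate in (query_serial, rec.get("serial"), event_serial):
        if candidate:
            return candidate
    return "UNKNOWN"
```

For example, `resolve_serial(None, {"serial": "BE17353"}, None)` yields the serial decoded from the filename, while an explicit query-string value would still win over it.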

Recovery for existing DB rows:
  scripts/repair_unknown_serials.py walks the events table looking
  for rows with serial='UNKNOWN' and re-attributes each one to the
  serial decoded from blastware_filename.  Updates the row in place
  unless the target (serial, timestamp) already has a row, in which
  case the UNKNOWN duplicate is deleted.  Idempotent.  Default
  dry-run; pass --apply to commit.

  Verified on the user's actual DB (dry-run):
    UNKNOWN rows scanned:       3039
    Updated to real serial:     2602
    Deleted (duplicate of an
     already-correct row):      437
    Unresolved (bad filename):  0
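
A sketch of the repair pass, not the actual scripts/repair_unknown_serials.py. It assumes an `events` table with `id`, `serial`, `timestamp` and `blastware_filename` columns, and takes the filename decoder as a `decode_serial` callable because the decoding rule is not shown here:

```python
import sqlite3


def repair_unknown_serials(conn, decode_serial, apply=False):
    """Re-attribute serial='UNKNOWN' rows; returns (updated, deleted, unresolved)."""
    updated = deleted = unresolved = 0
    rows = conn.execute(
        "SELECT id, timestamp, blastware_filename FROM events "
        "WHERE serial = 'UNKNOWN'"
    ).fetchall()
    for row_id, timestamp, filename in rows:
        serial = decode_serial(filename)
        if not serial:
            unresolved += 1  # filename could not be decoded, leave row alone
            continue
        clash = conn.execute(
            "SELECT 1 FROM events WHERE serial = ? AND timestamp = ?",
            (serial, timestamp),
        ).fetchone()
        if clash:
            deleted += 1     # a correctly attributed row already exists
            if apply:
                conn.execute("DELETE FROM events WHERE id = ?", (row_id,))
        else:
            updated += 1
            if apply:
                conn.execute(
                    "UPDATE events SET serial = ? WHERE id = ?", (serial, row_id)
                )
    if apply:
        conn.commit()
    return updated, deleted, unresolved
```

With `apply=False` the function only counts what it would do, which matches the dry-run default; a second run with `apply=True` finds no resolvable UNKNOWN rows left, so the pass is idempotent.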

After running the repair, /db/units will show all 24 units
correctly populated.

Previous query_units() only joined on ach_sessions, which is created
exclusively by the live ACH server.  The BW-importer path
(/db/import/blastware_file → WaveformStore.save_imported_bw →
SeismoDb.insert_events) populates `events` but never creates an
ach_sessions row.  Consequence: every serial whose events flowed in
through the series3-watcher forwarder was invisible to
/db/units (and therefore to the SFM webapp's fleet overview / units
list), even though the events were correctly populated in the
events table with proper serial attribution.

Rewrite query_units() to aggregate from BOTH tables and union the
serials:
  - total_events / last_event_at  come from `events` (every ingest path)
  - last_session_at / total_monitor_entries / total_sessions
                                  come from `ach_sessions` (ACH-only),
                                  0 when no sessions exist for the serial
  - last_seen = max(last_event_at, last_session_at)
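
One way to express the two-table aggregation is sketched below with the standard `sqlite3` module. The column names `timestamp` and `started_at` are assumptions about the schema, timestamps are assumed to be ISO-8601 text so that string comparison orders them correctly, and `total_monitor_entries` is left out for brevity:

```python
import sqlite3

UNITS_SQL = """
WITH ev AS (
    SELECT serial, COUNT(*) AS total_events, MAX(timestamp) AS last_event_at
    FROM events GROUP BY serial
),
se AS (
    SELECT serial, COUNT(*) AS total_sessions, MAX(started_at) AS last_session_at
    FROM ach_sessions GROUP BY serial
),
serials AS (
    SELECT serial FROM ev UNION SELECT serial FROM se
)
SELECT s.serial,
       COALESCE(ev.total_events, 0),
       ev.last_event_at,
       COALESCE(se.total_sessions, 0),
       se.last_session_at
FROM serials s
LEFT JOIN ev ON ev.serial = s.serial
LEFT JOIN se ON se.serial = s.serial
ORDER BY s.serial
"""


def query_units(conn):
    """Return one dict per serial seen in either events or ach_sessions."""
    units = []
    for serial, n_events, last_event, n_sessions, last_session in conn.execute(UNITS_SQL):
        seen = [t for t in (last_event, last_session) if t is not None]
        units.append({
            "serial": serial,
            "total_events": n_events,
            "last_event_at": last_event,
            "total_sessions": n_sessions,
            "last_session_at": last_session,
            "last_seen": max(seen) if seen else None,
        })
    return units
```

The `UNION` of serials is what makes watcher-forwarded units appear even when they have no `ach_sessions` row.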

Verified on the user's actual prod DB after the
repair_unknown_serials run: /db/units now returns 24 serials instead
of 2.  All 3,039 watcher-forwarded events become visible in the
fleet overview without any further DB surgery.

Two compounding bugs caused forwarded events to land in the DB with
broken-codec peak values (~10 in/s saturation on every channel) and
no project info, even when the watcher correctly paired a BW ASCII
report with the binary.

Bug 1: save_imported_bw built the sidecar JSON with the report's
authoritative peak / project values via event_to_sidecar_dict(
bw_report=...), but never overlaid those onto the in-memory Event
that flows to db.insert_events().  So the DB row got peak_values
from read_blastware_file()._peaks_from_samples() — which runs the
still-undecoded waveform body codec assuming raw int16 LE and
produces ±32K-shaped noise (= ±10 in/s at Normal range) regardless
of the actual signal.  The sidecar JSON had the truth but the DB
columns (which the webapp queries for fast filter/sort) lied.

Bug 2: insert_events' IntegrityError handler only refreshed the
filename/filesize/a5_pickle/sidecar columns when a duplicate
(serial, timestamp) was seen.  Peak values, project info,
sample_rate, record_type stayed locked in at whatever the FIRST
insert wrote.  So even after Bug 1 was fixed, the historical
events in the DB (already inserted with broken-codec peaks) would
never get their values corrected, because a re-forward would just
hit IntegrityError and skip the field refresh.

Fix 1 (minimateplus/event_file_io.py + sfm/waveform_store.py):
  - New apply_report_to_event(event, report) helper folds the BW
    report's device-authoritative fields onto the Event in-place:
    per-channel PPV, peak vector sum, mic PSPL→psi, project /
    client / operator / sensor_location, sample_rate, record_time.
  - save_imported_bw() calls the helper right after parsing the
    report.  The Event that flows to insert_events() now carries
    correct values.

Fix 2 (sfm/database.py):
  - insert_events()'s IntegrityError UPDATE now refreshes every
    device-authoritative column from the new data: tran_ppv,
    vert_ppv, long_ppv, peak_vector_sum, mic_ppv, project, client,
    operator, sensor_location, sample_rate, record_type, plus
    the existing filename/filesize/a5_pickle/sidecar fields.
  - Preserves: id, waveform_key, session_id, created_at (immutable
    / FK fields), and false_trigger (operator review state).
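
A minimal sketch of the refresh-on-duplicate behaviour in Fix 2, assuming a UNIQUE(serial, timestamp) constraint on `events`. `upsert_event` is a hypothetical name and `REFRESH_COLUMNS` is a trimmed subset of the columns listed above:

```python
import sqlite3

# Device-authoritative columns that are overwritten on a duplicate (subset).
REFRESH_COLUMNS = (
    "tran_ppv", "vert_ppv", "long_ppv", "peak_vector_sum",
    "project", "sensor_location",
)


def upsert_event(conn, row: dict) -> None:
    """Insert the row, or refresh its device-authoritative columns if it exists."""
    columns = ("serial", "timestamp") + REFRESH_COLUMNS
    placeholders = ", ".join("?" for _ in columns)
    try:
        conn.execute(
            f"INSERT INTO events ({', '.join(columns)}) VALUES ({placeholders})",
            [row.get(c) for c in columns],
        )
    except sqlite3.IntegrityError:
        # Duplicate (serial, timestamp): update only the refreshable columns,
        # leaving id, created_at, false_trigger and similar fields untouched.
        assignments = ", ".join(f"{c} = ?" for c in REFRESH_COLUMNS)
        conn.execute(
            f"UPDATE events SET {assignments} WHERE serial = ? AND timestamp = ?",
            [row.get(c) for c in REFRESH_COLUMNS] + [row["serial"], row["timestamp"]],
        )
```

Because `false_trigger` is not in the refreshed set, an operator's review flag survives a re-import.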

End-to-end simulation verified:
  - Step 1: import without report → DB has ±10 in/s peaks, no project
  - Step 2: re-import WITH report → upsert path fires, DB now has
            device-authoritative 0.005 in/s peaks + sensor_location
  - Step 3: operator sets false_trigger=1, re-import again → flag
            preserved, peaks remain correct

For the user's situation: deleting the watcher state file forces a
re-forward of all events.  Each re-forward now pairs with its
_ASCII.TXT, applies the report onto the Event, and the upsert
refreshes the DB row.  No DB nuke needed.

Full SFM suite: 62 passed, 44 skipped.

The series3-watcher v1.5.0 fix taught the WATCHER to look for BW
ACH's _ASCII.TXT report alongside each binary.  But the SFM
SERVER's import endpoint only knew about the legacy <binary>.TXT
naming when building its TXT lookup table.

Effect: even though the watcher correctly shipped both files in
the multipart POST (and logged "+ <name>_ASCII.TXT attached"),
the server's reports dict was keyed on the wrong name, so
report_bytes resolved to None for every event.  Without the
report, save_imported_bw fell back to broken-codec peak values
and no project info — exactly the same symptom as before the
watcher fix landed, just for a different reason.

Fix: when stripping the ".TXT" suffix, also recognise the
"_ASCII" trailer and reconstruct the binary's filename by
converting the last "_" back to ".".  Register the report under
BOTH possible binary names so the subsequent lookup matches
whichever convention the operator's BW installation uses.

  ACH convention (Blastware ACH):
    binary T003L2G6.0E0H  + report T003L2G6_0E0H_ASCII.TXT  
  Manual export (operator clicks Save As Text in BW):
    binary M529LK44.AB0   + report M529LK44.AB0.TXT          
  Both for same event (e.g. ACH + operator manual save):
    register under both names; binary lookup wins             
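
The name reconstruction can be sketched as follows. `binary_names_for_report` is a hypothetical helper, and treating the suffixes case-insensitively is an assumption:

```python
def binary_names_for_report(report_name: str) -> list:
    """Return the binary filenames a report may belong to (possibly empty)."""
    if not report_name.upper().endswith(".TXT"):
        return []
    stem = report_name[:-4]
    names = [stem]                       # manual export: <binary>.TXT
    if stem.upper().endswith("_ASCII"):
        base = stem[:-6]                 # drop the "_ASCII" trailer
        head, sep, tail = base.rpartition("_")
        if sep:
            names.append(f"{head}.{tail}")  # ACH: last "_" back to "."
    return names
```

The server can then register the report bytes under every returned name, so a lookup by the binary's filename matches under either convention.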

Smoke-tested against the four real fixture filenames in the
project archive.  Full SFM suite still 62 pass.

For the user's situation: pull, restart, and the NEXT re-forward
pass (after deleting watcher state file again if needed) will
hit this code path, parse the report correctly, apply the
overlay onto the Event, and the upsert path will land
authoritative peak values + project info in the DB.

The "BW ACH ingestion" release.  Paired with series3-watcher v1.5.0,
every Blastware ACH event (binary + _ASCII.TXT report) lands in
SeismoDb with device-authoritative peaks, project metadata, sensor
self-check, and ZC/Time-of-Peak data — without depending on the
still-undecoded waveform body codec.

Bumps pyproject.toml + minimateplus/event_file_io.py TOOL_VERSION
to 0.16.0.  README banner + CHANGELOG entry summarise the work
that landed across commits cdfe4ad..f83993a on this branch.

Consolidates everything that was floating in chat-only "parking
lot" status into the README's Roadmap (Future) section:

  High-impact (unblocks product features):
    - Waveform body codec reverse-engineering
    - In-app waveform viewer accuracy (depends on codec)
    - Terra-view integration
    - Vibration summary reports

  BW ASCII report parser enhancements:
    - Histogram-specific structural fields
    - Histogram interval bin-table parsing
    - ">100 Hz" value parsing

  Ingestion gaps:
    - MLG forwarding (watcher + SFM endpoint)
    - 0C-record raw bytes persistence in sidecar

  Operational:
    - series3-watcher file archive manager
    - Existing operational items (compliance encoder, modem manager,
      Call Home dial_string write, histogram mode 5A stream)

  Test coverage + lower-priority cleanups.

CLAUDE.md "What's next" section now points to the README as the
canonical deferred-work list, and keeps its own low-level technical
status log for byte-layout details that don't belong in the
roadmap.

serversdown merged commit d2e48c62b5 into main 2026-05-11 15:55:24 -04:00