backfill_sidecars: filter out Thor IDF files

Discovered while dry-running the backfill on prod: the waveform store
contains both BW (.AB0*/.N00) and Thor IDF (.IDFW/.IDFH) event files
side-by-side because both go through the same per-serial directory
layout.  The script's `_looks_like_event_file` heuristic accepted any
3-4 char extension ending in W or H, which matched both BW and IDF.

The script then routes everything through
`event_file_io.read_blastware_file`, which rejects IDF files with
"not a Blastware file (bad header prefix)" — 3807 errors on prod
out of 7201 total events.

Thor IDF events have their own ingest path
(`WaveformStore.save_imported_idf`) and their sidecars are populated
at ingest from the paired `.IDFW.txt` ASCII report.  The backfill
script has no value to add for them — there's no decoder to refresh,
and the sidecar metadata is already correct.  Filter them out.

After this fix, the prod backfill should run clean: ~3392 BW events
get sidecar+h5 regen as expected; the ~3807 Thor IDF events are
silently skipped.

The proper "IDF backfill" (refresh tool_version stamp on IDF
sidecars by re-running event_to_sidecar_dict against the stored
DB row + sidecar extensions block) is a separate, narrower
follow-up — not blocking the BW backfill rollout.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-21 01:20:08 +00:00
parent 7183b953e4
commit 88549bc659
+14 -2
View File
@@ -54,14 +54,26 @@ log = logging.getLogger("backfill_sidecars")
def _looks_like_event_file(path: Path) -> bool: def _looks_like_event_file(path: Path) -> bool:
"""Same heuristic as the importer CLI.""" """Same heuristic as the importer CLI.
Filters to BW (Series III) event files only — Thor (Series IV)
`.IDFW` / `.IDFH` files share the store but have their own ingest
path (`WaveformStore.save_imported_idf`) and are NOT decodable by
`event_file_io.read_blastware_file`. Their sidecars are populated
at ingest from the paired `.IDFW.txt` ASCII report; nothing the
backfill regenerates would improve on them, so we exclude them
from scope.
"""
if not path.is_file(): if not path.is_file():
return False return False
if path.name.endswith((".a5.pkl", ".sfm.json")): if path.name.endswith((".a5.pkl", ".sfm.json", ".h5")):
return False return False
ext = path.suffix.lstrip(".") ext = path.suffix.lstrip(".")
if not (3 <= len(ext) <= 4): if not (3 <= len(ext) <= 4):
return False return False
# Thor IDF files share the .{W,H}-suffix shape but aren't BW.
if ext.upper() in ("IDFW", "IDFH"):
return False
if not (ext[-1].upper() in {"W", "H"} or ext.endswith("0")): if not (ext[-1].upper() in {"W", "H"} or ext.endswith("0")):
return False return False
try: try: