feat(sfm): Phase 5a — bulk-backfill projects/locations/assignments from event metadata

Operator clicks one button. Parser reads SFM's events table
(operator-typed project / client / sensor_location strings), clusters by
serial + time + metadata, fuzzy-matches against existing projects, and
proposes Project / MonitoringLocation / UnitAssignment chains to create.
Auto-applies high-confidence non-conflicting clusters in bulk; queues
medium/low confidence for individual review.

Verified against real data: 10,052 events → 59 clusters → 37
high-confidence + 14 medium + 8 low. Test-applied one cluster
end-to-end; Project + Module + Location + Assignment + UnitHistory +
Decision rows all created correctly, and Phase 2's attribution walk
picked up the events automatically on the new location's detail page.

Pipeline (backend/services/metadata_backfill.py, ~700 lines):
  1. Pull all SFM events via /db/events per serial.
  2. Pre-filter: drop events already covered by an existing
     UnitAssignment window (Phase 2 handles those automatically).
  3. Time-cluster what's left: serial + 7-day gap is the cluster
     identity.
  4. Metadata-split each time-cluster on persistent metadata
     transitions (≥ 2 consecutive events) so a single typo doesn't
     fork the cluster.
  5. Match against existing graph (rapidfuzz.WRatio multi-signal
     scoring, normalisation that handles abbreviations / reorders /
     separator variations). Thresholds: 0.95 exact, 0.80 fuzzy,
     min-shorter-input 5 chars to guardrail false positives on single
     common words.
  6. Score confidence (high/medium/low) using event count, span,
     blank-meta, conflict, ambiguity rules.
  7. Detect conflicts: overlap with existing UnitAssignment at a
     different location for the same serial → blocking. Operator must
     reconcile.
  8. Apply: ensure auto_imported ProjectType exists, ensure
     vibration_monitoring ProjectModule on the project, write Project /
     MonitoringLocation / UnitAssignment / UnitHistory all in one
     transaction.

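Pipeline steps 3 and 4 can be sketched roughly as below. This is an
illustrative, stdlib-only approximation and not the code in
backend/services/metadata_backfill.py: the Event dataclass, the function
names, the case-insensitive metadata key, and the reading of "7-day gap"
as "strictly more than 7 days between consecutive events" are all
assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby

GAP = timedelta(days=7)  # assumed: a gap strictly longer than this starts a new cluster
MIN_RUN = 2              # a metadata change must persist for >= 2 consecutive events


@dataclass(frozen=True)
class Event:  # hypothetical shape; the real SFM event rows are not shown in this commit
    serial: str
    ts: datetime
    project: str
    location: str


def _meta_key(ev: Event) -> tuple[str, str]:
    """Crude normalisation (assumed): trim whitespace and ignore case."""
    return (ev.project.strip().lower(), ev.location.strip().lower())


def time_clusters(events: list[Event]) -> list[list[Event]]:
    """Group events per serial, cutting wherever consecutive events are > GAP apart."""
    clusters: list[list[Event]] = []
    ordered = sorted(events, key=lambda e: (e.serial, e.ts))
    for _, group in groupby(ordered, key=lambda e: e.serial):
        current: list[Event] = []
        for ev in group:
            if current and ev.ts - current[-1].ts > GAP:
                clusters.append(current)
                current = []
            current.append(ev)
        clusters.append(current)
    return clusters


def split_on_metadata(cluster: list[Event]) -> list[list[Event]]:
    """Split a time-ordered cluster only where new metadata persists for MIN_RUN events."""
    parts: list[list[Event]] = []
    anchor = None  # metadata key of the persistent run that defines the current part
    for _, run_iter in groupby(cluster, key=_meta_key):
        run = list(run_iter)
        key = _meta_key(run[0])
        persistent = len(run) >= MIN_RUN
        if parts and (not persistent or key == anchor):
            parts[-1].extend(run)   # one-off blip, or a return to the anchor value
        elif parts and anchor is None:
            parts[-1].extend(run)   # leading blip: adopt the first persistent value
            anchor = key
        else:
            parts.append(run)       # first run, or a persistent transition
            anchor = key if persistent else None
    return parts
```

With events for one serial on days 0, 1, 2, 3, 4, 5 and 20 whose project
strings are Alpha, Alpha, Alpah, Alpha, Beta, Beta, Beta, time_clusters
yields two clusters (the 15-day gap splits off the last event), and
split_on_metadata on the first cluster yields a 4-event Alpha part (the
"Alpah" typo is absorbed) and a 2-event Beta part.
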
Migration (backend/migrate_add_metadata_backfill.py): adds
unit_assignments.source column (default 'manual') and
metadata_backfill_decisions table. Idempotent, non-destructive.

API (backend/routers/metadata_backfill.py):
  GET  /api/admin/metadata_backfill/scan   - clusters + suggestions
  POST /api/admin/metadata_backfill/apply  - bulk apply by cluster_ids
                                             w/ optional per-cluster
                                             project/location overrides
  POST /api/admin/metadata_backfill/skip   - mark skipped (persistent)

UI (templates/admin/metadata_backfill.html, accessible at
/settings/developer/metadata-backfill via the Developer tab of Settings):
  - One-button "Run scan" entry.
  - Summary KPI tiles (scanned / already attributed / pending /
    conflicts).
  - "Apply all high-confidence" bulk button at the top (primary path).
  - Per-cluster cards below with Apply / Skip / Preview event actions.
  - Blank-meta clusters get inline input fields for operator-typed
    project + location names before applying.
  - Blocking-conflict clusters render with the conflicting assignment
    information and a disabled Apply button.
  - Live progress toast during apply.
  - Reuses the Phase 1+2+4 event-detail modal for "Preview event":
    operator can sanity-check the BW report data against the cluster's
    sample event.

Dependencies: rapidfuzz==3.10.1 added to requirements.txt. Pre-built C
wheels for all platforms, ~5s docker build hit.

Phase 5b (deferred to next session): swap-detection daily background
job, notification inbox for auto-applied swaps, recently-applied audit
view, "Tidy" page for renaming/merging auto-created projects.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
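
The "idempotent, non-destructive" migration described above might look
something like the following. This is a minimal sketch that assumes a
SQLite database (the commit does not say which engine is used); the
migrate function name is hypothetical, and the decisions table is
abbreviated to a few of the columns from the model in the diff.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Add unit_assignments.source and the decisions table only if they are missing."""
    # PRAGMA table_info returns one row per column; index 1 is the column name.
    cols = {row[1] for row in conn.execute("PRAGMA table_info(unit_assignments)")}
    if "source" not in cols:
        # Existing rows pick up the default, so nothing is rewritten or dropped.
        conn.execute(
            "ALTER TABLE unit_assignments "
            "ADD COLUMN source TEXT NOT NULL DEFAULT 'manual'"
        )
    # Abbreviated column list (assumption): the real table has more fields.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS metadata_backfill_decisions (
            cluster_id TEXT PRIMARY KEY,
            status     TEXT NOT NULL,
            confidence TEXT NOT NULL,
            serial     TEXT NOT NULL
        )
        """
    )
    conn.commit()
```

Running migrate twice against the same connection leaves the schema
unchanged after the first call, which is the property "idempotent"
refers to here.
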
2026-05-12 05:54:57 +00:00
parent 21844b4d65
commit 42de06f441
8 changed files with 1828 additions and 0 deletions
@@ -259,9 +259,48 @@ class UnitAssignment(Base):
    device_type = Column(String, nullable=False)  # "slm" | "seismograph"
    project_id = Column(String, nullable=False, index=True)  # FK to Project.id

+    # Provenance: how was this assignment created?  Used for auditing,
+    # bulk-undo of parser actions, and the Phase 4 deployment timeline.
+    #   "manual"                  — operator created via UI
+    #   "metadata_backfill"       — auto-created by the metadata parser
+    #                                from operator-typed BW event metadata
+    #                                (bulk backfill workflow)
+    #   "metadata_backfill_swap"  — auto-created by swap-detection
+    #                                background job
+    source = Column(String, nullable=False, default="manual")
+
    created_at = Column(DateTime, default=datetime.utcnow)


+class MetadataBackfillDecision(Base):
+    """
+    Per-cluster decisions tracked by the metadata-backfill parser.
+
+    `cluster_id` is the deterministic SHA1 hash of
+    (serial, first_event_date, last_event_date), so the same cluster
+    produces the same id across re-scans.  The decisions table lets the
+    parser remember "I already applied this" or "operator skipped this"
+    across scan invocations.
+    """
+    __tablename__ = "metadata_backfill_decisions"
+
+    cluster_id            = Column(String, primary_key=True)
+    status                = Column(String, nullable=False)   # pending | applied | skipped | conflict
+    confidence            = Column(String, nullable=False)   # high | medium | low
+    decided_at            = Column(DateTime, nullable=True)
+    decided_by            = Column(String, nullable=True)    # background | operator | auto-high
+    applied_assignment_id = Column(String, nullable=True)    # FK to unit_assignments.id
+    notes                 = Column(Text,   nullable=True)
+    first_seen_at         = Column(DateTime, nullable=False, default=datetime.utcnow)
+    last_seen_at          = Column(DateTime, nullable=False, default=datetime.utcnow)
+    serial                = Column(String, nullable=False, index=True)
+    project_raw           = Column(String, nullable=True)
+    location_raw          = Column(String, nullable=True)
+    first_event_ts        = Column(DateTime, nullable=True)
+    last_event_ts         = Column(DateTime, nullable=True)
+    event_count           = Column(Integer, nullable=False, default=0)
+
+
 class ScheduledAction(Base):
    """
    Scheduled actions: automation for recording start/stop/download.