feat(sfm): Phase 5a — bulk-backfill projects/locations/assignments from event metadata

Operator clicks one button. Parser reads SFM's events table (operator-typed
project / client / sensor_location strings), clusters by serial + time +
metadata, fuzzy-matches against existing projects, and proposes Project /
MonitoringLocation / UnitAssignment chains to create. Auto-applies
high-confidence non-conflicting clusters in bulk; queues medium/low
confidence for individual review.

Verified against real data: 10,052 events → 59 clusters → 37 high-confidence
+ 14 medium + 8 low. Test-applied one cluster end-to-end; Project + Module +
Location + Assignment + UnitHistory + Decision rows all created correctly,
and Phase 2's attribution walk picked up the events automatically on the new
location's detail page.

Pipeline (backend/services/metadata_backfill.py, ~700 lines):

1. Pull all SFM events via /db/events per serial.
2. Pre-filter: drop events already covered by an existing UnitAssignment
   window (Phase 2 handles those automatically).
3. Time-cluster what's left: serial + 7-day gap is the cluster identity.
4. Metadata-split each time-cluster on persistent metadata transitions
   (≥ 2 consecutive events) so a single typo doesn't fork the cluster.
5. Match against existing graph (rapidfuzz.WRatio multi-signal scoring,
   normalisation that handles abbreviations / reorders / separator
   variations). Thresholds: 0.95 exact, 0.80 fuzzy, min-shorter-input
   5 chars to guardrail false positives on single common words.
6. Score confidence (high/medium/low) using event count, span, blank-meta,
   conflict, ambiguity rules.
7. Detect conflicts: overlap with existing UnitAssignment at a different
   location for the same serial → blocking. Operator must reconcile.
8. Apply: ensure auto_imported ProjectType exists, ensure
   vibration_monitoring ProjectModule on the project, write Project /
   MonitoringLocation / UnitAssignment / UnitHistory all in one transaction.
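Steps 3 and 4 of the pipeline can be sketched as follows. This is a minimal illustration, not the code in backend/services/metadata_backfill.py: the `Event` dataclass, the function names, and the choice to compare a single `project` string verbatim are all assumptions made for the example, as is treating a gap of exactly 7 days as still belonging to the same cluster.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby


@dataclass(frozen=True)
class Event:
    """Hypothetical event shape; the real SFM rows carry more fields."""
    serial: str
    ts: datetime
    project: str  # operator-typed metadata, compared verbatim in this sketch


MAX_GAP = timedelta(days=7)  # gap that separates two time-clusters
MIN_RUN = 2                  # a change must persist this many events to split


def time_clusters(events: list[Event]) -> list[list[Event]]:
    """Group events per serial, starting a new cluster after a gap > MAX_GAP."""
    clusters: list[list[Event]] = []
    ordered = sorted(events, key=lambda e: (e.serial, e.ts))
    for _, group in groupby(ordered, key=lambda e: e.serial):
        current: list[Event] = []
        for ev in group:
            if current and ev.ts - current[-1].ts > MAX_GAP:
                clusters.append(current)
                current = []
            current.append(ev)
        clusters.append(current)  # groupby never yields an empty group
    return clusters


def split_on_persistent_change(cluster: list[Event]) -> list[list[Event]]:
    """Split a time-cluster where the metadata changes and stays changed.

    A differing value that does not repeat for MIN_RUN consecutive events is
    treated as a typo and left inside the current piece.
    """
    if not cluster:
        return []
    pieces: list[list[Event]] = []
    current = [cluster[0]]
    value = cluster[0].project
    for i in range(1, len(cluster)):
        ev = cluster[i]
        run = cluster[i:i + MIN_RUN]
        persistent = len(run) == MIN_RUN and all(e.project == ev.project for e in run)
        if ev.project != value and persistent:
            pieces.append(current)
            current, value = [], ev.project
        current.append(ev)
    pieces.append(current)
    return pieces
```

Running `time_clusters` first and then `split_on_persistent_change` on each result mirrors the order of the two pipeline steps; a lone mistyped project name stays with its neighbours, while two or more consecutive events with a new name start a new piece.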
Migration (backend/migrate_add_metadata_backfill.py): adds
unit_assignments.source column (default 'manual') and
metadata_backfill_decisions table. Idempotent, non-destructive.

API (backend/routers/metadata_backfill.py):

  GET  /api/admin/metadata_backfill/scan    clusters + suggestions
  POST /api/admin/metadata_backfill/apply   bulk apply by cluster_ids w/ optional
                                            per-cluster project/location overrides
  POST /api/admin/metadata_backfill/skip    mark skipped (persistent)

UI (templates/admin/metadata_backfill.html, accessible at
/settings/developer/metadata-backfill via the Developer tab of Settings):

- One-button "Run scan" entry.
- Summary KPI tiles (scanned / already attributed / pending / conflicts).
- "Apply all high-confidence" bulk button at the top (primary path).
- Per-cluster cards below with Apply / Skip / Preview event actions.
- Blank-meta clusters get inline input fields for operator-typed project +
  location names before applying.
- Blocking-conflict clusters render with the conflicting assignment
  information and a disabled Apply button.
- Live progress toast during apply.
- Reuses the Phase 1+2+4 event-detail modal for "Preview event", so the
  operator can sanity-check the BW report data against the cluster's
  sample event.

Dependencies: rapidfuzz==3.10.1 added to requirements.txt. Pre-built C
wheels for all platforms, ~5s docker build hit.

Phase 5b (deferred to next session): swap-detection daily background job,
notification inbox for auto-applied swaps, recently-applied audit view,
"Tidy" page for renaming/merging auto-created projects.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
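The "idempotent, non-destructive" property of the migration described in the commit message can be illustrated with a small sketch. It assumes a SQLite database and uses the standard-library `sqlite3` module; the commit does not say which database is in use, and the columns of `metadata_backfill_decisions` shown here (`cluster_id`, `decision`, `decided_by`, `decided_at`) are hypothetical placeholders rather than the real schema.

```python
import sqlite3


def column_exists(conn: sqlite3.Connection, table: str, column: str) -> bool:
    """Return True if `table` already has `column` (SQLite-specific check)."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name


def migrate(conn: sqlite3.Connection) -> None:
    """Safe to run repeatedly: each step is skipped once it has been applied."""
    # Add the column only if it is missing; existing rows pick up the default.
    if not column_exists(conn, "unit_assignments", "source"):
        conn.execute(
            "ALTER TABLE unit_assignments "
            "ADD COLUMN source TEXT NOT NULL DEFAULT 'manual'"
        )
    # Hypothetical column set; only the table name comes from the commit.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS metadata_backfill_decisions (
            cluster_id TEXT PRIMARY KEY,
            decision   TEXT NOT NULL,
            decided_by TEXT,
            decided_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    conn.commit()
```

Calling `migrate(conn)` twice in a row leaves the schema unchanged after the first call, and nothing is dropped or rewritten, which is what makes a script like this safe to re-run on deploy.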
2026-05-12 05:54:57 +00:00
parent 21844b4d65
commit 42de06f441
8 changed files with 1828 additions and 0 deletions
@@ -0,0 +1,226 @@
+"""
+Metadata-backfill admin router.
+
+Endpoints under /api/admin/metadata_backfill:
+
+  GET  /scan       — run the scan; return clusters + suggestions (JSON).
+                     Cached 5 minutes so the wizard doesn't re-scan on
+                     every page render.
+  POST /apply      — apply a list of cluster_ids; body specifies which to
+                     accept and optional per-cluster overrides.
+  POST /skip       — mark cluster_ids as skipped (won't reappear).
+"""
+
+from __future__ import annotations
+
+import os
+import time
+from typing import Optional
+
+from fastapi import APIRouter, Depends, HTTPException, Request
+from fastapi.responses import JSONResponse
+from sqlalchemy.orm import Session
+
+from backend.database import get_db
+from backend.services import metadata_backfill as svc
+
+router = APIRouter(prefix="/api/admin/metadata_backfill", tags=["metadata-backfill"])
+
+SFM_BASE_URL = os.getenv("SFM_BASE_URL", "http://localhost:8200")
+
+# In-process scan cache.  Trades memory for not re-hammering SFM on every
+# wizard render.  TTL: 5 minutes.  Singleton per-process; fine for a
+# single-worker uvicorn dev setup.  For prod multi-worker we'd want to put
+# this in the DB or Redis; deferred.
+_SCAN_CACHE: dict = {"at": 0.0, "result": None}
+_SCAN_CACHE_TTL_SECONDS = 300.0
+
+
+def _serialise_suggestion(s: svc.Suggestion) -> dict:
+    c = s.cluster
+    return {
+        "cluster_id":              c.cluster_id,
+        "serial":                  c.serial,
+        "first_event_ts":          c.first_event_ts.isoformat(),
+        "last_event_ts":           c.last_event_ts.isoformat(),
+        "event_count":             c.event_count,
+        "sample_event_id":         c.sample_event_id,
+        "project_raw":             c.project_raw,
+        "location_raw":            c.location_raw,
+        "client_raw":              c.client_raw,
+        "operator_raw":            c.operator_raw,
+        "is_blank_meta":           c.is_blank_meta,
+        "metadata_consistency":    c.metadata_consistency,
+
+        "project_match":           s.project_match,
+        "project_existing_id":     s.project_existing_id,
+        "project_existing_name":   s.project_existing_name,
+        "project_match_score":     s.project_match_score,
+        "project_suggested_name":  s.project_suggested_name,
+
+        "location_match":          s.location_match,
+        "location_existing_id":    s.location_existing_id,
+        "location_existing_name":  s.location_existing_name,
+        "location_match_score":    s.location_match_score,
+        "location_suggested_name": s.location_suggested_name,
+
+        "proposed_assigned_at":    s.proposed_assigned_at.isoformat(),
+        "proposed_assigned_until": s.proposed_assigned_until.isoformat() if s.proposed_assigned_until else None,
+
+        "confidence":              s.confidence,
+        "blocking_conflict":       s.blocking_conflict,
+        "conflicts": [
+            {
+                "existing_assignment_id": cf.existing_assignment_id,
+                "other_location_id":      cf.other_location_id,
+                "other_location_name":    cf.other_location_name,
+                "other_project_id":       cf.other_project_id,
+                "other_project_name":     cf.other_project_name,
+            }
+            for cf in s.conflicts
+        ],
+    }
+
+
+@router.get("/scan")
+async def scan(
+    force: bool = False,
+    db: Session = Depends(get_db),
+):
+    """Run a scan and return clusters + suggestions.
+
+    Set force=true to bypass the 5-minute cache.
+    """
+    now = time.time()
+    if not force and _SCAN_CACHE["result"] is not None \
+            and (now - _SCAN_CACHE["at"]) < _SCAN_CACHE_TTL_SECONDS:
+        return _SCAN_CACHE["result"]
+
+    result = await svc.scan_clusters_and_build_suggestions(db, SFM_BASE_URL)
+
+    # Group suggestions for the wizard UI.
+    by_confidence = {"high": [], "medium": [], "low": []}
+    blocking_conflict_count = 0
+    for s in result.suggestions:
+        by_confidence[s.confidence].append(_serialise_suggestion(s))
+        if s.blocking_conflict:
+            blocking_conflict_count += 1
+
+    payload = {
+        "scanned_event_count":     result.scanned_event_count,
+        "cluster_count":           result.cluster_count,
+        "already_attributed":      result.already_attributed,
+        "skipped_orphans":         result.skipped_orphans,
+        "pending_count":           len(result.suggestions),
+        "blocking_conflict_count": blocking_conflict_count,
+        "by_confidence": {
+            "high":   by_confidence["high"],
+            "medium": by_confidence["medium"],
+            "low":    by_confidence["low"],
+        },
+        "scanned_at":              now,
+    }
+    _SCAN_CACHE["result"] = payload
+    _SCAN_CACHE["at"]     = now
+    return payload
+
+
+@router.post("/apply")
+async def apply(
+    request: Request,
+    db: Session = Depends(get_db),
+):
+    """Apply a list of clusters.
+
+    Body:
+      {
+        "cluster_ids": ["abc...", "def..."],
+        "overrides":   { "abc...": { "project_name": "...", "location_name": "..." } }
+      }
+
+    To accept ALL non-conflict suggestions in one shot, the UI sends every
+    pending cluster_id with no overrides.
+    """
+    try:
+        body = await request.json()
+        # A non-object body (e.g. a JSON list) has no .get(); treat it as invalid too.
+        cluster_ids = body.get("cluster_ids") or []
+        overrides   = body.get("overrides") or {}
+    except Exception:
+        raise HTTPException(status_code=400, detail="Invalid JSON body")
+    if not isinstance(cluster_ids, list) or not cluster_ids:
+        raise HTTPException(status_code=400, detail="cluster_ids must be a non-empty list")
+
+    # Re-scan to get current suggestions.  We don't trust the cached scan
+    # blindly — the operator might have manually created projects in
+    # between scan and apply.
+    scan_result = await svc.scan_clusters_and_build_suggestions(db, SFM_BASE_URL)
+    suggestions_by_id = {s.cluster.cluster_id: s for s in scan_result.suggestions}
+
+    selected: list[svc.Suggestion] = []
+    not_found: list[str] = []
+    for cid in cluster_ids:
+        s = suggestions_by_id.get(cid)
+        if s is None:
+            not_found.append(cid)
+            continue
+        # Apply overrides.  A blank or whitespace-only override is ignored,
+        # so an empty input field never detaches the cluster from the
+        # project/location it already matched.
+        ov = overrides.get(cid) or {}
+        if not isinstance(ov, dict):
+            raise HTTPException(status_code=400, detail=f"overrides[{cid!r}] must be an object")
+        project_override = (ov.get("project_name") or "").strip()
+        if project_override:
+            # Operator typed a custom name, which implies they want to
+            # create new (or rename).  Force create-new behaviour so we
+            # don't accidentally attach to a different existing project
+            # by exact-match.
+            s.project_suggested_name = project_override
+            s.project_existing_id = None
+            s.project_match = "create_new"
+        location_override = (ov.get("location_name") or "").strip()
+        if location_override:
+            # Same rule for locations: a typed name always creates new.
+            s.location_suggested_name = location_override
+            s.location_existing_id = None
+            s.location_match = "create_new"
+        selected.append(s)
+
+    apply_result = svc.apply_suggestions(db, selected, decided_by="operator")
+
+    # Invalidate the scan cache so the next /scan picks up the new state.
+    _SCAN_CACHE["at"] = 0.0
+    _SCAN_CACHE["result"] = None
+
+    return {
+        "applied":              apply_result.applied,
+        "failed":               [{"cluster_id": cid, "reason": r} for cid, r in apply_result.failed],
+        "not_found":            not_found,
+        "project_ids_created":  apply_result.project_ids_created,
+        "location_ids_created": apply_result.location_ids_created,
+        "assignment_ids_created": apply_result.assignment_ids_created,
+    }
+
+
+@router.post("/skip")
+async def skip(
+    request: Request,
+    db: Session = Depends(get_db),
+):
+    """Mark cluster_ids as skipped — they won't reappear in future scans."""
+    try:
+        body = await request.json()
+        # A non-object body (e.g. a JSON list) has no .get(); treat it as invalid too.
+        cluster_ids = body.get("cluster_ids") or []
+    except Exception:
+        raise HTTPException(status_code=400, detail="Invalid JSON body")
+    if not isinstance(cluster_ids, list):
+        raise HTTPException(status_code=400, detail="cluster_ids must be a list")
+
+    n = svc.skip_clusters(db, cluster_ids, decided_by="operator")
+
+    _SCAN_CACHE["at"] = 0.0
+    _SCAN_CACHE["result"] = None
+
+    return {"skipped": n}
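The in-process scan cache in the router above (a module-level dict plus a TTL constant, reset by hand in /apply and /skip) could also be wrapped in a small helper. The sketch below is one possible shape, not part of this commit: the `TTLCache` class and its method names are hypothetical, and it uses `time.monotonic` rather than `time.time` so that wall-clock adjustments cannot extend or shorten the TTL.

```python
import time
from typing import Any, Callable, Optional


class TTLCache:
    """Single-value cache that expires `ttl_seconds` after the last set().

    Per-process only, like the module-level dict it replaces; a multi-worker
    deployment would still need a shared store.
    """

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._stored_at: Optional[float] = None
        self._value: Any = None

    def get(self) -> Optional[Any]:
        """Return the cached value, or None if empty or expired."""
        if self._stored_at is None:
            return None
        if self._clock() - self._stored_at >= self._ttl:
            return None
        return self._value

    def set(self, value: Any) -> None:
        self._value = value
        self._stored_at = self._clock()

    def invalidate(self) -> None:
        """Drop the cached value, e.g. after /apply or /skip changes state."""
        self._value = None
        self._stored_at = None
```

With a helper like this, the two-line cache reset repeated in /apply and /skip would become a single `invalidate()` call, and the expiry rule lives in one place instead of inside the /scan handler.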