feat(sfm): Phase 5a — bulk-backfill projects/locations/assignments from event metadata

Operator clicks one button.  Parser reads SFM's events table (operator-typed
project / client / sensor_location strings), clusters by serial + time +
metadata, fuzzy-matches against existing projects, and proposes
Project / MonitoringLocation / UnitAssignment chains to create.
Auto-applies high-confidence non-conflicting clusters in bulk; queues
medium/low confidence for individual review.

Verified against real data: 10,052 events → 59 clusters → 37 high-
confidence + 14 medium + 8 low.  Test-applied one cluster end-to-end;
Project + Module + Location + Assignment + UnitHistory + Decision rows
all created correctly, and Phase 2's attribution walk picked up the
events automatically on the new location's detail page.

Pipeline (backend/services/metadata_backfill.py, ~700 lines):
  1. Pull all SFM events via /db/events per serial.
  2. Pre-filter: drop events already covered by an existing UnitAssignment
     window (Phase 2 handles those automatically).
  3. Time-cluster what's left: serial + 7-day gap is the cluster identity.
  4. Metadata-split each time-cluster on persistent metadata transitions
     (≥ 2 consecutive events) so a single typo doesn't fork the cluster.
  5. Match against existing graph (rapidfuzz.WRatio multi-signal scoring,
     normalisation that handles abbreviations / reorders / separator
     variations).  Thresholds: 0.95 exact, 0.80 fuzzy; the shorter input
     must be at least 5 chars, to guard against false positives on single
     common words.
  6. Score confidence (high/medium/low) using event count, span,
     blank-meta, conflict, ambiguity rules.
  7. Detect conflicts: overlap with existing UnitAssignment at a different
     location for the same serial → blocking.  Operator must reconcile.
  8. Apply: ensure auto_imported ProjectType exists, ensure
     vibration_monitoring ProjectModule on the project, write
     Project / MonitoringLocation / UnitAssignment / UnitHistory all in
     one transaction.
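
Steps 3 and 4 can be sketched roughly as below.  This is an illustration,
not the code in metadata_backfill.py: the Event shape, the function names,
and the choice to split only when the gap is strictly greater than 7 days
are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from itertools import groupby

GAP = timedelta(days=7)   # assumption: a gap strictly greater than this splits
MIN_PERSIST = 2           # a new metadata value must hold for >= 2 consecutive events


@dataclass(frozen=True)
class Event:              # hypothetical shape, not the SFM schema
    serial: str
    ts: datetime
    meta: tuple           # e.g. (project, location) as typed by the operator


def time_clusters(events):
    """Group events per serial; start a new cluster after a gap > GAP."""
    clusters = []
    ordered = sorted(events, key=lambda e: (e.serial, e.ts))
    for _, group in groupby(ordered, key=lambda e: e.serial):
        current = []
        for ev in group:
            if current and ev.ts - current[-1].ts > GAP:
                clusters.append(current)
                current = []
            current.append(ev)
        clusters.append(current)
    return clusters


def split_on_persistent_change(cluster):
    """Split a non-empty cluster where the metadata changes and the new
    value persists for MIN_PERSIST consecutive events, so a one-off typo
    stays inside the surrounding cluster."""
    parts = [[cluster[0]]]
    active = cluster[0].meta
    for i, ev in enumerate(cluster[1:], start=1):
        run = cluster[i:i + MIN_PERSIST]
        persistent = len(run) == MIN_PERSIST and all(e.meta == ev.meta for e in run)
        if ev.meta != active and persistent:
            parts.append([])
            active = ev.meta
        parts[-1].append(ev)
    return parts
```

With metadata values A A B A A C C, the lone B stays in the first part and
only the C C run opens a second one.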

Migration (backend/migrate_add_metadata_backfill.py): adds
unit_assignments.source column (default 'manual') and
metadata_backfill_decisions table.  Idempotent, non-destructive.
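
A minimal sketch of what "idempotent, non-destructive" can look like,
assuming a SQLite database and an existing unit_assignments table.  The
columns of metadata_backfill_decisions shown here are hypothetical; the
real migration script may differ.

```python
import sqlite3


def migrate(conn: sqlite3.Connection) -> None:
    """Add unit_assignments.source and the decisions table if missing.

    Safe to run repeatedly: each step first checks whether it is needed.
    """
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    cols = {row[1] for row in conn.execute("PRAGMA table_info(unit_assignments)")}
    if "source" not in cols:
        conn.execute(
            "ALTER TABLE unit_assignments "
            "ADD COLUMN source TEXT NOT NULL DEFAULT 'manual'"
        )
    # Column names below are illustrative assumptions.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS metadata_backfill_decisions (
            cluster_id TEXT PRIMARY KEY,
            decision   TEXT NOT NULL,
            decided_by TEXT,
            decided_at TEXT DEFAULT CURRENT_TIMESTAMP
        )
        """
    )
    conn.commit()
```

Existing rows pick up source = 'manual' through the column default, so no
data is rewritten.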

API (backend/routers/metadata_backfill.py):
  GET  /api/admin/metadata_backfill/scan          — clusters + suggestions
  POST /api/admin/metadata_backfill/apply         — bulk apply by cluster_ids
                                                     w/ optional per-cluster
                                                     project/location overrides
  POST /api/admin/metadata_backfill/skip          — mark skipped (persistent)

UI (templates/admin/metadata_backfill.html, accessible at
/settings/developer/metadata-backfill via the Developer tab of Settings):
  - One-button "Run scan" entry.
  - Summary KPI tiles (scanned / already attributed / pending / conflicts).
  - "Apply all high-confidence" bulk button at the top — primary path.
  - Per-cluster cards below with Apply / Skip / Preview event actions.
  - Blank-meta clusters get inline input fields for operator-typed project +
    location names before applying.
  - Blocking-conflict clusters render with the conflicting assignment
    information and a disabled Apply button.
  - Live progress toast during apply.
  - Reuses the Phase 1+2+4 event-detail modal for "Preview event" — operator
    can sanity-check the BW report data against the cluster's sample event.

Dependencies: rapidfuzz==3.10.1 added to requirements.txt.  Pre-built C
wheels are available for all platforms; adds ~5s to the docker build.
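
For orientation, the threshold logic that the matcher applies can be
approximated with the standard library.  This sketch swaps in
difflib.SequenceMatcher for rapidfuzz.WRatio so it runs without the
dependency; its scores will not equal WRatio's, its normalisation only
covers case, separators and word order (not abbreviations), and the
"exact" / "fuzzy" labels and function names are assumptions.

```python
import re
from difflib import SequenceMatcher

EXACT, FUZZY, MIN_LEN = 0.95, 0.80, 5   # thresholds from the commit message


def normalise(name: str) -> str:
    """Lowercase, split on any non-alphanumeric run, and sort the tokens."""
    tokens = re.split(r"[^a-z0-9]+", name.lower())
    return " ".join(sorted(t for t in tokens if t))


def classify(raw: str, existing: str) -> str:
    """Return 'exact', 'fuzzy' or 'create_new' for a raw name vs. an existing one."""
    a, b = normalise(raw), normalise(existing)
    # Guardrail: very short inputs only match when identical after normalising.
    if min(len(a), len(b)) < MIN_LEN:
        return "exact" if a and a == b else "create_new"
    score = SequenceMatcher(None, a, b).ratio()   # stand-in for WRatio / 100
    if score >= EXACT:
        return "exact"
    if score >= FUZZY:
        return "fuzzy"
    return "create_new"
```

Here classify("Main St. Bridge", "bridge_main-st") normalises both sides to
the same string, while classify("Dam", "Dam Road") is stopped by the
minimum-length guard.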

Phase 5b (deferred to next session): swap-detection daily background job,
notification inbox for auto-applied swaps, recently-applied audit view,
"Tidy" page for renaming/merging auto-created projects.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 05:54:57 +00:00
parent 21844b4d65
commit 42de06f441
8 changed files with 1828 additions and 0 deletions
@@ -0,0 +1,226 @@
"""
Metadata-backfill admin router.

Endpoints under /api/admin/metadata_backfill:
  GET  /scan   - run the scan; return clusters + suggestions (JSON).
                 Cached 5 minutes so the wizard doesn't re-scan on
                 every page render.
  POST /apply  - apply a list of cluster_ids; body specifies which to
                 accept and optional per-cluster overrides.
  POST /skip   - mark cluster_ids as skipped (won't reappear).
"""
from __future__ import annotations

import os
import time
from typing import Optional

from fastapi import APIRouter, Depends, HTTPException, Request
from fastapi.responses import JSONResponse
from sqlalchemy.orm import Session

from backend.database import get_db
from backend.services import metadata_backfill as svc

router = APIRouter(prefix="/api/admin/metadata_backfill", tags=["metadata-backfill"])

SFM_BASE_URL = os.getenv("SFM_BASE_URL", "http://localhost:8200")

# In-process scan cache.  Trades memory for not re-hammering SFM on every
# wizard render.  TTL: 5 minutes.  Singleton per-process; fine for a
# single-worker uvicorn dev setup.  For prod multi-worker we'd want to put
# this in the DB or Redis; deferred.
_SCAN_CACHE: dict = {"at": 0.0, "result": None}
_SCAN_CACHE_TTL_SECONDS = 300.0


def _serialise_suggestion(s: svc.Suggestion) -> dict:
    c = s.cluster
    return {
        "cluster_id": c.cluster_id,
        "serial": c.serial,
        "first_event_ts": c.first_event_ts.isoformat(),
        "last_event_ts": c.last_event_ts.isoformat(),
        "event_count": c.event_count,
        "sample_event_id": c.sample_event_id,
        "project_raw": c.project_raw,
        "location_raw": c.location_raw,
        "client_raw": c.client_raw,
        "operator_raw": c.operator_raw,
        "is_blank_meta": c.is_blank_meta,
        "metadata_consistency": c.metadata_consistency,
        "project_match": s.project_match,
        "project_existing_id": s.project_existing_id,
        "project_existing_name": s.project_existing_name,
        "project_match_score": s.project_match_score,
        "project_suggested_name": s.project_suggested_name,
        "location_match": s.location_match,
        "location_existing_id": s.location_existing_id,
        "location_existing_name": s.location_existing_name,
        "location_match_score": s.location_match_score,
        "location_suggested_name": s.location_suggested_name,
        "proposed_assigned_at": s.proposed_assigned_at.isoformat(),
        "proposed_assigned_until": s.proposed_assigned_until.isoformat() if s.proposed_assigned_until else None,
        "confidence": s.confidence,
        "blocking_conflict": s.blocking_conflict,
        "conflicts": [
            {
                "existing_assignment_id": cf.existing_assignment_id,
                "other_location_id": cf.other_location_id,
                "other_location_name": cf.other_location_name,
                "other_project_id": cf.other_project_id,
                "other_project_name": cf.other_project_name,
            }
            for cf in s.conflicts
        ],
    }


@router.get("/scan")
async def scan(
    force: bool = False,
    db: Session = Depends(get_db),
):
    """Run a scan and return clusters + suggestions.

    Set force=true to bypass the 5-minute cache.
    """
    now = time.time()
    if not force and _SCAN_CACHE["result"] is not None \
            and (now - _SCAN_CACHE["at"]) < _SCAN_CACHE_TTL_SECONDS:
        return _SCAN_CACHE["result"]

    result = await svc.scan_clusters_and_build_suggestions(db, SFM_BASE_URL)

    # Group suggestions for the wizard UI.
    by_confidence = {"high": [], "medium": [], "low": []}
    blocking_conflict_count = 0
    for s in result.suggestions:
        by_confidence[s.confidence].append(_serialise_suggestion(s))
        if s.blocking_conflict:
            blocking_conflict_count += 1

    payload = {
        "scanned_event_count": result.scanned_event_count,
        "cluster_count": result.cluster_count,
        "already_attributed": result.already_attributed,
        "skipped_orphans": result.skipped_orphans,
        "pending_count": len(result.suggestions),
        "blocking_conflict_count": blocking_conflict_count,
        "by_confidence": {
            "high": by_confidence["high"],
            "medium": by_confidence["medium"],
            "low": by_confidence["low"],
        },
        "scanned_at": now,
    }
    _SCAN_CACHE["result"] = payload
    _SCAN_CACHE["at"] = now
    return payload


@router.post("/apply")
async def apply(
    request: Request,
    db: Session = Depends(get_db),
):
    """Apply a list of clusters.

    Body:
        {
          "cluster_ids": ["abc...", "def..."],
          "overrides": { "abc...": { "project_name": "...", "location_name": "..." } }
        }

    To accept ALL non-conflict suggestions in one shot, the UI sends every
    pending cluster_id with no overrides.
    """
    try:
        body = await request.json()
    except Exception:
        raise HTTPException(status_code=400, detail="Invalid JSON body")

    cluster_ids = body.get("cluster_ids") or []
    overrides = body.get("overrides") or {}
    if not isinstance(cluster_ids, list) or not cluster_ids:
        raise HTTPException(status_code=400, detail="cluster_ids must be a non-empty list")

    # Re-scan to get current suggestions.  We don't trust the cached scan
    # blindly: the operator might have manually created projects in
    # between scan and apply.
    scan_result = await svc.scan_clusters_and_build_suggestions(db, SFM_BASE_URL)
    suggestions_by_id = {s.cluster.cluster_id: s for s in scan_result.suggestions}

    selected: list[svc.Suggestion] = []
    not_found: list[str] = []
    for cid in cluster_ids:
        s = suggestions_by_id.get(cid)
        if s is None:
            not_found.append(cid)
            continue
        # Apply overrides.
        ov = overrides.get(cid) or {}
        if "project_name" in ov:
            s.project_suggested_name = (ov["project_name"] or "").strip() or s.project_suggested_name
            # Override implies operator wants to create new (or rename).
            # If they wanted an exact match, they'd not have overridden.
            if s.project_match in ("create_new",):
                pass  # keep create_new
            else:
                # Operator typed a custom name, so force create-new behaviour
                # so we don't accidentally attach to a different existing
                # project by exact-match.
                s.project_existing_id = None
                s.project_match = "create_new"
        if "location_name" in ov:
            s.location_suggested_name = (ov["location_name"] or "").strip() or s.location_suggested_name
            if s.location_match in ("create_new",):
                pass
            else:
                s.location_existing_id = None
                s.location_match = "create_new"
        selected.append(s)

    apply_result = svc.apply_suggestions(db, selected, decided_by="operator")

    # Invalidate the scan cache so the next /scan picks up the new state.
    _SCAN_CACHE["at"] = 0.0
    _SCAN_CACHE["result"] = None

    return {
        "applied": apply_result.applied,
        "failed": [{"cluster_id": cid, "reason": r} for cid, r in apply_result.failed],
        "not_found": not_found,
        "project_ids_created": apply_result.project_ids_created,
        "location_ids_created": apply_result.location_ids_created,
        "assignment_ids_created": apply_result.assignment_ids_created,
    }


@router.post("/skip")
async def skip(
    request: Request,
    db: Session = Depends(get_db),
):
    """Mark cluster_ids as skipped; they won't reappear in future scans."""
    try:
        body = await request.json()
    except Exception:
        raise HTTPException(status_code=400, detail="Invalid JSON body")

    cluster_ids = body.get("cluster_ids") or []
    if not isinstance(cluster_ids, list):
        raise HTTPException(status_code=400, detail="cluster_ids must be a list")

    n = svc.skip_clusters(db, cluster_ids, decided_by="operator")
    _SCAN_CACHE["at"] = 0.0
    _SCAN_CACHE["result"] = None
    return {"skipped": n}