7ed94cd8fc08e8ad93b6f9f993650e2fc678e087
4 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
ad55d4ca09 |
fix(backfill): location matching over-confident on boilerplate-shared names
rapidfuzz.fuzz.WRatio inflates scores when two strings share substring
tokens, even when the shared tokens are common boilerplate. For
project names this is desirable (catches typos like '1-80' vs 'I-80')
but for location names it produces obvious false positives:
'Area 2 - Brookville Dam - Loc 2 East'
vs
'Area 1 - Loc 1 - 87 Jenks' → WRatio 85.5 (above 0.80 fuzzy threshold)
These share only 'area' + 'loc' + a digit but score 85%+ because WRatio
weights partial-substring overlap heavily. Operator reported the
backfill tool suggesting completely unrelated locations as 86% matches.
Fix: introduce `location_similarity()` — token_set_ratio + multi-digit
mismatch penalty. Used for location matching everywhere; WRatio stays
as the scorer for project names where its leniency is correct.
The multi-digit penalty (-0.30) triggers when both strings contain 2+-
digit numbers and none overlap. Catches the harder "same project,
different address identifier" case:
'Area 1 - Loc 2 - 68 Jenks' vs 'Area 1 - Loc 1 - 87 Jenks'
token_set_ratio = 0.91 (would still match without penalty)
multi-digit tokens {68} and {87} disjoint → -0.30 → 0.61 (rejected)
Single-digit tokens ('Loc 1', 'Area 2') are excluded from the penalty
because they're often coincidentally shared.
Updated:
- backend/services/metadata_backfill.py: new location_similarity()
function; _find_best_match() gains a `kind` parameter that selects
scorer; cluster-match call site passes kind='location'
- backend/routers/metadata_backfill.py: locations_search endpoint
(the typeahead dropdown's data source) uses location_similarity
instead of similarity for the same reason
Verified all six test cases land correctly:
- user-reported false positive: 0.85 → 0.59 (rejected)
- '87 Jenks' vs '68 Jenks': 0.90 → 0.61 (rejected)
- NRL-01 vs NRL-02: 0.83 → 0.53 (rejected)
- 'Loc 2 - 735 Bunola' vs 'Loc 2 735 Bunola Rd': 1.00 (still matches)
- punctuation-only difference: 1.00 (still matches)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
||
|
|
d46f9fccf8 |
fix(sfm): broaden Loc-N suffix regex to catch '.Loc' and 'Loc No.' variants
Operators use more separator variations than the original regex caught: - "Trumbull-Brayman-JV- Mont.Dam.Loc 2-R-25" — period as separator - "CMU - RKM Hall - Loc No. 3 - 4615 Forbes" — "No." between Loc and digit Added period to the separator character class and optional "No." token before the digit. Catches both above patterns plus near-variants without false-positives on normal project strings. Real-data impact: 5 more clusters now auto-strip cleanly, including the 1,903-event Trumbull-Brayman-JV- Mont.Dam cluster. Confidence distribution: 43 → 44 high. |
||
|
|
6ebbe28308 |
feat(sfm): strip "- Loc N" suffix from operator-typed project names
Operators sometimes bake location identifiers into the project string for email-readability — "Fay - Locks & Dam No3 - Loc 2 - 735 Bunola" where "Fay - Locks & Dam No3" is the actual project and "- Loc 2 - 735 Bunola" is location info that already lives in sensor_location. Without stripping, every "- Loc N" variant became a separate project, fragmenting what should be one project with several locations. Backend: - New _extract_project_root() helper. Regex matches " - Loc N" / "-Loc3" / " - Location #5" / etc. with case-insensitive multi-dash support; strips from that marker forward and cleans up dangling separators. Strings without a Loc-marker pass through unchanged. - Cluster dataclass adds project_root field alongside project_raw. project_raw stays the operator-typed string for display ("hover to see what was actually typed"). project_root is what gets normalised for matching and used as the suggested project name. - _ensure_project + _ensure_location now do normalisation-aware dedup before creating: a cluster of "SR81" and a cluster of "SR 81" (which normalise to the same string) collapse into one project on apply, even when applied in the same bulk operation. Avoids UNIQUE constraint collisions and duplicate-named-by-spacing projects. Frontend: - Wizard cluster cards show "↳ stripped trailing 'Loc N' suffix; operator typed: <raw>" when project_root differs from project_raw, so the operator can see at a glance what the parser did to the string. Real-data results: against the same 10,055 SFM events, confidence distribution improved from 37/14/8 (high/med/low) to 43/9/7. "Fay - Locks & Dam No3" now appears as ONE project across 6 cluster instances spanning 3 serials and 6 different locations — exactly the "one project, many locations" model the user described. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> |
||
|
|
42de06f441 |
feat(sfm): Phase 5a — bulk-backfill projects/locations/assignments from event metadata
Operator clicks one button. Parser reads SFM's events table (operator-typed
project / client / sensor_location strings), clusters by serial + time +
metadata, fuzzy-matches against existing projects, and proposes
Project / MonitoringLocation / UnitAssignment chains to create.
Auto-applies high-confidence non-conflicting clusters in bulk; queues
medium/low confidence for individual review.
Verified against real data: 10,052 events → 59 clusters → 37 high-
confidence + 14 medium + 8 low. Test-applied one cluster end-to-end;
Project + Module + Location + Assignment + UnitHistory + Decision rows
all created correctly, and Phase 2's attribution walk picked up the
events automatically on the new location's detail page.
Pipeline (backend/services/metadata_backfill.py, ~700 lines):
1. Pull all SFM events via /db/events per serial.
2. Pre-filter: drop events already covered by an existing UnitAssignment
window (Phase 2 handles those automatically).
3. Time-cluster what's left: serial + 7-day gap is the cluster identity.
4. Metadata-split each time-cluster on persistent metadata transitions
(≥ 2 consecutive events) so a single typo doesn't fork the cluster.
5. Match against existing graph (rapidfuzz.WRatio multi-signal scoring,
normalisation that handles abbreviations / reorders / separator
variations). Thresholds: 0.95 exact, 0.80 fuzzy, min-shorter-input
5 chars to guardrail false positives on single common words.
6. Score confidence (high/medium/low) using event count, span,
blank-meta, conflict, ambiguity rules.
7. Detect conflicts: overlap with existing UnitAssignment at a different
location for the same serial → blocking. Operator must reconcile.
8. Apply: ensure auto_imported ProjectType exists, ensure
vibration_monitoring ProjectModule on the project, write
Project / MonitoringLocation / UnitAssignment / UnitHistory all in
one transaction.
Migration (backend/migrate_add_metadata_backfill.py): adds
unit_assignments.source column (default 'manual') and
metadata_backfill_decisions table. Idempotent, non-destructive.
API (backend/routers/metadata_backfill.py):
GET /api/admin/metadata_backfill/scan — clusters + suggestions
POST /api/admin/metadata_backfill/apply — bulk apply by cluster_ids
w/ optional per-cluster
project/location overrides
POST /api/admin/metadata_backfill/skip — mark skipped (persistent)
UI (templates/admin/metadata_backfill.html, accessible at
/settings/developer/metadata-backfill via the Developer tab of Settings):
- One-button "Run scan" entry.
- Summary KPI tiles (scanned / already attributed / pending / conflicts).
- "Apply all high-confidence" bulk button at the top — primary path.
- Per-cluster cards below with Apply / Skip / Preview event actions.
- Blank-meta clusters get inline input fields for operator-typed project +
location names before applying.
- Blocking-conflict clusters render with the conflicting assignment
information and a disabled Apply button.
- Live progress toast during apply.
- Reuses the Phase 1+2+4 event-detail modal for "Preview event" — operator
can sanity-check the BW report data against the cluster's sample event.
Dependencies: rapidfuzz==3.10.1 added to requirements.txt. Pre-built C
wheels for all platforms, ~5s docker build hit.
Phase 5b (deferred to next session): swap-detection daily background job,
notification inbox for auto-applied swaps, recently-applied audit view,
"Tidy" page for renaming/merging auto-created projects.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|