Commit Graph

5 Commits

Author SHA1 Message Date
serversdown ad55d4ca09 fix(backfill): location matching over-confident on boilerplate-shared names
rapidfuzz.fuzz.WRatio inflates scores when two strings share substring
tokens, even when the shared tokens are common boilerplate.  For
project names this is desirable (catches typos like '1-80' vs 'I-80')
but for location names it produces obvious false positives:

  'Area 2 - Brookville Dam - Loc 2 East'
        vs
  'Area 1 - Loc 1 - 87 Jenks'              → WRatio 85.5 (above 0.80 fuzzy threshold)

These share only 'area' + 'loc' + a digit but score 85%+ because WRatio
weights partial-substring overlap heavily.  Operator reported the
backfill tool suggesting completely unrelated locations as 86% matches.

Fix: introduce `location_similarity()` — token_set_ratio + multi-digit
mismatch penalty.  Used for location matching everywhere; WRatio stays
as the scorer for project names where its leniency is correct.

The multi-digit penalty (-0.30) triggers when both strings contain 2+-
digit numbers and none overlap.  Catches the harder "same project,
different address identifier" case:

  'Area 1 - Loc 2 - 68 Jenks' vs 'Area 1 - Loc 1 - 87 Jenks'
  token_set_ratio = 0.91 (would still match without penalty)
  multi-digit tokens {68} and {87} disjoint → -0.30 → 0.61 (rejected)

Single-digit tokens ('Loc 1', 'Area 2') are excluded from the penalty
because they're often coincidentally shared.

Updated:
- backend/services/metadata_backfill.py: new location_similarity()
  function; _find_best_match() gains a `kind` parameter that selects
  scorer; cluster-match call site passes kind='location'
- backend/routers/metadata_backfill.py: locations_search endpoint
  (the typeahead dropdown's data source) uses location_similarity
  instead of similarity for the same reason

Verified all six test cases land correctly:
- user-reported false positive:         0.85 → 0.59 (rejected)
- '87 Jenks' vs '68 Jenks':            0.90 → 0.61 (rejected)
- NRL-01 vs NRL-02:                    0.83 → 0.53 (rejected)
- 'Loc 2 - 735 Bunola' vs 'Loc 2 735 Bunola Rd':  1.00 (still matches)
- punctuation-only difference:          1.00 (still matches)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-15 04:10:48 +00:00
serversdown d5a0163852 feat(locations): soft-remove monitoring locations without destroying history
When a client drops a location from scope mid-project (e.g. the office
half of a museum+office monitoring job), operators couldn't previously
mark it as no-longer-active without either deleting it (which would
orphan historical events) or leaving it in the active list looking
deployable.  Now there's a proper middle ground.

Data model
- MonitoringLocation gets two new nullable columns:
  - removed_at      — NULL means active; set means soft-removed
  - removal_reason  — optional operator note
  Migration: backend/migrate_add_location_removed.py (idempotent)

Endpoints
- POST /api/projects/{p}/locations/{l}/remove
    Body: { effective_date?: ISO-datetime, reason?: str }
    Side effects (cascade):
      1. Closes active UnitAssignment rows at this location
         (assigned_until = effective_date, status = "completed")
      2. Cancels pending ScheduledActions at this location
      3. Marks location.removed_at = effective_date
    Returns counts of assignments closed + actions cancelled.
- POST /api/projects/{p}/locations/{l}/restore
    Clears removed_at + removal_reason.  Does NOT auto-reopen
    assignments — operator creates new ones if resuming monitoring.

Active-surface filters
- locations-json defaults to active-only; pass include_removed=true
  for historical / reporting views.  Schedule modal dropdowns now
  exclude removed locations automatically.
- Metadata-backfill fuzzy matcher excludes removed locations from
  proposed targets (don't want backfill creating new assignments at
  decommissioned locations).
- Vibration-summary per_location rollup includes removed locations
  (so historical event totals stay accurate) but tags each with
  removed_at so the UI can show a badge.

UI
- Project detail page's Monitoring Locations section now splits into:
    Active locations (full card with Assign / Edit / Remove / Delete)
    Removed locations (collapsed <details>, greyed cards, Restore button,
                       shows removal date + reason)
- New per-card "Remove" button → opens confirmation modal explaining
  the cascade, with optional effective-date (defaults to now,
  backdateable) and reason fields.
- Unit detail's SFM Events attribution cell shows a small "removed"
  badge next to historical attributions whose location is no longer
  active.  Same pattern in vibration_summary's top-locations list.
- Soft-removal indicator surfaced through the events_for_unit
  attribution payload as location_removed_at.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-14 22:22:40 +00:00
serversdown d3b5a3fd26 feat(sfm): inline typeahead override of project + location on each cluster card
Operator no longer has to accept the parser's suggested project /
location verbatim.  Each cluster card now has editable typeahead inputs
that search existing projects (and existing locations within the chosen
project), with a "Create new: <typed>" fallback always available.

Solves the I-80-North-Fork case: of the 20+ cluster variants
("I-80-North Fork Bridges-I80 E. Abutment", "I-80- North Fork
Bridges-543 Plank Rd", etc.), operator types "I-80" in the Project
input, picks the existing project from the dropdown, and the cluster
attaches to it.  Repeat for the other variants.  No need to pre-create
the canonical project — though pre-creation still works fine if you'd
rather.

Backend (backend/routers/metadata_backfill.py):
- GET /api/admin/metadata_backfill/projects_search?q=&limit=
  Returns existing projects matching by case-insensitive substring OR
  rapidfuzz WRatio score >= 0.50.  Substring matches sort to the top
  (treated as exact for ordering).  Includes location_count and
  project_number/client_name in each result for disambiguation.  Always
  emits a "Create new: <q>" suggestion alongside the matches.

- GET /api/admin/metadata_backfill/locations_search?project_id=&q=&limit=
  Same shape, scoped to a single project's vibration locations.

- POST /api/admin/metadata_backfill/apply now accepts four override
  keys per cluster (was previously two):
    project_id       → attach to existing Project (operator picked from
                       typeahead)
    project_name     → create new with this name (operator typed a
                       custom name; existing project_name behaviour)
    location_id      → attach to existing MonitoringLocation; validated
                       against the chosen project_id so a stale location
                       FK can't sneak in
    location_name    → create new location with this name

Frontend (templates/admin/metadata_backfill.html):
- Each non-blank-meta cluster card now has two editable typeahead inputs
  (Project + Location) pre-populated with the parser's suggested
  values.  Old static "Project: + Create new: X" / "≈ Fuzzy match" pills
  replaced with compact hint lines under the inputs showing what the
  current value will do.
- Typeahead dropdown opens on focus, debounced 150ms on type.  Shows
  matched existing entities with score badges (exact / NN%) plus a
  "Create new: <typed>" option at the bottom.  Click-to-pick fills the
  text input and writes the entity id into a hidden field.
- Picking a new project clears the location id (forces re-pick under
  the new project, avoids cross-project location FKs).
- _gatherOverrides re-wired to emit the new project_id / location_id
  keys when the operator picked from the dropdown, falling back to
  *_name when they typed free-form.

Backward-compatible: blank-meta clusters keep their existing "project_name
/ location_name" plain inputs and the override path still honours them.

Verified end-to-end:
- /projects_search?q=I-80 returns the existing "I-80 - North Fork
  Bridge" project (score 1.0, has 4 locations) plus a "Create new"
  option.
- /locations_search requires project_id (400 without it).
- Wizard page renders with typeahead wiring confirmed in HTML.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 19:48:09 +00:00
serversdown 6ebbe28308 feat(sfm): strip "- Loc N" suffix from operator-typed project names
Operators sometimes bake location identifiers into the project string
for email-readability — "Fay - Locks & Dam No3 - Loc 2 - 735 Bunola"
where "Fay - Locks & Dam No3" is the actual project and "- Loc 2 -
735 Bunola" is location info that already lives in sensor_location.
Without stripping, every "- Loc N" variant became a separate project,
fragmenting what should be one project with several locations.

Backend:
- New _extract_project_root() helper.  Regex matches " - Loc N" / "-Loc3" /
  " - Location #5" / etc. with case-insensitive multi-dash support; strips
  from that marker forward and cleans up dangling separators.  Strings
  without a Loc-marker pass through unchanged.

- Cluster dataclass adds project_root field alongside project_raw.
  project_raw stays the operator-typed string for display ("hover to see
  what was actually typed").  project_root is what gets normalised for
  matching and used as the suggested project name.

- _ensure_project + _ensure_location now do normalisation-aware dedup
  before creating: a cluster of "SR81" and a cluster of "SR 81" (which
  normalise to the same string) collapse into one project on apply,
  even when applied in the same bulk operation.  Avoids UNIQUE
  constraint collisions and duplicate-named-by-spacing projects.

Frontend:
- Wizard cluster cards show "↳ stripped trailing 'Loc N' suffix; operator
  typed: <raw>" when project_root differs from project_raw, so the
  operator can see at a glance what the parser did to the string.

Real-data results: against the same 10,055 SFM events, confidence
distribution improved from 37/14/8 (high/med/low) to 43/9/7.  "Fay -
Locks & Dam No3" now appears as ONE project across 6 cluster instances
spanning 3 serials and 6 different locations — exactly the
"one project, many locations" model the user described.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 16:49:14 +00:00
serversdown 42de06f441 feat(sfm): Phase 5a — bulk-backfill projects/locations/assignments from event metadata
Operator clicks one button.  Parser reads SFM's events table (operator-typed
project / client / sensor_location strings), clusters by serial + time +
metadata, fuzzy-matches against existing projects, and proposes
Project / MonitoringLocation / UnitAssignment chains to create.
Auto-applies high-confidence non-conflicting clusters in bulk; queues
medium/low confidence for individual review.

Verified against real data: 10,052 events → 59 clusters → 37 high-
confidence + 14 medium + 8 low.  Test-applied one cluster end-to-end;
Project + Module + Location + Assignment + UnitHistory + Decision rows
all created correctly, and Phase 2's attribution walk picked up the
events automatically on the new location's detail page.

Pipeline (backend/services/metadata_backfill.py, ~700 lines):
  1. Pull all SFM events via /db/events per serial.
  2. Pre-filter: drop events already covered by an existing UnitAssignment
     window (Phase 2 handles those automatically).
  3. Time-cluster what's left: serial + 7-day gap is the cluster identity.
  4. Metadata-split each time-cluster on persistent metadata transitions
     (≥ 2 consecutive events) so a single typo doesn't fork the cluster.
  5. Match against existing graph (rapidfuzz.WRatio multi-signal scoring,
     normalisation that handles abbreviations / reorders / separator
     variations).  Thresholds: 0.95 exact, 0.80 fuzzy, min-shorter-input
     5 chars to guardrail false positives on single common words.
  6. Score confidence (high/medium/low) using event count, span,
     blank-meta, conflict, ambiguity rules.
  7. Detect conflicts: overlap with existing UnitAssignment at a different
     location for the same serial → blocking.  Operator must reconcile.
  8. Apply: ensure auto_imported ProjectType exists, ensure
     vibration_monitoring ProjectModule on the project, write
     Project / MonitoringLocation / UnitAssignment / UnitHistory all in
     one transaction.

Migration (backend/migrate_add_metadata_backfill.py): adds
unit_assignments.source column (default 'manual') and
metadata_backfill_decisions table.  Idempotent, non-destructive.

API (backend/routers/metadata_backfill.py):
  GET  /api/admin/metadata_backfill/scan          — clusters + suggestions
  POST /api/admin/metadata_backfill/apply         — bulk apply by cluster_ids
                                                     w/ optional per-cluster
                                                     project/location overrides
  POST /api/admin/metadata_backfill/skip          — mark skipped (persistent)

UI (templates/admin/metadata_backfill.html, accessible at
/settings/developer/metadata-backfill via the Developer tab of Settings):
  - One-button "Run scan" entry.
  - Summary KPI tiles (scanned / already attributed / pending / conflicts).
  - "Apply all high-confidence" bulk button at the top — primary path.
  - Per-cluster cards below with Apply / Skip / Preview event actions.
  - Blank-meta clusters get inline input fields for operator-typed project +
    location names before applying.
  - Blocking-conflict clusters render with the conflicting assignment
    information and a disabled Apply button.
  - Live progress toast during apply.
  - Reuses the Phase 1+2+4 event-detail modal for "Preview event" — operator
    can sanity-check the BW report data against the cluster's sample event.

Dependencies: rapidfuzz==3.10.1 added to requirements.txt.  Pre-built C
wheels for all platforms, ~5s docker build hit.

Phase 5b (deferred to next session): swap-detection daily background job,
notification inbox for auto-applied swaps, recently-applied audit view,
"Tidy" page for renaming/merging auto-created projects.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-12 05:54:57 +00:00