feat(projects): Tidy page for fuzzy-detecting + bulk-merging duplicate projects
Phase 5b first slice. Surfaces near-duplicate projects (typo variants, abbreviation differences, spacing variations like "SR81" vs "SR 81") as side-by-side pairs the operator can merge with one click. Backend (backend/services/project_tidy.py): - find_duplicate_pairs(db, threshold=0.85) walks all active projects and computes rapidfuzz.WRatio similarity for every pair. Pre-filters too-short normalised names (< 4 chars) to avoid noise. Skips soft-deleted projects. Returns pairs sorted by score desc, then by total content (more assignments → review first). - Each pair carries a suggested merge target with a human-readable reason. Priorities (in order): manual source over parser source, populated project_number, more locations, more assignments, shorter name. Operator can override the suggestion by clicking the OTHER direction button. - O(N^2) over project count. Fine up to ~500 projects. Token-prefix blocking is the obvious next optimisation if it becomes slow. Backend (backend/routers/projects.py): - GET /api/projects/admin/duplicate_pairs?threshold=&max_pairs= returns pairs as JSON for the Tidy page. Frontend (templates/admin/project_tidy.html): - New admin page at /settings/developer/project-tidy. Threshold selector (95% / 90% / 85% / 80%) at the top; rescan button next to it; auto- scans on load. - Each pair card shows side-by-side project summaries (name, project_ number, client, source-badge, location/assignment counts) with the suggested target visually highlighted (orange border). Three buttons: "Merge A → B", "Merge B → A", "Not a dup" (hide locally). - Click-to-merge opens a native confirm with the preview totals (assignments/sessions/data files moving, consolidations) — same data the project_header.html merge modal shows. On confirm, hits the existing /merge_into endpoint and re-scans automatically. - Source badges distinguish parser-created (`metadata_backfill`) from manual projects — at a glance the operator can see "this duplicate is parser-generated; safe to merge into the manual one". Frontend (templates/admin/metadata_backfill.html): - Apply-result handling now surfaces failed[] cluster reasons in a dedicated failure panel (bottom-left, dismissable). Previously a 200 OK with all-failures showed a misleading "1 cluster applied" success toast because the count and the failure list weren't being reconciled. This bit us during the DB-revert recovery earlier — the project_modules table was missing, every apply silently rolled back, user saw success toasts. Fixed. Smoke-verified against current state (10K events, 9 projects, post- merge): tool correctly finds 0 pairs at threshold 0.85 (data is clean), 1 false-positive at 0.70 (two unrelated projects sharing the token "81" — example of why the 0.85 default is correct). Settings link added under Developer → Project Tidy. Phase 5c (swap-detection daily background job + notification inbox) remains deferred to the next session. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -0,0 +1,235 @@
|
||||
"""
|
||||
project_tidy.py — find duplicate-looking projects + offer bulk merge.
|
||||
|
||||
The metadata-backfill parser is good at clustering events into candidate
|
||||
projects but doesn't compare its proposed project names against EACH OTHER
|
||||
(it only checks against existing terra-view projects). After a bulk
|
||||
apply, you can end up with many near-duplicate projects — typo variants,
|
||||
abbreviation differences, etc. This module surfaces them as pairs the
|
||||
operator can merge.
|
||||
|
||||
Pairs vs clusters: a fully-connected group like (A, B, C) where each pair
|
||||
scores >= threshold becomes 3 pairs. The operator has to do 2 merges to
|
||||
fully consolidate. We don't try to be smarter about transitive grouping —
|
||||
in practice operators want to review the highest-similarity pair first
|
||||
anyway, and the list re-computes after each merge.
|
||||
|
||||
Public API:
|
||||
find_duplicate_pairs(db, *, threshold=0.85, max_pairs=200) → list[DuplicatePair]
|
||||
"""
|
||||
|
||||
from __future__ import annotations
|
||||
|
||||
import logging
|
||||
from dataclasses import dataclass
|
||||
from typing import Optional
|
||||
|
||||
import rapidfuzz
|
||||
from sqlalchemy import func
|
||||
from sqlalchemy.orm import Session
|
||||
|
||||
from backend.models import (
|
||||
Project,
|
||||
MonitoringLocation,
|
||||
UnitAssignment,
|
||||
)
|
||||
from backend.services.metadata_backfill import _normalise as _meta_normalise
|
||||
|
||||
log = logging.getLogger("backend.services.project_tidy")
|
||||
|
||||
|
||||
DEFAULT_THRESHOLD = 0.85 # WRatio similarity above which we surface a pair
|
||||
DEFAULT_MAX_PAIRS = 200 # Cap the result list to keep response small
|
||||
MIN_NORMALISED_LENGTH = 4 # Skip projects whose normalised name is too short
|
||||
# to fuzzy-match safely (avoids "1" / "1" pairs).
|
||||
|
||||
|
||||
@dataclass
|
||||
class ProjectSummary:
|
||||
id: str
|
||||
name: str
|
||||
project_number: Optional[str]
|
||||
client_name: Optional[str]
|
||||
source: str # 'manual' | 'metadata_backfill' | ...
|
||||
status: str
|
||||
location_count: int
|
||||
assignment_count: int
|
||||
event_count_total: int # approx — sum across assignments
|
||||
|
||||
|
||||
@dataclass
|
||||
class DuplicatePair:
|
||||
a: ProjectSummary
|
||||
b: ProjectSummary
|
||||
score: float
|
||||
suggested_target_id: str # the recommended "keep" side
|
||||
reason: str # why we picked that target
|
||||
|
||||
|
||||
# ── Helpers ──────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def _normalise_project_name(name: str) -> str:
|
||||
"""Project-name normalisation for tidy comparison.
|
||||
|
||||
Reuses the metadata_backfill normaliser (lowercase, punctuation→space,
|
||||
collapse whitespace). Returns "" for None or all-punctuation names.
|
||||
"""
|
||||
return _meta_normalise(name)
|
||||
|
||||
|
||||
def _summarise_projects(db: Session) -> list[ProjectSummary]:
|
||||
"""One row per active project with cached counts. Excludes deleted."""
|
||||
projects = (
|
||||
db.query(Project)
|
||||
.filter(Project.status != "deleted")
|
||||
.all()
|
||||
)
|
||||
|
||||
# Bulk lookup: assignment counts + location counts per project.
|
||||
loc_counts: dict[str, int] = dict(
|
||||
db.query(MonitoringLocation.project_id, func.count(MonitoringLocation.id))
|
||||
.filter(MonitoringLocation.project_id.in_([p.id for p in projects]) if projects else False)
|
||||
.group_by(MonitoringLocation.project_id)
|
||||
.all()
|
||||
)
|
||||
asgn_counts: dict[str, int] = dict(
|
||||
db.query(UnitAssignment.project_id, func.count(UnitAssignment.id))
|
||||
.filter(UnitAssignment.project_id.in_([p.id for p in projects]) if projects else False)
|
||||
.group_by(UnitAssignment.project_id)
|
||||
.all()
|
||||
)
|
||||
|
||||
summaries: list[ProjectSummary] = []
|
||||
for p in projects:
|
||||
summaries.append(ProjectSummary(
|
||||
id = p.id,
|
||||
name = p.name,
|
||||
project_number = p.project_number,
|
||||
client_name = p.client_name,
|
||||
source = None, # filled below per assignment
|
||||
status = p.status or "active",
|
||||
location_count = loc_counts.get(p.id, 0),
|
||||
assignment_count = asgn_counts.get(p.id, 0),
|
||||
event_count_total = 0, # not cheap to compute here; left 0
|
||||
))
|
||||
|
||||
# Determine each project's dominant assignment source. Used to break ties
|
||||
# when picking the "keep" target — prefer manual over parser-created.
|
||||
rows = (
|
||||
db.query(UnitAssignment.project_id, UnitAssignment.source, func.count(UnitAssignment.id))
|
||||
.group_by(UnitAssignment.project_id, UnitAssignment.source)
|
||||
.all()
|
||||
)
|
||||
by_proj_src: dict[str, dict[str, int]] = {}
|
||||
for proj_id, src, cnt in rows:
|
||||
by_proj_src.setdefault(proj_id, {})[src or "manual"] = cnt
|
||||
for s in summaries:
|
||||
src_map = by_proj_src.get(s.id, {})
|
||||
if not src_map:
|
||||
s.source = "manual"
|
||||
else:
|
||||
# Dominant source (most assignments).
|
||||
s.source = max(src_map.items(), key=lambda kv: kv[1])[0]
|
||||
|
||||
return summaries
|
||||
|
||||
|
||||
def _pick_target(a: ProjectSummary, b: ProjectSummary) -> tuple[str, str]:
|
||||
"""Decide which project should be the merge target (the one we keep).
|
||||
|
||||
Priorities (in order):
|
||||
1. The one with `source='manual'` over `source='metadata_backfill'`
|
||||
— operator-curated projects beat parser-created ones.
|
||||
2. The one with a populated `project_number`.
|
||||
3. The one with more locations (more curation history).
|
||||
4. The one with more assignments.
|
||||
5. The one with the shorter, cleaner name (tiebreaker).
|
||||
|
||||
Returns (target_id, reason_string).
|
||||
"""
|
||||
# 1. Source provenance.
|
||||
a_manual = a.source == "manual"
|
||||
b_manual = b.source == "manual"
|
||||
if a_manual and not b_manual:
|
||||
return a.id, "A is manually-created; B is parser-created"
|
||||
if b_manual and not a_manual:
|
||||
return b.id, "B is manually-created; A is parser-created"
|
||||
|
||||
# 2. project_number populated.
|
||||
if a.project_number and not b.project_number:
|
||||
return a.id, "A has a project_number; B doesn't"
|
||||
if b.project_number and not a.project_number:
|
||||
return b.id, "B has a project_number; A doesn't"
|
||||
|
||||
# 3. More locations.
|
||||
if a.location_count > b.location_count:
|
||||
return a.id, f"A has more locations ({a.location_count} vs {b.location_count})"
|
||||
if b.location_count > a.location_count:
|
||||
return b.id, f"B has more locations ({b.location_count} vs {a.location_count})"
|
||||
|
||||
# 4. More assignments.
|
||||
if a.assignment_count > b.assignment_count:
|
||||
return a.id, f"A has more assignments ({a.assignment_count} vs {b.assignment_count})"
|
||||
if b.assignment_count > a.assignment_count:
|
||||
return b.id, f"B has more assignments ({b.assignment_count} vs {a.assignment_count})"
|
||||
|
||||
# 5. Shorter name (less likely to have baked-in junk).
|
||||
if len(a.name) <= len(b.name):
|
||||
return a.id, "A has the shorter / cleaner name"
|
||||
return b.id, "B has the shorter / cleaner name"
|
||||
|
||||
|
||||
# ── Public ───────────────────────────────────────────────────────────────────
|
||||
|
||||
|
||||
def find_duplicate_pairs(
|
||||
db: Session,
|
||||
*,
|
||||
threshold: float = DEFAULT_THRESHOLD,
|
||||
max_pairs: int = DEFAULT_MAX_PAIRS,
|
||||
) -> list[DuplicatePair]:
|
||||
"""Compute all project-pair similarities above `threshold`.
|
||||
|
||||
O(N^2) over the project count — fine up to ~500 projects; beyond that
|
||||
we'd want a blocked / token-indexed approach. In practice
|
||||
`metadata_backfill` projects tend to share tokens, so a simple
|
||||
pre-filter (skip pairs that share NO tokens) would cheaply cut the
|
||||
inner loop. Deferred until profiling motivates it.
|
||||
"""
|
||||
summaries = _summarise_projects(db)
|
||||
|
||||
# Pre-compute normalised names; skip too-short ones.
|
||||
norm_by_id: dict[str, str] = {}
|
||||
candidates: list[ProjectSummary] = []
|
||||
for s in summaries:
|
||||
n = _normalise_project_name(s.name)
|
||||
if len(n) < MIN_NORMALISED_LENGTH:
|
||||
continue
|
||||
norm_by_id[s.id] = n
|
||||
candidates.append(s)
|
||||
|
||||
pairs: list[DuplicatePair] = []
|
||||
n = len(candidates)
|
||||
for i in range(n):
|
||||
a = candidates[i]
|
||||
a_norm = norm_by_id[a.id]
|
||||
for j in range(i + 1, n):
|
||||
b = candidates[j]
|
||||
b_norm = norm_by_id[b.id]
|
||||
score = rapidfuzz.fuzz.WRatio(a_norm, b_norm) / 100.0
|
||||
if score < threshold:
|
||||
continue
|
||||
target_id, reason = _pick_target(a, b)
|
||||
pairs.append(DuplicatePair(
|
||||
a = a,
|
||||
b = b,
|
||||
score = score,
|
||||
suggested_target_id = target_id,
|
||||
reason = reason,
|
||||
))
|
||||
|
||||
# Sort by score desc, then by total content (more data → review first).
|
||||
pairs.sort(key=lambda p: (-p.score, -(p.a.assignment_count + p.b.assignment_count)))
|
||||
|
||||
return pairs[:max_pairs]
|
||||
Reference in New Issue
Block a user