fix(backfill): location matching over-confident on boilerplate-shared names

rapidfuzz.fuzz.WRatio inflates scores when two strings share substring
tokens, even when the shared tokens are common boilerplate. For
project names this is desirable (it catches typos like '1-80' vs 'I-80'),
but for location names it produces obvious false positives:

    'Area 2 - Brookville Dam - Loc 2 East'
    vs
    'Area 1 - Loc 1 - 87 Jenks'
    → WRatio 85.5, i.e. 0.855 (above the 0.80 fuzzy threshold)

These share only 'area' + 'loc' + a digit but score 85%+ because WRatio
weights partial-substring overlap heavily. An operator reported the
backfill tool suggesting completely unrelated locations as 86% matches.

Fix: introduce `location_similarity()`, which combines token_set_ratio
with a multi-digit mismatch penalty. It is used for location matching
everywhere; WRatio stays as the scorer for project names, where its
leniency is correct.

The multi-digit penalty (-0.30) triggers when both strings contain
numbers of two or more digits and none of those numbers overlap. It
catches the harder "same project, different address identifier" case:

    'Area 1 - Loc 2 - 68 Jenks' vs 'Area 1 - Loc 1 - 87 Jenks'
    token_set_ratio = 0.91 (would still match without the penalty)
    multi-digit tokens {68} and {87} disjoint → -0.30 → 0.61 (rejected)

Single-digit tokens ('Loc 1', 'Area 2') are excluded from the penalty
because they're often coincidentally shared.
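
The penalty rule described above can be sketched roughly as follows. This is an illustration only, not the code in backend/services/metadata_backfill.py: the helper names `multi_digit_tokens` and `apply_multi_digit_penalty`, the regex, and the clamp at 0.0 are assumptions, and the base score is passed in rather than computed with token_set_ratio so the sketch needs only the standard library.

```python
import re
from typing import Set

# Assumption: a "multi-digit number" is any run of two or more digits.
_MULTI_DIGIT = re.compile(r"\d{2,}")
PENALTY = 0.30  # value stated in the commit message


def multi_digit_tokens(text: str) -> Set[str]:
    """Return the numbers with two or more digits found in text."""
    return set(_MULTI_DIGIT.findall(text))


def apply_multi_digit_penalty(base_score: float, a: str, b: str) -> float:
    """Subtract PENALTY when both strings contain multi-digit numbers
    and share none of them; otherwise return base_score unchanged."""
    nums_a = multi_digit_tokens(a)
    nums_b = multi_digit_tokens(b)
    if nums_a and nums_b and nums_a.isdisjoint(nums_b):
        # Clamping at 0.0 is an assumption of this sketch.
        return max(0.0, base_score - PENALTY)
    return base_score
```

With the '68 Jenks' vs '87 Jenks' pair and a base score of 0.91, the sketch returns roughly 0.61. For the 'Brookville Dam' pair only one side has a multi-digit number, so no penalty applies and the base score is returned as-is.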

Updated:
- backend/services/metadata_backfill.py: new `location_similarity()`
  function; `_find_best_match()` gains a `kind` parameter that selects
  the scorer; cluster-match call site passes kind='location'
- backend/routers/metadata_backfill.py: locations_search endpoint
  (the typeahead dropdown's data source) uses `location_similarity`
  instead of `similarity` for the same reason
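
One way a `kind` parameter could select the scorer is a simple lookup, sketched below. The function name `find_best_match`, the `scorers` mapping, and the signature are hypothetical and do not reflect the real `_find_best_match()`; the 0.80 default mirrors the fuzzy threshold mentioned earlier in this message.

```python
from typing import Callable, Iterable, Mapping, Optional, Tuple

Scorer = Callable[[str, str], float]


def find_best_match(
    query: str,
    candidates: Iterable[str],
    scorers: Mapping[str, Scorer],
    kind: str = "project",
    threshold: float = 0.80,
) -> Optional[Tuple[str, float]]:
    """Return the highest-scoring candidate at or above threshold, else None."""
    scorer = scorers[kind]  # an unknown kind raises KeyError
    best: Optional[Tuple[str, float]] = None
    for candidate in candidates:
        score = scorer(query, candidate)
        if score >= threshold and (best is None or score > best[1]):
            best = (candidate, score)
    return best
```

Keeping the scorers in a mapping means the project path and the location path share the same matching loop and differ only in the similarity function they look up.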

Verified all six test cases land correctly (old score → new score):
- user-reported false positive: 0.85 → 0.59 (rejected)
- '87 Jenks' vs '68 Jenks': 0.90 → 0.61 (rejected)
- NRL-01 vs NRL-02: 0.83 → 0.53 (rejected)
- 'Loc 2 - 735 Bunola' vs 'Loc 2 735 Bunola Rd': 1.00 (still matches)
- punctuation-only difference: 1.00 (still matches)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -376,7 +376,11 @@ def locations_search(
         if q_norm in l_norm:
             scored.append((l, 1.0))
             continue
-        score = svc.similarity(q_norm, l_norm)
+        # Use the location-specific scorer (token_set_ratio + multi-digit
+        # penalty) instead of WRatio — same reason as the cluster-match
+        # path: location names share too much boilerplate vocabulary for
+        # WRatio to discriminate reliably.
+        score = svc.location_similarity(q_norm, l_norm)
         if score >= 0.50:
             scored.append((l, score))
 