fix(backfill): location matching over-confident on boilerplate-shared names

rapidfuzz.fuzz.WRatio inflates scores when two strings share substring
tokens, even when the shared tokens are common boilerplate.  For
project names this is desirable (catches typos like '1-80' vs 'I-80')
but for location names it produces obvious false positives:

  'Area 2 - Brookville Dam - Loc 2 East'
        vs
  'Area 1 - Loc 1 - 87 Jenks'              → WRatio 85.5 (above 0.80 fuzzy threshold)

These share only 'area' + 'loc' + a digit but score 85%+ because WRatio
weights partial-substring overlap heavily.  Operator reported the
backfill tool suggesting completely unrelated locations as 86% matches.

Fix: introduce `location_similarity()` — token_set_ratio + multi-digit
mismatch penalty.  Used for location matching everywhere; WRatio stays
as the scorer for project names where its leniency is correct.

The multi-digit penalty (-0.30) triggers when both strings contain 2+-
digit numbers and none overlap.  Catches the harder "same project,
different address identifier" case:

  'Area 1 - Loc 2 - 68 Jenks' vs 'Area 1 - Loc 1 - 87 Jenks'
  token_set_ratio = 0.91 (would still match without penalty)
  multi-digit tokens {68} and {87} disjoint → -0.30 → 0.61 (rejected)

Single-digit tokens ('Loc 1', 'Area 2') are excluded from the penalty
because they're often coincidentally shared.

Updated:
- backend/services/metadata_backfill.py: new location_similarity()
  function; _find_best_match() gains a `kind` parameter that selects
  scorer; cluster-match call site passes kind='location'
- backend/routers/metadata_backfill.py: locations_search endpoint
  (the typeahead dropdown's data source) uses location_similarity
  instead of similarity for the same reason

Verified all six test cases land correctly:
- user-reported false positive:         0.85 → 0.59 (rejected)
- '87 Jenks' vs '68 Jenks':            0.90 → 0.61 (rejected)
- NRL-01 vs NRL-02:                    0.83 → 0.53 (rejected)
- 'Loc 2 - 735 Bunola' vs 'Loc 2 735 Bunola Rd':  1.00 (still matches)
- punctuation-only difference:          1.00 (still matches)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

This commit is contained in:

serversdown

2026-05-15 04:10:48 +00:00

parent ba1f28ee53

commit ad55d4ca09

2 changed files with 65 additions and 3 deletions

									
										backend/routers/metadata_backfill.py
									
		+5
		-1
	
												View File
												
				@@ -376,7 +376,11 @@ def locations_search(

				        if q_norm in l_norm:

				            scored.append((l, 1.0))

				            continue

				        score = svc.similarity(q_norm, l_norm)

				        # Use the location-specific scorer (token_set_ratio + multi-digit

				        # penalty) instead of WRatio — same reason as the cluster-match

				        # path: location names share too much boilerplate vocabulary for

				        # WRatio to discriminate reliably.

				        score = svc.location_similarity(q_norm, l_norm)

				        if score >= 0.50:

				            scored.append((l, score))