fix(sfm): broaden Loc-N suffix regex to catch '.Loc' and 'Loc No.' variants

Operators use more separator variations than the original regex caught:

  - "Trumbull-Brayman-JV- Mont.Dam.Loc 2-R-25": period as separator
  - "CMU - RKM Hall - Loc No. 3 - 4615 Forbes": "No." between Loc and digit

Added a period to the separator character class and an optional "No." token before the digit. This catches both patterns above plus near-variants, without false positives on normal project strings.

Real-data impact: 5 more clusters now auto-strip cleanly, including the 1,903-event Trumbull-Brayman-JV- Mont.Dam cluster. Confidence distribution: 43 → 44 high.

@@ -104,15 +104,19 @@ def _normalise(s: Optional[str]) -> str:
 _PROJECT_LOC_SUFFIX = re.compile(
     r"""
     \s*                  # any leading whitespace
-    [-–—]                # hyphen or em-dash (separator before the Loc marker)
-    \s*                  # optional spaces
+    [-–—.]               # separator: hyphen, em-dash, or period
+                         # (operators use any of these — see
+                         # "Mont.Dam.Loc 2-R-25")
+    \s*
     (?:loc|location)     # 'Loc' or 'Location'
-    \.?                  # optional period
-    \s*                  # optional space
+    \.?                  # optional trailing period after Loc
+    \s*
+    (?:no\.?\s*)?        # optional "No." or "No " before the digit
+                         # (e.g. "Loc No. 3", "Loc No 5")
     \#?                  # optional '#'
-    \s*                  # optional space
+    \s*
     \d+                  # required digit
-    \b                   # word boundary
+    \b
     """,
     re.IGNORECASE | re.VERBOSE,
 )
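
A quick way to sanity-check the widened pattern is to run it against the two strings quoted in the commit message. The sketch below copies the regex from the new side of the hunk. The helper `find_loc_marker` and the "Local 12" negative sample are hypothetical, added only for illustration; how the module actually applies the pattern (search, substitution, or truncation) is not visible in this diff.

```python
import re

# Pattern copied from the new side of the hunk, comments trimmed.
_PROJECT_LOC_SUFFIX = re.compile(
    r"""
    \s*
    [-–—.]               # separator: hyphen, dash, or period
    \s*
    (?:loc|location)     # 'Loc' or 'Location'
    \.?
    \s*
    (?:no\.?\s*)?        # optional "No." or "No " before the digit
    \#?
    \s*
    \d+
    \b
    """,
    re.IGNORECASE | re.VERBOSE,
)


def find_loc_marker(project):
    """Hypothetical helper: return the matched Loc marker text, or None."""
    match = _PROJECT_LOC_SUFFIX.search(project)
    return match.group(0) if match else None


# Period as separator (first example from the commit message).
print(repr(find_loc_marker("Trumbull-Brayman-JV- Mont.Dam.Loc 2-R-25")))
# '.Loc 2'

# "No." between Loc and the digit (second example).
print(repr(find_loc_marker("CMU - RKM Hall - Loc No. 3 - 4615 Forbes")))
# ' - Loc No. 3'

# Assumed negative sample: "Local" starts with "Loc" but no digit follows it.
print(repr(find_loc_marker("Union Hall - Local 12 Office")))
# None
```

The match covers only the separator and the Loc marker itself, so any trailing text (such as "-R-25" or "- 4615 Forbes") has to be handled by whatever code consumes the match.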