3

I'm unfortunately in a position where I need to parse out individual address elements from data that was originally entered into a single field in a legacy database. Now I'm attempting to normalize this data (for historical reporting and such) for all future development projects in our company by splitting the information into the appropriate fields (street address, city, state, ZIP code). However, some of the addresses I have to deal with are nothing but legal descriptions of the property - e.g., Lot 1, Block 1, Somecity, ST, 74999 or S1-T1-R1, Anothercity, St, 74998

I believe I've actually gotten most of the standard address information parsing working fairly well, but there's one piece that I'm not sure how to accomplish, mainly because of my relative unfamiliarity with RegEx. What I'd like to do is extract out these legal descriptions before attempting to parse any other information from the field's remaining value. I know what my pattern will "look like", but I'm not sure how to construct the actual RegEx pattern.

I know that I'll be working with some capture groups, but I'm totally lost on how to set this up to match what I want to see. I've tried to play with something like ^((L(\w\s)*\d+(\w*\d*\s*)?)(B(\w\s)*\d+(\w*\d*\s*)?)?\,), but that's obviously not it (see my testing on Regexr). I simply don't understand the RegEx syntax well enough yet to know what I'm doing "wrong".

Here are the pattern rules I want to employ with RegEx. I realize that these will likely be two separate "passes" on the data - once to look for the LOT/BLOCK notation, then again for the SECTION/TOWNSHIP/RANGE notation - but that's totally fine. I also understand that nothing will be absolutely 100% accurate in picking up the data and I'll be dealing with some "false" matches, or even "missed" matches that will have to be dealt with manually, but getting this RegEx to extract what I can will be a tremendous head start on normalizing my data. At this point, I'm simply trying to eliminate any "extra" elements from the address before parsing the rest of the data.

The RegEx should only find these patterns at the very beginning of a trimmed string (no leading or trailing spaces) as the legal description - at least in this particular database - will not be found "later" in the address string than at the very beginning.


MATCHING RULES - LOT/BLOCK LEGAL DESCRIPTION

  • The first character of the trimmed, uppercase string is "L"
    • In this order, the "L" is followed by:
      • Zero or more alpha characters or spaces (e.g., "LOT", "LOTS", "LT", etc.)
      • Must include one or more numeric values
    • There may be another number, a comma, dash or space after the first numeric character (or some combination of these characters)
  • The above may be followed by a "B" character
    • In this order, the "B" is followed by:
      • Zero or more alpha characters or spaces (e.g., "BLK", "BLOCK", "BLCK", etc.)
      • Must include one or more numeric values
  • The legal description should terminate either with a comma (indicating the next address element) or at the end of the string. Any characters between the "BLOCK" group and this termination should be included in the legal description (a trailing comma should not be included).

MATCHING RULES - SECTION/TOWNSHIP/RANGE LEGAL DESCRIPTION

  • The first character of the trimmed, uppercase string is "S"
    • In this order, the "S" will be followed by:
      • Zero or more alpha characters or spaces (e.g., "SECTION", "SEC", "SECT", etc.)
      • One or more numeric values
    • There may be another number, a comma, dash or space after the first numeric character
  • The above may be followed by a "T" character
    • If present, in this order, the "T" will be followed by:
      • Zero or more alpha characters or spaces (e.g., "TWNSHP", "TOWN", "TWN", etc.)
      • Must include one or more numeric values
      • Zero or more alpha characters or spaces
    • If the "T" character is not present and the string is not terminated (see below), the next character to find should be a number
  • The above may be followed by an "R" character
    • In this order, the "R" will be followed by:
      • Zero or more alpha characters or spaces (e.g., "RANGE", "RNG", "RG", etc.)
      • Must include one or more numeric values
      • Zero or more alpha characters or spaces
    • If the "R" character is not present and the string is not terminated (see below), the next character to find should be a number
  • The legal description should terminate either with a comma (indicating the next address element) or at the end of the string. Any characters between the "RANGE" group and this termination should be included in the legal description (a trailing comma should not be included).

Here is some (slightly obfuscated) sample data I'm working with:


SAMPLE DATA - LOT/BLOCK LEGAL DESCRIPTION

1.    L2 B64,SOMECITY ,OK,74999
2.    L2 B4 RHODA 1 ADDN,SOMECITY,OK,74999 - 81 HILLCREST MH
3.    L20-22 B10 LETCHERS ADDN,SOMECITY,OK,74999
4.    L2 S10-13-18,SOMECITY,OK,74999
5.    L23&24 B10, SOMECITY, OK, 74999
6.    L21 OAKMONT ADD, SOMECITY OK, 74999
7.    L24-30 B42,SOMECITY,OK,74999
8.    L24 DOGWOOD ESTATES
9.    L27 B2 LAKE RHONDA, 82 SKYLINE MH
10.   L3 B2 STONEBROOK II ADDN, SOMECITY, OK, 74999
11.   LOT 22 BLOCK 1 OF BEACON,SOMECITY,OK,74999
12.   LOT 44-45 WILLIAM DR,SOMECITY,OK,74999
13.   L 10,11 B17,SOMECITY,OK
14.   L 8 B 4 HISEL EST.

The results should be:

1.    L2 B64
2.    L2 B4 RHODA 1 ADDN
3.    L20-22 B10 LETCHERS ADDN
4.    L2 S10-13-18
5.    L23&24 B10
6.    L21 OAKMONT ADD
7.    L24-30 B42
8.    L24 DOGWOOD ESTATES
9.    L27 B2 LAKE RHONDA
10.   L3 B2 STONEBROOK II ADDN
11.   LOT 22 BLOCK 1 OF BEACON
12.   LOT 44-45 WILLIAM DR
13.   L 10,11 B17
14.   L 8 B 4 HISEL EST.

SAMPLE DATA - SECTION/TOWNSHIP/RANGE LEGAL DESCRIPTION

1.    S18-31-21,ANOTHERCITY,OK,74998
2.    S2 T6N R1E INDIAN MERIDIAN CLEVELAND CO
3.    S20-T12N-R16E-MCINTOSH CO,ANOTHERCITY,OK,74998
4.    S20,T12N,R19E,MUSKOGEE CO
5.    S2,T15,R25, 86 REGAL 14X60 MH CFC2086F4KL100000
6.    S18-31-21,ANOTHERCITY,OK,74998
7.    S21 T6N R1E
8.    S21 T6N R1E,ANOTHERCITY,OK,74998
9.    SEC 33 TWP 19 RNG 19-BARN,ANOTHERCITY,OK,74998
10.   SECT 1-22-22 METAL BARN,ANOTHERCITY,OK,74998

The results should be:

1.    S18-31-21
2.    S2 T6N R1E INDIAN MERIDIAN CLEVELAND CO
3.    S20-T12N-R16E-MCINTOSH CO
4.    S20,T12N,R19E
5.    S2,T15,R25
6.    S18-31-21
7.    S21 T6N R1E
8.    S21 T6N R1E
9.    SEC 33 TWP 19 RNG 19-BARN
10.   SECT 1-22-22 METAL BARN

If you require any additional information, please feel free to let me know. As I stated above, I realize that RegEx isn't going to be a 100% accurate "cure-all" for what I need to do, but I'm really hoping that it will at least give me a significant head start so I don't have to clean up hundreds of thousands of records by hand.

G_Hosa_Phat
  • 976
  • 2
  • 18
  • 38
  • These legal descriptions. First of all, you haven't provided concrete sequential information. Part of the sequence should identify _parallel_ requirements. You can't list `Zero or more alpha characters (e.g., "LOT", "LOTS", "LT", etc.)` followed by `Zero or more spaces`, these are parallel requirements that are OR's in combination. Try to structure the requirements so that they leave _no doubt_ as to the position sequence and it's optional state. –  Jul 24 '20 at 18:57
  • Thank you for requesting clarification, @Maxt8r. Perhaps this would be stated more clearly as `Zero or more alpha characters and/or spaces`, as it's possible that neither of those character types may be present in that particular position (as indicated by the test data I posted)? The *important* bit there, I suppose, would be the fact that there ***must*** be a numeric value following the "indicator" (*e.g.*, "`L`", "`B`", etc.) – G_Hosa_Phat Jul 24 '20 at 19:01
  • 1
    It's `Zero or more alpha characters OR spaces` ie: `[a-z ]*` To use _and_ with it implies the inclusion of both, 0 or more times. –  Jul 24 '20 at 19:06
  • Ahh. I understand what you're saying and I apologize for the miscommunication on my part. You are correct. Any, all, or none of these characters *may* be included after the "indicator". I will attempt to update my rules section to clarify this. I've also added some additional examples for these other conditions. – G_Hosa_Phat Jul 24 '20 at 19:14
  • 1
    E.g., `S\d+(?:-| |,)(?:T\d+\w*|\d+-*\d*)(?:,R| R|-R)*(?:\d+\w*)(?:-| )?[A-Z ]*`? (just written here, based on the *instructions*, so...). The first *sample* appears to stop at the first comma. – Jimi Jul 24 '20 at 19:17
  • 1
    For the _LOT BLOCK_ I was going to say this `L[A-Z 0-9]*[0-9][A-Z 0-9-]*` but when I saw you matching `L23&24 B10` it dawned on me this is the regex to match what your sample says `L[^,]*` –  Jul 24 '20 at 19:19
  • Thank you, @Jimi. I've "tweaked" it just a little bit, but I think this might pretty much get me the SECTION/TOWNSHIP/RANGE values: `S[A-Z ]*\d+(?:-| |,)(?:T[A-Z ]*\d+\w*|\d+-*\d*)(?:,R| R|-R)*[A-Z ]*(?:\d+\w*)(?:-| )?[A-Z0-9 -]*` [S-T-R Legal Description](https://regexr.com/591vm). With a little modification, I *believe* I may have gotten it right for the LOT/BLOCK values as well: `L[A-Z ]*\d+(?:-| |,|&)(?:B[A-Z ]*\d+\w*|\d+-*\d*)?[A-Z0-9 -]*` [LOT/BLOCK Legal Description](https://regexr.com/591pv). – G_Hosa_Phat Jul 24 '20 at 20:03
  • Thank you again, @Maxt8r. I must've missed a couple of use-case scenarios in my OP. I've added a couple of examples from my production data that appear to make this a bit more challenging. Also, there are conditions where the simple `L[^,]*` actually causes some issues, picking up characters further into the string. While I could absolutely ignore those additional matches in my code, I believe what I have now (see my response to Jimi) may be more effective. – G_Hosa_Phat Jul 24 '20 at 20:13
  • 1
    Right, I didn't see the last 2 lines before. It would change just a bit, though: `S[A-Z]*\d*(?:-| |,)*(?:T\d*\w*|\d+-*\d*|\w* )*(?:,R| R|-R)*\d+\w*(?:-| )?[A-Z ]*`. It can be improved a lot though, there are a couple of sub-patterns... As it is, it's somewhat slow – Jimi Jul 24 '20 at 20:25
  • 1
    It doesn't matter you keep adding allowable characters to the set, the principle is the same, you always allow up to the first comma, that is the only constant boundary. Of course it has to be adjusted to only match at the BOL and not span lines: [(?<!\S)L\[^,\r\n\]*](https://regexr.com/5922d) –  Jul 24 '20 at 20:31

1 Answers1

0

Simply matching everything up to the first comma , produces the results you seek for 21 of the 23 examples: ^(?:(?!,).)*

const data = [
  {
    "inputted": "1.    L2 B64,SOMECITY ,OK,74999",
    "expected": "1.    L2 B64"
  },
  {
    "inputted": "2.    L2 B4 RHODA 1 ADDN,SOMECITY,OK,74999 - 81 HILLCREST MH",
    "expected": "2.    L2 B4 RHODA 1 ADDN"
  },
  {
    "inputted": "3.    L20-22 B10 LETCHERS ADDN,SOMECITY,OK,74999",
    "expected": "3.    L20-22 B10 LETCHERS ADDN"
  },
  {
    "inputted": "4.    L2 S10-13-18,SOMECITY,OK,74999",
    "expected": "4.    L2 S10-13-18"
  },
  {
    "inputted": "5.    L23&24 B10, SOMECITY, OK, 74999",
    "expected": "5.    L23&24 B10"
  },
  {
    "inputted": "6.    L21 OAKMONT ADD, SOMECITY OK, 74999",
    "expected": "6.    L21 OAKMONT ADD"
  },
  {
    "inputted": "7.    L24-30 B42,SOMECITY,OK,74999",
    "expected": "7.    L24-30 B42"
  },
  {
    "inputted": "8.    L24 DOGWOOD ESTATES",
    "expected": "8.    L24 DOGWOOD ESTATES"
  },
  {
    "inputted": "9.    L27 B2 LAKE RHONDA, 82 SKYLINE MH",
    "expected": "9.    L27 B2 LAKE RHONDA"
  },
  {
    "inputted": "10.   L3 B2 STONEBROOK II ADDN, SOMECITY, OK, 74999",
    "expected": "10.   L3 B2 STONEBROOK II ADDN"
  },
  {
    "inputted": "11.   LOT 22 BLOCK 1 OF BEACON,SOMECITY,OK,74999",
    "expected": "11.   LOT 22 BLOCK 1 OF BEACON"
  },
  {
    "inputted": "12.   LOT 44-45 WILLIAM DR,SOMECITY,OK,74999",
    "expected": "12.   LOT 44-45 WILLIAM DR"
  },
  {
    "inputted": "13.   L 10,11 B17,SOMECITY,OK",
    "expected": "13.   L 10,11 B17"
  },
  {
    "inputted": "14.   L 8 B 4 HISEL EST.",
    "expected": "14.   L 8 B 4 HISEL EST."
  },
  {
    "inputted": "1.    S18-31-21,ANOTHERCITY,OK,74998",
    "expected": "1.    S18-31-21"
  },
  {
    "inputted": "2.    S2 T6N R1E INDIAN MERIDIAN CLEVELAND CO",
    "expected": "2.    S2 T6N R1E INDIAN MERIDIAN CLEVELAND CO"
  },
  {
    "inputted": "3.    S20-T12N-R16E-MCINTOSH CO,ANOTHERCITY,OK,74998",
    "expected": "3.    S20-T12N-R16E-MCINTOSH CO"
  },
  {
    "inputted": "4.    S20,T12N,R19E,MUSKOGEE CO",
    "expected": "4.    S20,T12N,R19E"
  },
  {
    "inputted": "5.    S2,T15,R25, 86 REGAL 14X60 MH CFC2086F4KL100000",
    "expected": "5.    S2,T15,R25"
  },
  {
    "inputted": "6.    S18-31-21,ANOTHERCITY,OK,74998",
    "expected": "6.    S18-31-21"
  },
  {
    "inputted": "7.    S21 T6N R1E",
    "expected": "7.    S21 T6N R1E"
  },
  {
    "inputted": "8.    S21 T6N R1E,ANOTHERCITY,OK,74998",
    "expected": "8.    S21 T6N R1E"
  },
  {
    "inputted": "9.    SEC 33 TWP 19 RNG 19-BARN,ANOTHERCITY,OK,74998",
    "expected": "9.    SEC 33 TWP 19 RNG 19-BARN"
  },
  {
    "inputted": "10.   SECT 1-22-22 METAL BARN,ANOTHERCITY,OK,74998",
    "expected": "10.   SECT 1-22-22 METAL BARN"
  }
];

const regExp = /^(?:(?!,).)*/;

let matches = 0;
const total = data.length;
data.forEach(({ inputted: input, expected }, i) => {
  const match = input.match(regExp)?.[0];
  console.log(i + 1 + ') matched ', match);
  console.log((i < 10 ? '   ' : '    ') + 'expected', expected);
  matches += match == expected;
});
const successRate = ((matches / total) * 100).toFixed(2);
console.log(successRate + '% success rate');
GirkovArpa
  • 4,427
  • 4
  • 14
  • 43