I'm unfortunately in a position where I need to parse out individual address elements from data that was originally entered into a single field in a legacy database. Now I'm attempting to normalize this data (for historical reporting and such) for all future development projects in our company by splitting the information into the appropriate fields (street address, city, state, ZIP code). However, some of the addresses I have to deal with are nothing but legal descriptions of the property - e.g., Lot 1, Block 1, Somecity, ST, 74999
or S1-T1-R1, Anothercity, St, 74998
I believe I've actually gotten most of the standard address information parsing working fairly well, but there's one piece that I'm not sure how to accomplish, mainly because of my relative unfamiliarity with RegEx. What I'd like to do is extract out these legal descriptions before attempting to parse any other information from the field's remaining value. I know what my pattern will "look like", but I'm not sure how to construct the actual RegEx pattern.
I know that I'll be working with some capture groups, but I'm totally lost on how to set this up to match what I want to see. I've tried to play with something like ^((L(\w\s)*\d+(\w*\d*\s*)?)(B(\w\s)*\d+(\w*\d*\s*)?)?\,)
, but that's obviously not it (see my testing on Regexr). I simply don't understand the RegEx syntax well enough yet to know what I'm doing "wrong".
Here are the pattern rules I want to employ with RegEx. I realize that these will likely be two separate "passes" on the data - once to look for the LOT/BLOCK notation, then again for the SECTION/TOWNSHIP/RANGE notation - but that's totally fine. I also understand that nothing will be absolutely 100% accurate in picking up the data and I'll be dealing with some "false" matches, or even "missed" matches that will have to be dealt with manually, but getting this RegEx to extract what I can will be a tremendous head start on normalizing my data. At this point, I'm simply trying to eliminate any "extra" elements from the address before parsing the rest of the data.
The RegEx should only find these patterns at the very beginning of a trimmed string (no leading or trailing spaces) as the legal description - at least in this particular database - will not be found "later" in the address string than at the very beginning.
MATCHING RULES - LOT/BLOCK LEGAL DESCRIPTION
- The first character of the trimmed, uppercase string is "
L
"- In this order, the "
L
" is followed by:- Zero or more alpha characters or spaces (e.g., "
LOT
", "LOTS
", "LT
", etc.) - Must include one or more numeric values
- Zero or more alpha characters or spaces (e.g., "
- There may be another number, a comma, dash or space after the first numeric character (or some combination of these characters)
- In this order, the "
- The above may be followed by a "
B
" character- In this order, the "
B
" is followed by:- Zero or more alpha characters or spaces (e.g., "
BLK
", "BLOCK
", "BLCK
", etc.) - Must include one or more numeric values
- Zero or more alpha characters or spaces (e.g., "
- In this order, the "
- The legal description should terminate either with a comma (indicating the next address element) or at the end of the string. Any characters between the "BLOCK" group and this termination should be included in the legal description (a trailing comma should not be included).
MATCHING RULES - SECTION/TOWNSHIP/RANGE LEGAL DESCRIPTION
- The first character of the trimmed, uppercase string is "
S
"- In this order, the "
S
" will be followed by:- Zero or more alpha characters or spaces (e.g., "
SECTION
", "SEC
", "SECT
", etc.) - One or more numeric values
- Zero or more alpha characters or spaces (e.g., "
- There may be another number, a comma, dash or space after the first numeric character
- In this order, the "
- The above may be followed by a "
T
" character- If present, in this order, the "
T
" will be followed by:- Zero or more alpha characters or spaces (e.g., "
TWNSHP
", "TOWN
", "TWN
", etc.) - Must include one or more numeric values
- Zero or more alpha characters or spaces
- Zero or more alpha characters or spaces (e.g., "
- If the "
T
" character is not present and the string is not terminated (see below), the next character to find should be a number
- If present, in this order, the "
- The above may be followed by an "
R
" character- In this order, the "
R
" will be followed by:- Zero or more alpha characters or spaces (e.g., "
RANGE
", "RNG
", "RG
", etc.) - Must include one or more numeric values
- Zero or more alpha characters or spaces
- Zero or more alpha characters or spaces (e.g., "
- If the "
R
" character is not present and the string is not terminated (see below), the next character to find should be a number
- In this order, the "
- The legal description should terminate either with a comma (indicating the next address element) or at the end of the string. Any characters between the "RANGE" group and this termination should be included in the legal description (a trailing comma should not be included).
Here is some (slightly obfuscated) sample data I'm working with:
SAMPLE DATA - LOT/BLOCK LEGAL DESCRIPTION
1. L2 B64,SOMECITY ,OK,74999
2. L2 B4 RHODA 1 ADDN,SOMECITY,OK,74999 - 81 HILLCREST MH
3. L20-22 B10 LETCHERS ADDN,SOMECITY,OK,74999
4. L2 S10-13-18,SOMECITY,OK,74999
5. L23&24 B10, SOMECITY, OK, 74999
6. L21 OAKMONT ADD, SOMECITY OK, 74999
7. L24-30 B42,SOMECITY,OK,74999
8. L24 DOGWOOD ESTATES
9. L27 B2 LAKE RHONDA, 82 SKYLINE MH
10. L3 B2 STONEBROOK II ADDN, SOMECITY, OK, 74999
11. LOT 22 BLOCK 1 OF BEACON,SOMECITY,OK,74999
12. LOT 44-45 WILLIAM DR,SOMECITY,OK,74999
13. L 10,11 B17,SOMECITY,OK
14. L 8 B 4 HISEL EST.
The results should be:
1. L2 B64
2. L2 B4 RHODA 1 ADDN
3. L20-22 B10 LETCHERS ADDN
4. L2 S10-13-18
5. L23&24 B10
6. L21 OAKMONT ADD
7. L24-30 B42
8. L24 DOGWOOD ESTATES
9. L27 B2 LAKE RHONDA
10. L3 B2 STONEBROOK II ADDN
11. LOT 22 BLOCK 1 OF BEACON
12. LOT 44-45 WILLIAM DR
13. L 10,11 B17
14. L 8 B 4 HISEL EST.
SAMPLE DATA - SECTION/TOWNSHIP/RANGE LEGAL DESCRIPTION
1. S18-31-21,ANOTHERCITY,OK,74998
2. S2 T6N R1E INDIAN MERIDIAN CLEVELAND CO
3. S20-T12N-R16E-MCINTOSH CO,ANOTHERCITY,OK,74998
4. S20,T12N,R19E,MUSKOGEE CO
5. S2,T15,R25, 86 REGAL 14X60 MH CFC2086F4KL100000
6. S18-31-21,ANOTHERCITY,OK,74998
7. S21 T6N R1E
8. S21 T6N R1E,ANOTHERCITY,OK,74998
9. SEC 33 TWP 19 RNG 19-BARN,ANOTHERCITY,OK,74998
10. SECT 1-22-22 METAL BARN,ANOTHERCITY,OK,74998
The results should be:
1. S18-31-21
2. S2 T6N R1E INDIAN MERIDIAN CLEVELAND CO
3. S20-T12N-R16E-MCINTOSH CO
4. S20,T12N,R19E
5. S2,T15,R25
6. S18-31-21
7. S21 T6N R1E
8. S21 T6N R1E
9. SEC 33 TWP 19 RNG 19-BARN
10. SECT 1-22-22 METAL BARN
If you require any additional information, please feel free to let me know. As I stated above, I realize that RegEx isn't going to be a 100% accurate "cure-all" for what I need to do, but I'm really hoping that it will at least give me a significant head start so I don't have to clean up hundreds of thousands of records by hand.