
I am working on an address-parsing project where I need to detect the various components of an address, such as city, state, postal_code, street_no, etc.

I wrote a regular expression to extract the postal codes that handles the different ways users enter them.

import re

sample_add = "16th main road btm layout 560029 5-6-00-76 56 00 78 560-029 25 -000-1"
regexp = re.compile(r"(\d)[ -]*?(\d)[ -]*?(\d)[ -]*?(\d)[ -]*?(\d)[ -]*?(\d)")
# findall returns a tuple of the six captured digits per match, so join each tuple
print(["".join(digits) for digits in regexp.findall(sample_add)])

Output :- ['560029', '560076', '560078', '560029', '250001']

This identifies the postal codes in such addresses. However, when an address like the following comes in, it combines the street numbers and interprets them as the postal code:

Ex. `sample_add_2 = "House no 323/46 16th main road, btm layout, bengaluru 560029"`

In this case, the postal code is identified as 323461, while the correct one should have been 560029.

adiga
Piyush
  • Question has nothing to do with `machine-learning` - kindly do not spam the tag (removed & added `regex` & `python`). – desertnaut Jan 11 '19 at 13:13
  • It is basically part of the data preprocessing for a machine learning project, where I have a labelled dataset of addresses that I use to train my model to predict the components of new addresses – Piyush Jan 11 '19 at 13:15
  • The fact that one may need help in debugging, say, a sorting algorithm to be subsequently used in a spaceship does not justify the question as being about `space-engineering`... – desertnaut Jan 11 '19 at 13:16

1 Answer

If I understood correctly, we are searching for a 6-digit number that can include some delimiters such as - or a space, but not \. This should handle it (if not, please explain your desired outcome):

\b(\d[\- ]*){6}\b(?<! )

https://regex101.com/r/wxYgwr/3
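For use from Python, the pattern above could be applied roughly as follows. This is only a sketch: the helper name `find_postal_codes` is made up for illustration, and it assumes you want the separators stripped from each match. `finditer` is used rather than `findall` because the pattern has a repeated capture group, so `findall` would return only the last digit group of each match.

```python
import re

# Pattern from this answer: six digits, each optionally followed by spaces or
# hyphens, between word boundaries; the lookbehind rejects a trailing space.
POSTAL_RE = re.compile(r"\b(\d[\- ]*){6}\b(?<! )")


def find_postal_codes(address):
    """Return candidate 6-digit codes with separators removed (hypothetical helper)."""
    # group(0) is the whole match, e.g. "5-6-00-76"; strip spaces and hyphens from it.
    return [re.sub(r"[ -]", "", m.group(0)) for m in POSTAL_RE.finditer(address)]


sample_add = "16th main road btm layout 560029 5-6-00-76 56 00 78 560-029 25 -000-1"
sample_add_2 = "House no 323/46 16th main road, btm layout, bengaluru 560029"

print(find_postal_codes(sample_add))    # ['560029', '560076', '560078', '560029', '250001']
print(find_postal_codes(sample_add_2))  # ['560029']
```

Digits separated only by spaces or hyphens can still be merged into one candidate, so further constraints may be needed depending on the data.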

Superluminal
  • If you give us more information about what should and should NOT be matched, and what the inspected documents look like, we could offer you a better regex. As it stands, it is too general and can match undesired parts. – Superluminal Jan 11 '19 at 14:17