3

We want to identify address fields in a document. To do this, we converted the document to text using Tesseract OCR. From the Tesseract output we want to check whether a string contains an address field or not. What is the right strategy for this problem?

  1. It's not possible to solve this with regex, because address formats differ across documents and countries.
  2. We tried NLTK for classifying the words, but it does not work well for address fields.

Required output

I am staying at 234 23 Philadelphia - Contains address fields <234 23 Philadelphia>

I am looking for a place to stay - Does not contain an address

Please provide your suggestions for solving this problem.

Nithin
  • 9,661
  • 14
  • 44
  • 67

5 Answers

4

As in many ML problems, there are multiple possible solutions, and the important part (and commonly the one with the greater impact) is not which algorithm or model you use, but feature engineering, data preprocessing, standardization, and things like that. The first solution that comes to my mind (and it's just an idea; I would test it and see how it performs) is:

  1. Take your training set examples and list the "N" most commonly used words across all examples (that's your vocabulary). Each of these "N" most used words is represented by a number (its list index).
  2. Transform your training examples: read each example and change its representation, replacing every word with its number in the vocabulary.
  3. Finally, for every training example create a feature vector of the same size as the vocabulary; for each word in the vocabulary, the entry is 0 (the word doesn't appear in the example) or 1 (it does), or the count of how many times the word appears (again, this is feature engineering).
  4. Train multiple classifiers, varying algorithms, parameters, training set sizes, etc., and use cross-validation to choose your best model.

And from there keep the standard ML workflow...
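The steps above can be sketched with scikit-learn, whose CountVectorizer builds exactly this kind of most-frequent-words vocabulary and count vectors. This is a minimal illustration with a tiny placeholder dataset, not the answerer's actual code:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder training data: 1 = contains an address, 0 = does not.
texts = [
    "I am staying at 234 23 Philadelphia",
    "Send the parcel to 10 Downing Street London",
    "I am looking for a place to stay",
    "Let us meet for coffee tomorrow",
]
labels = [1, 1, 0, 0]

# Steps 1-3: build the vocabulary of the N most frequent words and
# turn every example into a count vector over that vocabulary.
vectorizer = CountVectorizer(max_features=1000)
X = vectorizer.fit_transform(texts)

# Step 4: train a classifier and cross-validate (cv=2 only because the
# toy dataset is tiny; use more folds and more data in practice).
clf = LogisticRegression()
scores = cross_val_score(clf, X, labels, cv=2)
print("CV accuracy:", scores.mean())
```

With real data you would also compare different classifiers and vectorizer settings (binary flags vs. counts, n-grams, vocabulary size) as step 4 suggests.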

Luis Leal
  • 3,388
  • 5
  • 26
  • 49
  • I don't think vectorizing the city names or location names is a good idea. The BoW model you describe assumes you know the vocabulary beforehand, and in this case no vocabulary could contain all the city names, state names, or ZIP code formats. – Pranzell May 08 '20 at 05:49
4

If you are only interested in a YES/NO check, and not in extracting the complete address, one simple solution is NER (named-entity recognition).

You can check whether the text contains a location entity or not.

For Example :

import nltk

# Requires the following NLTK data packages (download once):
# nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("maxent_ne_chunker"), nltk.download("words")

def check_location(text):
    # Tokenize, POS-tag, then chunk named entities; GPE (geo-political)
    # and GSP (geo-socio-political) chunks indicate locations.
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text))):
        if hasattr(chunk, "label") and chunk.label() in ("GPE", "GSP"):
            return True
    return False

text = "I am staying at 234 23 Philadelphia."
print(text + " - " + str(check_location(text)))

text = "I am looking for a place to stay."
print(text + " - " + str(check_location(text)))

Output:

# I am staying at 234 23 Philadelphia. - True 
# I am looking for a place to stay. - False

If you want to extract complete address as well, you will need to train your own model.

You can check: NER with NLTK, CRF++.

RAVI
  • 3,143
  • 4
  • 25
  • 38
3

You're right. Using regex to find an address in a string is messy.

There are APIs that will attempt to extract addresses for you. They are not guaranteed to extract an address from every string, but they will do their best. One example of a street address extraction API is from SmartyStreets. Documentation here and demo here.

Something to consider is that even your example (I am staying at 234 23 Philadelphia) doesn't contain a full address; it's missing a state or ZIP code field. This makes it very difficult to programmatically determine whether there is an address. Once a state or ZIP code is added to that sample string (I am staying at 234 23 Philadelphia PA), it becomes much easier to programmatically determine that the string contains an address.

Disclaimer: I work for SmartyStreets

camiblanch
  • 3,866
  • 2
  • 19
  • 31
0

A better method for this task could be as follows:

  1. Train your own custom NER model (extending a pre-trained SpaCy model, or building your own CRF++ / CRF-biLSTM model if you have annotated data), or use pre-trained models like SpaCy's large model, geopandas, etc.

  2. Define a weighted score mechanism based on your problem statement. For example, let's assume every address has 3 important components: a street address, a telephone number, and an email id. Text that has all three of them would get a score of 33.33% + 33.33% + 33.33% = 100%.

  3. To identify whether it's an address field or not, you may take into account the percentage of SpaCy's location tags (GPE, FAC, LOC, etc.) out of the total tokens in the text, which gives a good estimate of how many location tags are present. Then run a regex for postal codes and match the found city names against the 3-4 words just before the found postal code; if there's an overlap, you have correctly identified a postal code and hence an address field (that's your 33.33% score!).

  4. For telephone numbers, certain checks and regexes could do it, but an important criterion is to perform these phone checks only if an address field was located in the preceding text.

  5. For emails/web addresses you could again perform nominal regex checks, and finally add all three scores into a cumulative value.

  6. An ideal address would get a score of 100, while missing fields would yield 66.66%, etc. The rest of the text would get a score of 0.
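A minimal sketch of this weighted-score scheme, with the NER step stubbed out by a small known-locations set (in practice you would use SpaCy's GPE/FAC/LOC tags as described above; the city list, phone regex, and email regex here are illustrative assumptions):

```python
import re

# Stand-in for NER location tags; a real system would use SpaCy here.
KNOWN_CITIES = {"philadelphia", "london"}

def address_score(text):
    score = 0.0
    tokens = [t.strip(".,") for t in text.lower().split()]
    # Address component: a known location name plus a nearby number.
    has_location = any(t in KNOWN_CITIES for t in tokens)
    has_number = bool(re.search(r"\b\d{1,5}\b", text))
    if has_location and has_number:
        score += 33.33
    # Telephone component: only checked once an address was found,
    # as step 4 requires.
    if score > 0 and re.search(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b", text):
        score += 33.33
    # Email component.
    if re.search(r"[\w.+-]+@[\w-]+\.[\w.]+", text):
        score += 33.33
    return score

print(address_score("I am staying at 234 23 Philadelphia"))
print(address_score("Reach me at 234 23 Philadelphia, 215-555-0100, a@b.com"))
print(address_score("I am looking for a place to stay"))
```

The three prints illustrate the three score tiers: address only, all three components, and no address at all.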

Hope it helped! :)

Pranzell
  • 2,275
  • 16
  • 21
-5

Why do you say regular expressions won't work?

Basically, define all the different forms of address you might encounter as regular expressions. Then just match against those expressions.
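As a minimal illustration of what this answer proposes: one pattern per expected address form. The pattern below is hypothetical and only covers "<number(s)> <city>" strings like the question's example; real-world coverage would need many more patterns, which is what the downvotes and comment object to.

```python
import re

# Hypothetical pattern: one or two house/unit numbers followed by a
# capitalized place name, e.g. "234 23 Philadelphia".
US_STREET = re.compile(r"\b\d{1,5}(?:\s\d{1,5})?\s+[A-Z][a-z]+\b")

def find_address(text):
    match = US_STREET.search(text)
    return match.group(0) if match else None

print(find_address("I am staying at 234 23 Philadelphia"))
print(find_address("I am looking for a place to stay"))
```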

Tomasz R
  • 300
  • 1
  • 11
  • There couldn't be enough regexes for all the address fields in the world, with varying ZIP code formats, city names, and abbreviations that might mean something else in other countries. Only an ML solution is likely to help on this one. – Pranzell May 08 '20 at 05:48