2

I am training a text classifier for addresses, i.e., to decide whether a given sentence is an address or not.

Sentence examples:
(1) Mirdiff City Centre, DUBAI United Arab Emirates 
(2) Ultron Inc. <numb> Toledo Beach Rd #1189 La Salle, MI <numb>
(3) Avenger - HEAD OFFICE P.O. Box <numb> India

As addresses can take countless forms, it is very difficult to build such a classifier. Is there any pre-trained model or database for this, or any other non-ML approach?

hR 312
  • Are you trying to classify if something has the form of an address, such as `42 wallaby way Sydney`, or if this address actually exists in the real world? – chefhose Jan 20 '20 at 11:48
  • Actual addresses that exist in the real world – hR 312 Jan 20 '20 at 13:17
  • Take a look at `https://smartystreets.com/articles/does-google-parse-standardize.` – ShpielMeister Jan 25 '20 at 06:22
  • Do you have a dataset that needs to be classified? If you can upload it (or at least part of it) somewhere, we would have a better idea of what kind of data we are dealing with. Also we can test the accuracy of our proposed methods. – LoMaPh Jan 25 '20 at 21:19

4 Answers

2

As mentioned earlier, verifying that an address is valid is probably better formalized as an information-retrieval problem (e.g. using a service) than as a machine-learning problem.

However, from the examples you gave, it seems like you have several entity types that recur, such as organizations and locations.

I'd recommend enriching the data with an NER, such as spaCy, and using the entity types either as features or in a rule.

Note that named-entity recognizers rely more on context than the typical bag-of-words classifier, and are usually more robust to unseen data.
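A minimal sketch of the feature/rule idea, assuming entities have already been extracted as `(text, label)` pairs (e.g. with spaCy via `[(ent.text, ent.label_) for ent in nlp(sentence).ents]`); the function names and the toy rule are illustrative, not a recommended production setup:

```python
# Sketch: turning NER output into classifier features or a simple rule.
# Assumes entities come in as (text, label) pairs, e.g. from spaCy.

def entity_features(entities):
    """Count entity types relevant to addresses (GPE/LOC/ORG/FAC are spaCy labels)."""
    labels = [label for _, label in entities]
    return {
        "n_gpe": labels.count("GPE"),  # countries, cities, states
        "n_loc": labels.count("LOC"),  # non-GPE locations
        "n_org": labels.count("ORG"),  # companies, institutions
        "n_fac": labels.count("FAC"),  # buildings, airports, etc.
    }

def looks_like_address(entities):
    """Toy rule: at least one geographic/facility entity present."""
    f = entity_features(entities)
    return f["n_gpe"] + f["n_loc"] + f["n_fac"] > 0

# Pre-extracted entities for sentence (1) from the question:
ents = [("Mirdiff City Centre", "FAC"), ("DUBAI", "GPE"),
        ("United Arab Emirates", "GPE")]
print(entity_features(ents))
print(looks_like_address(ents))
```

The feature dictionary can feed a downstream classifier directly, while the rule version is useful as a cheap baseline.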

Uri Goren
0

When I did this the last time, the problem was very hard, especially since I had international addresses and the variation across countries is enormous. Add to that the variation introduced by people, and the problem becomes quite hard even for humans.

I finally built a heuristic (does it contain something like "PO BOX", a likely country name (grepped from Wikipedia), maybe city names) and then threw every remaining maybe-address into the Google Maps API. Google Maps is quite good at recognizing addresses, but even that will have false positives, so manual checking will most likely be needed.
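A rough sketch of such a keyword heuristic as a pre-filter; the hint words, the country list, and the house-number pattern below are illustrative stand-ins for the lists described above, not the original ones:

```python
import re

# Illustrative address-hint keywords (PO Box, street suffixes, unit markers).
ADDRESS_HINTS = re.compile(
    r"\b(p\.?\s?o\.?\s?box|suite|ste\.?|apt\.?|avenue|ave\.?|street|"
    r"road|rd\.?|blvd\.?|floor)\b", re.IGNORECASE)

# In practice this would be a full country list grepped from Wikipedia.
COUNTRIES = {"india", "united arab emirates", "usa", "germany"}

def maybe_address(sentence):
    """Cheap pre-filter; survivors get sent to a geocoding API for real validation."""
    lower = sentence.lower()
    if ADDRESS_HINTS.search(lower):
        return True
    if any(c in lower for c in COUNTRIES):
        return True
    # A house-number-like token followed by a capitalized word is another weak hint.
    return bool(re.search(r"\b\d{1,5}\s+[A-Z][a-z]+", sentence))

print(maybe_address("Avenger - HEAD OFFICE P.O. Box 123 India"))
print(maybe_address("The meeting is at noon tomorrow"))
```

Sentences that pass the filter would then be geocoded (e.g. via the Google Maps Geocoding API) to confirm they are real addresses.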

I did not use ML because my address DB was "large" but not large enough for training; in particular, we lacked labeled training data.

Christian Sauer
0

As you are asking for literature recommendations (btw, this question is probably too broad for this site), I can recommend these links: https://www.reddit.com/r/datasets/comments/4jz7og/how_to_get_a_large_at_least_100k_postal_address/

https://www.red-gate.com/products/sql-development/sql-data-generator/

https://openaddresses.io/

You need to build labeled data, as @Christian Sauer has already mentioned, where you have examples with addresses. And you probably need to make negative examples with non-addresses as well! So, for example, you could make sentences with only telephone numbers or the like. Either way, this will be quite an imbalanced dataset, as you will have a lot of correct addresses and only a few that are not addresses. In total you would need around 1000 examples as a starting point.

Another option is to identify some basic addresses manually and do a similarity analysis to find the sentences closest to them.
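A minimal sketch of that similarity idea using the standard library's `difflib.SequenceMatcher`; the seed addresses and the threshold are illustrative assumptions:

```python
import difflib

# A few manually identified seed addresses (illustrative examples).
SEED_ADDRESSES = [
    "Ultron Inc. 1189 Toledo Beach Rd La Salle, MI",
    "Mirdiff City Centre, Dubai, United Arab Emirates",
]

def address_similarity(sentence):
    """Highest character-level similarity (0..1) to any seed address."""
    return max(
        difflib.SequenceMatcher(None, sentence.lower(), seed.lower()).ratio()
        for seed in SEED_ADDRESSES
    )

def probably_address(sentence, threshold=0.6):
    # The threshold is a guess; tune it on labeled examples.
    return address_similarity(sentence) >= threshold

print(address_similarity("Mirdiff City Centre, DUBAI United Arab Emirates"))
print(probably_address("I will call you tomorrow afternoon"))
```

For larger datasets, a TF-IDF vectorizer with cosine similarity would scale better than pairwise character matching, but the principle is the same.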

PV8
0

As mentioned by Uri Goren, this is a named entity recognition problem. There are a lot of trained models on the market; still, one of the best you can get is the Stanford NER.

https://nlp.stanford.edu/software/CRF-NER.shtml It is a conditional random field (CRF) NER, available in Java.

If you are looking for a Python implementation of the same, have a look at: How to install and invoke Stanford NERTagger?

Here you can gather info from sequences of entity tags (e.g. LOCATION, ORGANIZATION). Even if it doesn't give you exactly the right spans, it will still get you closer to any address in the whole document. That's a head start.

Thanks.