
I have been looking at Stanford NER and am thinking of using its Java APIs to extract postal addresses from text documents. The input may be any document that contains a postal address section, e.g. utility bills or electricity bills.

The approach I have in mind is:

  1. Define a postal address as a named entity, built from LOCATION and other primitive named entities.
  2. Define segmentation and the other sub-processes.

I am trying to find an example pipeline for this (which steps are required, in detail). Has anyone done this before? Suggestions welcome.

yadab
  • Do you have a training set of addresses in text? – Gabor Angeli Dec 23 '15 at 02:39
  • @GaborAngeli Yes, I do have addresses in text for one country, but they are not properly labelled with zip, city, addressline1 and addressline2. – yadab Dec 23 '15 at 06:31
  • In that case, my recommendation is either to collect a dataset of addresses tagged in text and then train something like the Stanford NER system, or to build a heuristic rule-based system from a combination of Stanford NER and TokensRegexNER. – Gabor Angeli Dec 23 '15 at 07:37
  • @GaborAngeli I like the idea of tagging addresses in text. My question now is: should I divide addresses into multiple parts, e.g. {city, zip, line1, line2}, and somehow define a compound entity on top of the existing LOCATION named entity, or define address as a new entity with some loose structure? Any suggestions? – yadab Dec 24 '15 at 04:12
  • I'd imagine you can only win by separating the address into different components; it gives the hidden states of the sequence model more to work off of in terms of the structure of an address, and lets each class handle a narrower range of words. If nothing else, you can collapse the states easily and try it out. On the other hand, it's also more annotation effort, and quite possibly won't make a huge difference. How many sentences do you intend to tag? – Gabor Angeli Dec 24 '15 at 06:59
  • @GaborAngeli Thanks. Yes, separating looks promising, but each document will be ~100 lines long and the address spans multiple consecutive lines (1-6). Why do you say it "quite possibly won't make a huge difference"? – yadab Dec 31 '15 at 04:44
  • @GaborAngeli Also, could you write this up as an answer? Thanks. – yadab Dec 31 '15 at 04:45
  • @yadab I am trying to solve a similar problem and am training an NER model using spaCy. I need a labeled training set; if you have one, could you please share it with me? – Parvez Khan Jun 05 '18 at 09:48
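
The component-versus-single-label question discussed in the comments above can be tried out cheaply before committing to an annotation scheme. The following is a minimal sketch, not anyone's actual pipeline: it writes tokens in the two-column, tab-separated word/label layout that Stanford NER's CRF training can read (with a property such as `map = word=0,answer=1`), and shows how component labels can be collapsed into one label afterwards. The label names `LINE1`, `CITY`, `ZIP` and `ADDRESS`, the class name, and the sample address are all assumptions made for illustration. It uses records, so it needs Java 16 or later.

```java
import java.util.ArrayList;
import java.util.List;

public class AddressTrainingData {
    /** One annotated token: the surface word and its (hypothetical) label. */
    record LabeledToken(String word, String label) {}

    /** Gives every whitespace-separated token of a text span the same label. */
    static List<LabeledToken> label(String span, String label) {
        List<LabeledToken> out = new ArrayList<>();
        for (String word : span.trim().split("\\s+")) {
            out.add(new LabeledToken(word, label));
        }
        return out;
    }

    /** Collapses every non-"O" component label into a single ADDRESS label. */
    static List<LabeledToken> collapse(List<LabeledToken> tokens) {
        List<LabeledToken> out = new ArrayList<>();
        for (LabeledToken t : tokens) {
            String collapsed = t.label().equals("O") ? "O" : "ADDRESS";
            out.add(new LabeledToken(t.word(), collapsed));
        }
        return out;
    }

    /** Renders tokens as "word<TAB>label" lines, one token per line. */
    static String toTsv(List<LabeledToken> tokens) {
        StringBuilder sb = new StringBuilder();
        for (LabeledToken t : tokens) {
            sb.append(t.word()).append('\t').append(t.label()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up example text; "O" marks tokens outside any address.
        List<LabeledToken> tokens = new ArrayList<>();
        tokens.addAll(label("Bill to", "O"));
        tokens.addAll(label("12 Example Street", "LINE1"));
        tokens.addAll(label("Springfield", "CITY"));
        tokens.addAll(label("12345", "ZIP"));

        System.out.print(toTsv(tokens));           // fine-grained component labels
        System.out.println();
        System.out.print(toTsv(collapse(tokens))); // same tokens, single ADDRESS label
    }
}
```

Annotating with the fine-grained labels keeps both options open: one model can be trained on the component file and another on the collapsed file, and the two can be compared on held-out documents.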

1 Answer


To be clear: all credit goes to Raj Vardhan (and John Bauer) who had an interaction on the [java-nlp-user] mailing list.

Raj Vardhan wrote about the plan to work on "finding street address in a sentence":

Here is an approach I have thought of:

  1. Find the event-anchor in a sentence
  2. Select outgoing-edges in the SemanticGraph from that event-node with relations such as "prep-in" or "prep-at".
  3. IF the dependent value in the relation has POS tag as NNP:
     a) Find outgoing-edges from the dependent value's node with relations such as "nn".
     b) Connect all such nodes in increasing order of occurrence in the sentence.
     c) PRINT the resulting value as the Location where the event occurred.

This obviously relies on certain assumptions, such as a direct dependency between the event-anchor and the location in a sentence.
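
The steps above can be sketched on a small hand-built dependency graph. This is only an illustration under stated assumptions: `Node`, `Edge` and `EventLocationSketch` are hypothetical stand-ins rather than the CoreNLP `SemanticGraph` API, the relation names are copied as written in the quoted message, the example sentence is invented, and step 1 (finding the event-anchor) is assumed to have been done already, so the anchor is passed in. It needs Java 16 or later for records.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

public class EventLocationSketch {
    /** A token with its 1-based position in the sentence and its POS tag. */
    record Node(int index, String word, String pos) {}

    /** A directed dependency edge: governor -relation-> dependent. */
    record Edge(Node governor, String relation, Node dependent) {}

    static final Set<String> LOCATION_PREPS = Set.of("prep-in", "prep-at");

    /** Outgoing edges of a node whose relation is one of the given names. */
    static List<Edge> outgoing(List<Edge> graph, Node from, Set<String> relations) {
        return graph.stream()
                .filter(e -> e.governor().equals(from) && relations.contains(e.relation()))
                .collect(Collectors.toList());
    }

    /** Steps 2-3: collect location phrases attached to the event anchor. */
    static List<String> locations(List<Edge> graph, Node eventAnchor) {
        List<String> result = new ArrayList<>();
        // Step 2: follow "prep-in" / "prep-at" edges out of the event node.
        for (Edge prep : outgoing(graph, eventAnchor, LOCATION_PREPS)) {
            Node head = prep.dependent();
            // Step 3: only proper nouns are treated as location heads.
            if (!head.pos().equals("NNP")) {
                continue;
            }
            List<Node> parts = new ArrayList<>();
            parts.add(head);
            // Step 3a: gather noun-compound modifiers of the head.
            for (Edge nn : outgoing(graph, head, Set.of("nn"))) {
                parts.add(nn.dependent());
            }
            // Step 3b: restore sentence order.
            parts.sort(Comparator.comparingInt(Node::index));
            // Step 3c: join the words into the location phrase.
            result.add(parts.stream().map(Node::word).collect(Collectors.joining(" ")));
        }
        return result;
    }

    public static void main(String[] args) {
        // Hand-built graph for the invented sentence "The meeting happened at Baker Street".
        Node meeting = new Node(2, "meeting", "NN");
        Node happened = new Node(3, "happened", "VBD");
        Node baker = new Node(5, "Baker", "NNP");
        Node street = new Node(6, "Street", "NNP");
        List<Edge> graph = List.of(
                new Edge(happened, "nsubj", meeting),
                new Edge(happened, "prep-at", street),
                new Edge(street, "nn", baker));

        System.out.println(locations(graph, happened)); // prints [Baker Street]
    }
}
```

Because the traversal only looks one edge away from the anchor and one edge away from the location head, it inherits the direct-dependency assumption stated above, and it works within a single parsed sentence rather than across the lines of a multi-line address block.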

Not sure whether this could help you, but I wanted to mention it just in case. Again, any credit should go to Raj Vardhan (and John Bauer).

Freek de Bruijn
  • Thanks. I am going to try this out as well, but if the location is spread across multiple lines, segmentation becomes a little tricky. I will update with my findings. – yadab Dec 31 '15 at 05:27
  • @yadab how did you make out with this? I am looking to do something similar and don't want to reinvent the wheel. – Todd Jul 08 '16 at 17:16