4

First, a little context: I'm trying to identify street addresses in a corpus of documents. We decided the obvious solution was to use an NLP tool (Apache OpenNLP in this case), and so far everything looks great. We still need to train the model on a lot of documents, but that's not really an issue. We improved the solution by adding an extra address-validation step using the USAddress parser from Datamade. My biggest issue is that the addresses by themselves mean nothing without a location next to them. Sometimes the location is specified in the text, and we will assume that this happens quite often.

Here comes my question: Is there some way to use coreference to associate the entities in the text? Or, better yet, is there a way to annotate arbitrary words in the text and identify them as being one entity?

I've been looking at the Apache OpenNLP documentation but...it's pretty thin and I think it still needs some work.

  • What do you mean "location"? What's an example of an address with an associated location? – fgregg Jul 01 '16 at 01:10
  • Let's take this sentence as an example: "Located at **909 West Temple St.** in the development-heavy Civic Center submarket of **Los Angeles**, the community totals 526 units." In this case Los Angeles would be the location. – Tudor Marghidanu Jul 01 '16 at 08:03
  • So "los Angeles" is a "location"? – fgregg Jul 01 '16 at 13:50
  • In this case, yes. What I'm trying to say is that we have two types of entities, **Address** and **Location**, and I need to establish a relation between them. I'm curious whether I can do that with OpenNLP. – Tudor Marghidanu Jul 01 '16 at 13:52

3 Answers

1

If you want to use coreference for this problem, you can have a look at this blog

But a simpler solution would be to use a sentence detector + regex, or a location NER + sentence detector (presuming each address is on a single line).

I think US addresses can be identified using a regular expression; once the regex matches, you can use OpenNLP's sentence detector to print the whole address line.

Similarly, you can use the NER model provided by OpenNLP to find locations and print the sentence you want.
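
To make the regex + sentence detector idea concrete, here is a minimal sketch in plain Python. It is not OpenNLP: the naive `split_sentences` function stands in for a real sentence detector, and `ADDRESS_RE` is a deliberately simple, hypothetical pattern that only covers addresses shaped like "909 West Temple St.". All names are illustrative assumptions.

```python
import re

# Hypothetical, deliberately simple pattern: a house number, one to four
# capitalized words, then a common street suffix. Real US addresses need more.
ADDRESS_RE = re.compile(
    r"\b\d{1,6}\s+(?:[A-Z][A-Za-z]*\.?\s+){1,4}"
    r"(?:St|Street|Ave|Avenue|Blvd|Boulevard|Rd|Road|Dr|Drive|Ln|Lane)\b\.?"
)

# Naive stand-in for a sentence detector: split after ., ! or ? when the
# next token starts with a capital letter. Abbreviations followed by a
# capitalized word would be split incorrectly.
SENTENCE_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Return the non-empty sentences found in text."""
    return [s.strip() for s in SENTENCE_BOUNDARY_RE.split(text) if s.strip()]


def sentences_with_addresses(text):
    """Return (address, sentence) pairs for every regex match."""
    results = []
    for sentence in split_sentences(text):
        for match in ADDRESS_RE.finditer(sentence):
            results.append((match.group(0), sentence))
    return results


if __name__ == "__main__":
    sample = (
        "The office moved. It is now at 12 Main Street in Springfield. "
        "Located at 909 West Temple St. in the Civic Center submarket of "
        "Los Angeles, the community totals 526 units."
    )
    for address, sentence in sentences_with_addresses(sample):
        print(address, "->", sentence)
```

In a real pipeline you would probably replace `split_sentences` with a trained sentence detector, since it handles abbreviations far better than this regex.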

Hope this helps!

edit

This GitHub repo made it simple for us. Check it out!

iamgr007
  • We can already identify the street addresses, and we also identify locations (via Clavin). The thing is that I have to link each address with a location. One simple solution would be to mix and match addresses and locations and geocode them together, but that could give false positives. – Tudor Marghidanu Jul 01 '16 at 08:06
  • Does the street address have a pincode? You can get the location from that, right? – iamgr007 Jul 01 '16 at 08:12
  • You can do something like a simple if condition that checks whether the location and address are in the same sentence. But try coreference and tell us if it works. All the best. – iamgr007 Jul 01 '16 at 08:22
  • Thanks, I'll try to look for locations first. Coreference seems to be a bit problematic, especially since there's no documentation for it in the OpenNLP manual. – Tudor Marghidanu Jul 01 '16 at 08:57
  • @TudorMarghidanu check out that repo I mentioned in the edit. Hope that helps! – iamgr007 Jul 03 '16 at 16:51
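
The same-sentence check suggested in the comments above could look roughly like the sketch below. It assumes the addresses and locations have already been extracted as character spans by whatever tools you use; the `(start, end, text)` span format, the naive sentence splitter, and the `find_span` helper are all hypothetical and only there to keep the example self-contained.

```python
import re

# Naive sentence boundary: ., ! or ? followed by whitespace and a capital.
BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentence_spans(text):
    """Return (start, end) character offsets of each sentence."""
    spans, start = [], 0
    for m in BOUNDARY_RE.finditer(text):
        spans.append((start, m.start()))
        start = m.end()
    spans.append((start, len(text)))
    return spans


def find_span(text, phrase):
    """Hypothetical helper: span (start, end, phrase) of the first occurrence."""
    start = text.index(phrase)
    return (start, start + len(phrase), phrase)


def pair_in_same_sentence(text, addresses, locations):
    """Pair every address with every location that shares its sentence."""
    pairs = []
    for s_start, s_end in sentence_spans(text):

        def inside(span):
            return s_start <= span[0] and span[1] <= s_end

        for address in filter(inside, addresses):
            for location in filter(inside, locations):
                pairs.append((address[2], location[2]))
    return pairs


if __name__ == "__main__":
    text = (
        "Located at 909 West Temple St. in the development-heavy Civic "
        "Center submarket of Los Angeles, the community totals 526 units. "
        "Another site is in Chicago."
    )
    addresses = [find_span(text, "909 West Temple St.")]
    locations = [find_span(text, "Los Angeles"), find_span(text, "Chicago")]
    print(pair_in_same_sentence(text, addresses, locations))
```

This only avoids the cross-sentence false positives; a sentence with several addresses and several locations still produces every combination.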
0

OpenNLP does not provide a coreference resolution module.
You have to use either the Stanford, Illinois, or Berkeley system to accomplish the task. They may not work out of the box; you may have to do some parameter tuning or supervised training to achieve reasonable performance.

@edit
Thanks @Alaye for pointing out that OpenNLP does have a coref module, for more details see his answer.

Thanks

Vihari Piratla
  • OpenNLP has coreference resolution, refer to this [blog](http://blog.dpdearing.com/2012/11/making-coreference-resolution-with-opennlp-1-5-0-your-bitch/) – iamgr007 Jul 01 '16 at 06:15
0

OK, several months later! It wasn't coref that I was after... what I was actually looking for was Relation Extraction (Information Extraction). I used MITIE (BinaryRelation) and that did the trick. I trained my own model using the Brat annotation tool and got an F1 score of 0.81. Pretty neat...