I am trying to build an entity resolution system, where my entities are:

(i) General named entities, that is, organization, person, location, date, time, money, and percent.
(ii) Some other entities, like product and title of a person (president, CEO, etc.).
(iii) Coreferred entities, like pronoun, determiner phrase, synonym, string match, demonstrative noun phrase, alias, and apposition.

Based on various literature and other references, I have defined the scope so that I do not consider the ambiguity of an entity beyond its entity category. That is, I treat the Oxford of Oxford University as different from Oxford the place, since the former is the first word of an organization entity and the latter is a location entity.

My task is to construct a resolution algorithm that extracts and resolves the entities.

So, I am working out an entity extractor first. Second, to relate the coreferences, I found that various literature, like this seminal work, works out a decision-tree-based algorithm with features such as distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
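To make the feature list concrete, here is a minimal sketch of how a few of those pairwise features could be computed for a candidate (antecedent i, anaphor j) pair. The `Mention` record, its field names, and the `normalize` helper are assumptions made for illustration; they are not taken from the cited work, and features such as alias and apposition are left out.

```python
import re
from collections import namedtuple

# Hypothetical mention record; the field names are illustrative assumptions.
Mention = namedtuple("Mention", "text sent_idx is_pronoun number gender sem_class")

DEMONSTRATIVES = {"this", "that", "these", "those"}

def normalize(text):
    """Lowercase and drop a leading determiner before comparing strings."""
    return re.sub(r"^(the|a|an|this|that|these|those)\s+", "", text.lower()).strip()

def pair_features(antecedent, anaphor):
    """Build a feature dict for one (antecedent i, anaphor j) mention pair."""
    first_word = anaphor.text.lower().split()[0]
    return {
        "distance": anaphor.sent_idx - antecedent.sent_idx,  # in sentences
        "i_pronoun": antecedent.is_pronoun,
        "j_pronoun": anaphor.is_pronoun,
        "string_match": normalize(antecedent.text) == normalize(anaphor.text),
        "definite_np": first_word == "the",
        "demonstrative_np": first_word in DEMONSTRATIVES,
        "number_agree": antecedent.number == anaphor.number,
        "gender_agree": antecedent.gender == anaphor.gender,
        "semclass_agree": antecedent.sem_class == anaphor.sem_class,
    }

obama = Mention("Obama", 0, False, "sg", "m", "PERSON")
he = Mention("he", 1, True, "sg", "m", "PERSON")
features = pair_features(obama, he)
```

Each such dict would be one training row for a pairwise classifier (a decision tree in the cited approach), labelled as coreferent or not.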

The algorithm seems a nice one, where entities are extracted with a Hidden Markov Model (HMM).

I could work out an entity recognition system with an HMM. Now I am trying to work out a coreference as well as an entity resolution system. I was wondering whether, instead of using so many features, I could use an annotated corpus and train an HMM-based tagger on it directly, with a view to solving relationship extraction on input like:
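For reference, this is a minimal sketch of what "train an HMM tagger directly on an annotated corpus" can mean: count tag transitions and word emissions, then decode with Viterbi. The three training sentences are made up, add-one smoothing is an arbitrary choice, and the function names are hypothetical; a real experiment would use the annotated corpus and probably a library tagger.

```python
import math
from collections import Counter, defaultdict

def train_hmm(tagged_sentences):
    """Count tag transitions and word emissions from lists of (word, tag)."""
    trans = defaultdict(Counter)  # trans[prev_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo tag
        for word, tag in sent:
            word = word.lower()
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
    return trans, emit, vocab

def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability of key under the given counts."""
    return math.log((counter[key] + 1.0) / (sum(counter.values()) + n_outcomes))

def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for a non-empty word list."""
    tags = sorted(emit)
    n_words = len(vocab) + 1  # one extra slot for unknown words
    words = [w.lower() for w in words]
    # best[t][tag] = (log score of best path ending in tag, previous tag)
    best = [{}]
    for tag in tags:
        score = log_prob(trans["<s>"], tag, len(tags)) + log_prob(emit[tag], words[0], n_words)
        best[0][tag] = (score, None)
    for t in range(1, len(words)):
        best.append({})
        for tag in tags:
            prev_tag, score = max(
                ((p, best[t - 1][p][0] + log_prob(trans[p], tag, len(tags))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[t][tag] = (score + log_prob(emit[tag], words[t], n_words), prev_tag)
    # Follow back-pointers from the best final tag.
    path = [max(tags, key=lambda tg: best[-1][tg][0])]
    for t in range(len(words) - 1, 0, -1):
        path.append(best[t][path[-1]][1])
    return list(reversed(path))

# Toy training data (made up for illustration).
train = [
    [("Obama", "PERS"), ("spoke", "NA"), ("in", "NA"), ("Washington", "LOC")],
    [("he", "PPERS"), ("spoke", "NA"), ("in", "NA"), ("Chicago", "LOC")],
    [("Obama", "PERS"), ("gave", "NA"), ("his", "PoPERS"), ("speech", "NA")],
]
trans, emit, vocab = train_hmm(train)
predicted = viterbi(["Obama", "spoke", "in", "Chicago"], trans, emit, vocab)
```

With this toy data the decoder tags the test sentence as PERS, NA, NA, LOC. Note that a first-order HMM only sees the previous tag, so a tag like PPERS can mark "he" as a pronoun for some person, but the model has no memory of which person was mentioned earlier.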

*"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"

where PERS   -> PERSON
      PPERS  -> PERSONAL PRONOUN TO PERSON
      PoPERS -> POSSESSIVE PRONOUN TO PERSON
      APPERS -> APPOSITIVE TO PERSON
      LOC    -> LOCATION
      NA     -> NOT AVAILABLE*
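One thing the example tagging leaves open is which person a PPERS, PoPERS, or APPERS token refers to. As a rough sketch, the tagged string can be parsed and each referring token linked to the nearest preceding PERS token. This nearest-antecedent rule is a naive baseline assumed here for illustration, not part of the approach described above, and the helper names are hypothetical.

```python
PERSON_TAGS = {"PERS"}
REFERRING_TAGS = {"PPERS", "PoPERS", "APPERS"}

def parse_tagged(text):
    """Split 'word/TAG' chunks into (word, tag) pairs.

    A chunk without a slash (such as 'Mr.') is joined to the next tagged
    chunk, and punctuation trailing the tag is stripped.
    """
    tokens, pending = [], []
    for chunk in text.split():
        if "/" in chunk:
            word, tag = chunk.rsplit("/", 1)
            pending.append(word)
            tokens.append((" ".join(pending), tag.strip(",.;")))
            pending = []
        else:
            pending.append(chunk)
    return tokens

def link_to_nearest_person(tokens):
    """Link each referring token to the most recent PERS token before it."""
    links, last_person = [], None
    for word, tag in tokens:
        if tag in PERSON_TAGS:
            last_person = word
        elif tag in REFERRING_TAGS and last_person is not None:
            links.append((word, last_person))
    return links

example = (
    "Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, "
    "he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA small/NA as/NA it/NA "
    "may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"
)
tokens = parse_tagged(example)
links = link_to_nearest_person(tokens)
```

Here `links` pairs "he", "his", and "Mr. President" with "Obama". With two persons in the same text this rule would often pick the wrong one, which is where agreement and distance features usually come in.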

Would I be wrong? I ran an experiment with around 10,000 words, and early results seem encouraging. With support from one of my colleagues, I am trying to insert some semantic information into the tagset, like PERSUSPOL, LOCCITUS, PoPERSM, etc. (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE), to incorporate entity disambiguation in one go. My feeling is that relationship extraction would be much better now. Please see this new thought too. I also got some good results with a Naive Bayes classifier, where sentences having predominantly one set of keywords are marked as one class.
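For the Naive Bayes part, a minimal sketch of the idea (sentences dominated by one set of keywords end up in one class) could look like the following. The training sentences and the class labels POLITICS and BUSINESS are made up for illustration, and this is a plain multinomial Naive Bayes with add-one smoothing rather than any specific library implementation.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_sentences):
    """Collect per-class document counts and per-class word counts."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for sentence, label in labeled_sentences:
        class_counts[label] += 1
        for word in sentence.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def classify(sentence, class_counts, word_counts, vocab):
    """Return the class with the highest smoothed log posterior."""
    total_docs = float(sum(class_counts.values()))
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)
        for word in sentence.lower().split():
            if word in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][word] + 1.0) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)

# Toy training data (made up for illustration).
train = [
    ("the president gave a speech to congress", "POLITICS"),
    ("the senator won the election", "POLITICS"),
    ("the company launched a new product", "BUSINESS"),
    ("the ceo announced record profit", "BUSINESS"),
]
model = train_nb(train)
label_a = classify("the ceo launched a product", *model)
label_b = classify("the senator gave a speech", *model)
```

On this toy data `label_a` comes out as BUSINESS and `label_b` as POLITICS, because each test sentence shares most of its keywords with one class.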

If anyone can suggest a different approach, please feel free to do so.

I use Python 2.x on MS Windows and try to use libraries like NLTK, scikit-learn, Gensim, pandas, NumPy, SciPy, etc.

Thanks in advance.

Coeus2016
  • There was a posting problem. Both examples were treated as code by the automatic text editor, but they are examples, not code. – Coeus2016 Apr 10 '16 at 20:33

1 Answer

It seems that you are going down three different paths, each of which could be a standalone PhD. There is a lot of literature on each of them. My first advice: focus on the main task and outsource the rest. If you are going to develop this for a less widely studied language, you can also build on others' work.

Named Entity Recognition

Stanford NLP has gone really far in this, especially for English. It resolves named entities really well, is widely used, and has a nice community.

Other solutions may exist, such as OpenNLP, for Python.

Some have tried to extend it to unusual fine-grained types, but you need much more training data to cover the cases, and the decision becomes much harder.

Edit: Stanford NER is available through NLTK in Python.

Named Entity Resolution/Linking/Disambiguation

This is concerned with linking the name to some knowledge base, and solves the problem of whether "Oxford" means Oxford University or the city of Oxford.
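As a toy illustration of what linking to a knowledge base involves, the sketch below picks the candidate entity whose keyword set overlaps most with the words around the mention. The knowledge base, entity ids, and keyword sets are all made up for this example, and real systems use far richer context and coherence signals than simple word overlap.

```python
# Toy knowledge base: entity ids, types, and context keywords are made up.
KNOWLEDGE_BASE = {
    "Oxford": [
        ("Oxford_University", "ORG",
         {"university", "college", "students", "professor", "research"}),
        ("Oxford_City", "LOC",
         {"city", "england", "river", "town", "train"}),
    ],
}

def link_entity(mention, context, kb=KNOWLEDGE_BASE):
    """Return the id of the candidate whose keywords best match the context.

    Returns None if the mention has no candidates. On a tie, the first
    candidate listed in the knowledge base wins.
    """
    context_words = set(w.strip(".,").lower() for w in context.split())
    candidates = kb.get(mention, [])
    if not candidates:
        return None
    best = max(candidates, key=lambda cand: len(cand[2] & context_words))
    return best[0]

first = link_entity("Oxford", "She is a professor at Oxford and teaches students.")
second = link_entity("Oxford", "We took the train to Oxford, a city in England.")
```

Here `first` resolves to the university entry and `second` to the city entry, purely from the surrounding words.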

AIDA: one of the state-of-the-art systems for this. It uses different context information as well as coherence information. They have also tried supporting several languages, and they have a good benchmark.

Babelfy: offers an interesting API that does NER and NED for entities and concepts. It also supports many languages, but it never worked very well.

There are others, like tagme, wikifi, etc.

Coreference Resolution

Stanford CoreNLP also has some good work in that direction. I can also recommend this work, where they combined coreference resolution with NED.

Mohamed Gad-Elrab