I am trying to build an entity resolution system, where my entities are:
(i) general named entities, i.e. organization, person, location, date, time, money, and percent;
(ii) some other entities such as product and person titles like president, CEO, etc.;
(iii) coreferred entities such as pronouns, determiner phrases, synonyms, string matches, demonstrative noun phrases, aliases, and appositions.
From the literature and other references, I have defined the scope so that I do not consider the ambiguity of an entity beyond its entity category. That is, I treat the "Oxford" of "Oxford University" as different from "Oxford" the place, since the former is the first word of an organization entity and the latter is a location entity.
My task is to construct a single resolution algorithm that extracts and resolves the entities.
So I am building the entity extractor first. Next, to relate the coreferences: the approaches I found in the literature, like this seminal work, use a decision-tree-based algorithm with features such as distance, i-pronoun, j-pronoun, string match, definite noun phrase, demonstrative noun phrase, number agreement, semantic class agreement, gender agreement, both proper names, alias, apposition, etc.
The algorithm seems a nice one, and the entities themselves are extracted with a Hidden Markov Model (HMM).
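For concreteness, below is a minimal sketch of that pairwise, decision-tree style setup using scikit-learn; the feature names, the toy mention pairs, and the labels are my own illustration, not the exact feature set or data format of the cited work.

```python
# Pairwise coreference as classification: each candidate mention pair
# becomes a feature vector, labelled 1 (coreferent) or 0 (not coreferent).
from sklearn.tree import DecisionTreeClassifier

FEATURES = ["distance", "i_pronoun", "j_pronoun", "string_match",
            "number_agreement", "gender_agreement", "both_proper_names",
            "alias", "apposition"]

def featurize(pair):
    """Turn a dict of mention-pair attributes into a fixed-length vector."""
    return [pair.get(name, 0) for name in FEATURES]

# Hypothetical training pairs: (attribute dict, label).
train_pairs = [
    ({"distance": 1, "j_pronoun": 1, "number_agreement": 1,
      "gender_agreement": 1}, 1),                      # e.g. "Obama ... he"
    ({"distance": 3, "both_proper_names": 1, "string_match": 0}, 0),
    ({"distance": 0, "apposition": 1, "number_agreement": 1}, 1),
]

X = [featurize(p) for p, _ in train_pairs]
y = [label for _, label in train_pairs]

clf = DecisionTreeClassifier().fit(X, y)

# Score a new candidate pair, e.g. ("Obama", "his").
print(clf.predict([featurize({"distance": 2, "j_pronoun": 1,
                              "gender_agreement": 1})]))
```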
I have already built an entity recognition system with an HMM. Now I am trying to build a coreference resolution as well as an entity resolution system. I was wondering whether, instead of using so many features, I could take an annotated corpus and train an HMM-based tagger on it directly, with a view to solving relationship extraction as well, e.g.:
*"Obama/PERS is/NA delivering/NA a/NA lecture/NA in/NA Washington/LOC, he/PPERS knew/NA it/NA was/NA going/NA to/NA be/NA
small/NA as/NA it/NA may/NA not/NA be/NA his/PoPERS speech/NA as/NA Mr. President/APPERS"
where, PERS-> PERSON
PPERS->PERSONAL PRONOUN TO PERSON
PoPERS-> POSSESSIVE PRONOUN TO PERSON
APPERS-> APPOSITIVE TO PERSON
LOC-> LOCATION
NA-> NOT AVAILABLE*
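As a quick illustration of what I mean by training a tagger directly on such annotations, here is a minimal sketch with NLTK's supervised HMM trainer; the two training sentences are a toy stand-in for a real annotated corpus, and the Lidstone smoothing choice is just one reasonable option, not something my setup prescribes.

```python
# Train an HMM tagger on sentences annotated with the coreference-aware
# tagset shown above; each sentence is a list of (word, tag) pairs.
from nltk.tag.hmm import HiddenMarkovModelTrainer
from nltk.probability import LidstoneProbDist

train_sents = [
    [("Obama", "PERS"), ("is", "NA"), ("delivering", "NA"), ("a", "NA"),
     ("lecture", "NA"), ("in", "NA"), ("Washington", "LOC"),
     ("he", "PPERS"), ("knew", "NA"), ("it", "NA"), ("was", "NA"),
     ("going", "NA"), ("to", "NA"), ("be", "NA"), ("small", "NA")],
    [("it", "NA"), ("may", "NA"), ("not", "NA"), ("be", "NA"),
     ("his", "PoPERS"), ("speech", "NA"), ("as", "NA"),
     ("Mr.", "APPERS"), ("President", "APPERS")],
]

trainer = HiddenMarkovModelTrainer()
# Lidstone smoothing so that words unseen in training do not get zero mass.
tagger = trainer.train_supervised(
    train_sents,
    estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

# Tag a new sentence with the same scheme.
print(tagger.tag("Obama knew his lecture was small".split()))
```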
Would I be wrong? I ran an experiment with around 10,000 words, and the early results seem encouraging. With support from one of my colleagues, I am also trying to add semantic information to the tagset, with tags like PERSUSPOL, LOCCITUS, PoPERSM (for PERSON OF US IN POLITICS, LOCATION CITY US, POSSESSIVE PERSON MALE), so that entity disambiguation is incorporated in one go. My feeling is that relationship extraction would be much better then. Please consider this new idea too. I have also obtained some good results with a Naive Bayes classifier, where sentences containing predominantly one set of keywords are marked as one class.
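Regarding that Naive Bayes result, here is a minimal sketch of the sentence-classification idea with scikit-learn; the class names and example sentences are made up for illustration and are not my actual data.

```python
# Sentences dominated by one set of keywords get mapped to one class via a
# bag-of-words Multinomial Naive Bayes pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

sentences = [
    "Obama is delivering a lecture in Washington",
    "The president addressed the senate on the new policy",
    "The company launched a new product in the market",
    "The CEO announced record quarterly revenue",
]
labels = ["politics", "politics", "business", "business"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(sentences, labels)

print(model.predict(["The president spoke about the new product"]))
```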
If anyone can suggest a different approach, please feel free to do so.
I use Python 2.x on MS Windows, and I try to use libraries like NLTK, scikit-learn, Gensim, pandas, NumPy, SciPy, etc.
Thanks in advance.