4

I need to build a classifier which identifies NEs in a specific domain. For instance, if my domain is Hockey or Football, the classifier should accept NEs in that domain but NOT all pronouns it sees on web pages. My ultimate goal is to improve text classification through NER.

For those working in this area: how should I build such a classifier? Thanks!

samsamara

2 Answers

1

If all you want is to ignore pronouns, you can run any POS tagger followed by any NER algorithm (the Stanford package is a popular implementation) and then ignore any named entities which are pronouns. However, the pronouns might refer to named entities, which may or may not turn out to be important for the performance of your classifier. The only way to tell for sure is to try.
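
The post-processing step described above can be sketched as follows. This is a minimal illustration, not the Stanford package's API: the `Token` type, the sample sentence, and its tags are hypothetical stand-ins for whatever your POS tagger and NER system emit, and the pronoun tags `PRP` and `PRP$` assume a Penn Treebank style tagset.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    pos: str  # POS tag (Penn Treebank style is an assumption)
    ne: str   # named-entity label, "O" when the token is not an entity


# Assumed pronoun tags; adjust to the tagset your tagger actually uses.
PRONOUN_TAGS = {"PRP", "PRP$"}


def drop_pronoun_entities(tokens: List[Token]) -> List[Token]:
    """Reset the NE label to "O" for every token tagged as a pronoun."""
    return [
        tok._replace(ne="O") if tok.pos in PRONOUN_TAGS else tok
        for tok in tokens
    ]


# Hypothetical tagger + NER output in which "he" was wrongly labelled PERSON.
tagged = [
    Token("Gretzky", "NNP", "PERSON"),
    Token("scored", "VBD", "O"),
    Token("and", "CC", "O"),
    Token("he", "PRP", "PERSON"),
    Token("thanked", "VBD", "O"),
    Token("Oilers", "NNPS", "ORG"),
]

cleaned = drop_pronoun_entities(tagged)
entities = [tok.text for tok in cleaned if tok.ne != "O"]
```

Keeping the filter separate from the tagger and the NER model means either can be swapped out without touching this step.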

A slightly unrelated comment: an NER system trained on domain-specific data (e.g. hockey) is more likely to pick up entities from that domain because it will have seen some of the contexts those entities appear in. Depending on the system, it might also pick up entities from other domains (which you do not want, if I understand your question correctly) because of syntax, word shape patterns, etc.

mbatchkarov
  • What I want to do is web page classification based purely on NER. That's why I have chosen a narrowed-down domain like Hockey or Football (NOT sports in general). So I want the classifier to identify pronouns (players' names, teams, equipment manufacturers, etc., which all might be related) in that domain but NOT all the pronouns. – samsamara Apr 03 '12 at 08:36
  • contd. It's OK for the classifier to pick up a few unrelated entities, since no classifier has 100% precision. I don't understand how the POS tagger followed by the NER algorithm ignores pronouns as you mentioned. I think what I want is what you mentioned in 'A slightly unrelated comment'; yes, the classifier should learn the contexts the entities appear in. So I will have to collect training data by manually creating lists of those entities, right? – samsamara Apr 03 '12 at 08:48
  • The POS tagger I mentioned was not for the NER classifier (although POS tags are useful features); it's for your postprocessing. After NE tagging, I suggest you remove any named entities whose POS tag is PP. – mbatchkarov Apr 03 '12 at 09:19
  • Thank you for answering. Can you recommend some useful links to read on this? I have gone through StanfordNER and the LingPipe NERecognizer; are there any other good ones besides these? – samsamara Apr 03 '12 at 09:58
  • I'd like to follow up with you again. :) Let's say I have extracted the NEs in a particular web page. As I mentioned, my ultimate goal is to improve text classification via NEs. So would it be OK to use the number of NEs (PERS=x, LOC=y, ORG=z) as features along with the normal text (document) classification features, in order to improve the classification accuracy? Do you think that's fine? – samsamara Apr 10 '12 at 10:32
  • Try including domain-specific counts, e.g. PER_golf, PER_football, PER_baseball. The idea is to tell your classifier how many named entities for each domain you found. If the document is about baseball, you will find many baseball entities and not many football entities, so this might be a useful feature. You have to try and see. – mbatchkarov Apr 10 '12 at 12:29
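
The domain-specific counts suggested in the last comment could be computed roughly as below. This is only a sketch under assumptions: the `DOMAIN_ENTITIES` lists and the sample entities are hypothetical, and exact string lookup stands in for whatever method you use to decide which domain an entity belongs to.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical per-domain entity lists; in practice these would be collected
# by hand or produced by a domain-specific NER model.
DOMAIN_ENTITIES = {
    "hockey": {"Wayne Gretzky", "Toronto Maple Leafs"},
    "football": {"Lionel Messi", "FC Barcelona"},
}


def domain_entity_counts(entities: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count entities per (label, domain) pair, e.g. "PER_hockey"."""
    counts: Counter = Counter()
    for text, label in entities:
        for domain, names in DOMAIN_ENTITIES.items():
            if text in names:
                counts[f"{label}_{domain}"] += 1
    return dict(counts)


# Hypothetical (text, label) pairs extracted from one web page.
page_entities = [
    ("Wayne Gretzky", "PER"),
    ("Toronto Maple Leafs", "ORG"),
    ("Lionel Messi", "PER"),
    ("Wayne Gretzky", "PER"),
]

features = domain_entity_counts(page_entities)
```

The resulting dictionary can be appended to the usual document features (for example bag-of-words counts) before training the text classifier.
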
0

I think something like AutoNER might be useful for this. Essentially, the input to the system is text documents from a particular domain and a list of domain-specific entities that you'd like the system to recognize (like Hockey players in your case).

According to the results in their paper, it performs well at recognizing chemical names and disease names, among others.
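
To illustrate the kind of input described above (raw domain text plus a list of domain entities), here is a minimal dictionary-matching sketch. It is not AutoNER's code, API, or tagging scheme; the function name, the BIO labels, and the sample dictionary are assumptions made for illustration only.

```python
from typing import Dict, List


def dictionary_label(tokens: List[str], entity_dict: Dict[str, str],
                     max_len: int = 4) -> List[str]:
    """Label tokens with BIO tags by longest-match lookup in entity_dict."""
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in entity_dict:
                etype = entity_dict[phrase]
                labels[i] = "B-" + etype
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + etype
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


# Hypothetical domain dictionary and sentence.
hockey_dict = {"Wayne Gretzky": "PER", "Edmonton Oilers": "ORG"}
sentence = "Wayne Gretzky joined the Edmonton Oilers".split()
noisy_labels = dictionary_label(sentence, hockey_dict)
```

Labels produced this way are noisy (names missing from the list are left as "O"), which is why they are usually treated as weak supervision for training a tagger rather than as final output.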

Manik bhandari