I want to recognize named entities in a specific field (e.g. baseball). I know there are tools available like StanfordNER, LingPipe, and AlchemyAPI, and I have done a little testing with them. But I want them to be field-specific, as mentioned above. How is this possible?
-
By 'specific field', do you mean specific domain or area? Training the NER to a particular corpus for a specific domain may be one solution. – Kenston Choi Apr 07 '12 at 03:04
-
@Kenston My mistake. Yes, I mean focused on a specific domain. If I train the NER classifier on names of baseball players, for instance, will it keep accepting only names from that domain and NOT the names of politicians or anyone else? Do these tools behave the way I want? – samsamara Apr 07 '12 at 06:58
-
I think it depends on the features used. If the features rely mostly on capitalization (title case or upper case), then chances are the politician names would be included. Is having a gazetteer (a list of player names) not ideal for you? – Kenston Choi Apr 07 '12 at 09:45
-
But how do you create such a list containing names of all the players? – samsamara Apr 07 '12 at 17:57
-
You can mine them from various sources on the Internet, like Wikipedia (http://en.wikipedia.org/wiki/List_of_Major_League_Baseball_players) or sports sites. It depends on how exhaustive you want the list to be, and how difficult your test data will be. Consider a baseball player who was formerly a politician. Does the context show that a certain name is likely a player? And in what context do you want to determine the players' names? Or are you trying to determine whether a certain name is likely a baseball player's, meaning it has something to do with the name itself regardless of its context? – Kenston Choi Apr 08 '12 at 03:19
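A minimal sketch of the gazetteer idea from this comment, assuming the player names have already been collected into a set. The function name `find_gazetteer_names` and the sample names are hypothetical. Note that this matches names regardless of context, which is exactly the limitation raised above.

```python
import re

def find_gazetteer_names(text, gazetteer):
    """Return the gazetteer entries that occur in `text` as whole phrases."""
    found = []
    for name in gazetteer:
        # \b keeps "Ruth" from matching inside "Ruthless"; matching is
        # case-insensitive, which is an assumption of this sketch.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return sorted(found)

# Hypothetical gazetteer; a real one would be mined from a list of players.
players = {"Babe Ruth", "Hank Aaron"}
print(find_gazetteer_names("Babe Ruth met Barack Obama.", players))
```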
-
Thanks for being consistent. This is what I want to do (as posted in the comment below): For my research I'm building a focused web crawler which uses NEs to guide its crawl on a given domain (e.g. baseball). The crawler can be guided solely by the NEs, or combined with machine-learning-based document classification (which is what existing approaches do). I'm thinking of a way to do this. Please have a look at this question of mine as well: stackoverflow.com/questions/10077647/… What are your thoughts on this? Thanks. – samsamara Apr 10 '12 at 05:14
2 Answers
One approach may be to:
1. Use a general (non-domain-specific) tool to detect people's names.
2. Use a subject classifier to filter out texts that are not in the domain.
If the total size of the data set is sufficient and the accuracy of the extractor and classifier is good enough, you can use the result to obtain a list of people's names that are closely related to the domain in question (e.g. by restricting the results to those that are mentioned significantly more often in domain-specific texts than in other texts).
In the case of baseball, this should be a fairly good way of getting a list of people related to baseball. It would, however, not be a good way to obtain a list of baseball players only. For the latter it would be necessary to analyse the precise context in which the names are mentioned and the things said about them; but perhaps that is not required.
Edit: By subject classifier I mean the same as what other people might refer to simply as categorization, document classification, domain classification, or similar. Examples of ready-to-use tools include the classifier in Python-NLTK (see here for examples) and the one in LingPipe (see here).
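The filtering step can be sketched as follows. This is only an illustration under stated assumptions: each document is represented as the list of person names a general NER tool has already extracted, the documents have already been split into in-domain and out-of-domain by a classifier, and a simple smoothed frequency ratio stands in for a proper significance test. The function name, thresholds, and sample data are all hypothetical.

```python
from collections import Counter

def domain_specific_names(domain_docs, other_docs, min_ratio=3.0, min_count=2):
    """Return names mentioned far more often in domain documents.

    Each document is a list of person names already extracted by a
    general NER tool (an assumption of this sketch).
    """
    # Count the number of documents mentioning each name (document frequency).
    domain_counts = Counter(name for doc in domain_docs for name in set(doc))
    other_counts = Counter(name for doc in other_docs for name in set(doc))
    n_domain = max(len(domain_docs), 1)
    n_other = max(len(other_docs), 1)

    selected = []
    for name, count in domain_counts.items():
        if count < min_count:
            continue  # too rare to judge
        domain_rate = count / n_domain
        # Add-one smoothing avoids division by zero for names
        # that never occur outside the domain.
        other_rate = (other_counts[name] + 1) / (n_other + 1)
        if domain_rate / other_rate >= min_ratio:
            selected.append(name)
    return sorted(selected)

# Hypothetical sample data.
baseball_docs = [["Babe Ruth", "Barack Obama"],
                 ["Babe Ruth", "Hank Aaron"],
                 ["Hank Aaron", "Babe Ruth"]]
other_docs = [["Barack Obama"], ["Barack Obama", "Angela Merkel"],
              ["Angela Merkel"], []]
print(domain_specific_names(baseball_docs, other_docs))
# → ['Babe Ruth', 'Hank Aaron']
```

As noted above, a list obtained this way contains people related to baseball, not necessarily players only.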

-
I don't know about subject classifiers. Could you point me to a link for that? This is what I want to do: For my research I'm building a focused web crawler which uses NEs to guide its crawl on a given domain (e.g. baseball). The crawler can be guided solely by the NEs, or combined with machine-learning-based document classification (which is what existing approaches do). I'm thinking of a way to do this. Please have a look at this question of mine as well: http://stackoverflow.com/questions/10077647/named-entities-as-a-feature-in-text-categorization. I really appreciate your comments. Thanks. – samsamara Apr 10 '12 at 05:10
-
@user601357: I just mean a text classifier, more or less the same as what you refer to as _document classification_. I have added a few links to the answer anyway. – jogojapan Apr 10 '12 at 09:51
-
Thanks. How should I incorporate NEs into text classification? What I have thought of so far is to count the number of different Named Entities (PERS = x, LOC = y, ORG = z) and use the counts as features along with the normal text classification features. What are your thoughts on this? – samsamara Apr 10 '12 at 10:22
-
@user601357: I'd be surprised if the _number_ of people or locations mentioned in a document was much of an indicator of the domain. But the _names_ themselves certainly are. I guess the most important thing would be to include the names themselves as features. (It would be quite important to check how many extra features that actually gives you. I suppose many traditional approaches use a POS tagger and include noun phrases in the features. Many of those will therefore include names already, because they get them as part of the noun phrases). – jogojapan Apr 10 '12 at 10:55
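One way to read "include the names themselves as features" is sketched below. It assumes the tokens and the entity strings come from a tokenizer and an NE extractor that have already been run; `build_features` and the `NE=` prefix are hypothetical choices for illustration, not part of any particular library.

```python
from collections import Counter

def build_features(tokens, entities):
    """Combine bag-of-words counts with one feature per named entity."""
    features = Counter(token.lower() for token in tokens)
    for entity in entities:
        # The prefix keeps the full name apart from ordinary word features.
        features["NE=" + entity] += 1
    return dict(features)

# Hypothetical tokenizer and NER output for one sentence.
tokens = ["Babe", "Ruth", "hit", "a", "home", "run"]
entities = ["Babe Ruth"]
print(build_features(tokens, entities))
```

Comparing the size of the feature set with and without the `NE=` entries is a quick way to check how many extra features the names actually contribute.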
-
Also, with the TF-IDF representation, since we are considering all the terms, we implicitly use NEs, right? So is there still going to be a way to use NEs to improve text classification? Oh, I'm sort of worried about my research. – samsamara Apr 10 '12 at 11:25
-
A pure TF-IDF approach _might_ use single words only, so the NEs (the complete names) would not be included. If the TF-IDF approach is carried out on the basis of n-grams, it would include the NEs, but it would also include a lot of noise. Also, a pure noun phrase extractor (as in my previous comment) would certainly not be as good as a high-quality NE extractor. Then again, are you sure that using NEs for categorization has not been tried before anyway? I just tried "named entities categorization" on Google Scholar, and some of that seems relevant. Of course you can always further improve things... – jogojapan Apr 10 '12 at 12:24
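The difference between single-word terms and n-grams can be seen in a small sketch. It uses plain whitespace tokenization and a hypothetical `ngrams` helper; the TF-IDF weighting itself is left out, since the point is only which terms exist to be weighted.

```python
def ngrams(text, n):
    """Return the lowercased word n-grams of `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "Babe Ruth hit a home run"
unigrams = ngrams(sentence, 1)
bigrams = ngrams(sentence, 2)

print("babe ruth" in unigrams)  # the full name is split into two terms
print("babe ruth" in bigrams)   # the full name survives as one term
print(bigrams)                  # but so do noise terms such as "ruth hit"
```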
-
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/9913/discussion-between-user601357-and-jogojapan) – samsamara Apr 10 '12 at 12:44
Have a look at smile-ner.appspot.com, which covers 250+ categories. In particular, it covers a lot of people, teams, and clubs in sports. It may be useful for your purpose.