Disease named entity recognition

Question

I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here:

Primary pulmonary hypertension is a progressive disease in which widespread occlusion of the smallest pulmonary arteries leads to increased pulmonary vascular resistance, and subsequently right ventricular failure.

What I need is a tool that finds all disease terms (e.g. "pulmonary hypertension" in this case) in the sentences and maps them to a controlled vocabulary like MeSH.

Thanks in advance for your answers!

That sounds very specific and not a programming problem *per se*. At least not as expressed here. — Brian Agnew, Sep 25 '12 at 08:22

score 6 · Answer 1 · answered May 14 '13 at 03:08


Here are two pipelines that are specifically designed for medical document parsing:

Both use UMLS, the Unified Medical Language System, and thus require a (free) license. Both are written in Java and are more or less easy to set up.

Pascal

I'm not sure I'd classify them as "easy to set up" but they do work rather well. A new version of MetaMap was released late last year as well. – Brian Dolan Jun 25 '15 at 23:05

score 2 · Answer 2 · answered Sep 25 '12 at 14:56

See http://www.ebi.ac.uk/webservices/whatizit/info.jsf

Whatizit is a text processing system that allows you to do textmining tasks on text. The tasks come defined by the pipelines in the drop down list of the above window and the text can be pasted in the text area.

You could also ask biostars: http://www.biostars.org/show/questions/

score 2 · Answer 3 · answered May 04 '13 at 20:34

There are many tools for this. Some popular ones:

NLTK (Python)
LingPipe (Java)
Stanford NER (Java)
OpenCalais (web service)
Illinois NER (Java)

Most of them come with predefined models, i.e. they have already been trained on general datasets (news articles, etc.). However, your texts are quite specific, so you might want to first build a corpus and retrain one of those tools to adapt it to your data.

More simply, as a first test, you can try a dictionary-based approach: compile a list of entity names and perform exact or approximate matching against it. For instance, this operation is described in LingPipe's tutorial.
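
To make the dictionary-based idea concrete, here is a minimal Python sketch using only the standard library. It is not LingPipe's implementation, just an illustration of exact and approximate lookup. The lexicon, the function names, and the identifiers (`DISEASE_0001`, ...) are made up for the example; in practice you would load the terms and real IDs from MeSH or another vocabulary.

```python
import re
from difflib import get_close_matches

# Hypothetical lexicon: surface form -> vocabulary identifier.
# The identifiers are placeholders, NOT real MeSH IDs.
LEXICON = {
    "pulmonary hypertension": "DISEASE_0001",
    "right ventricular failure": "DISEASE_0002",
}


def find_exact(text, lexicon):
    """Return (start, end, term, identifier) for each exact, case-insensitive match."""
    hits = []
    # Try longer terms first so they win over shorter, overlapping ones.
    for term in sorted(lexicon, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            overlaps = any(m.start() < e and s < m.end() for s, e, _, _ in hits)
            if not overlaps:
                hits.append((m.start(), m.end(), term, lexicon[term]))
    return sorted(hits)


def find_approximate(text, lexicon, cutoff=0.85):
    """Compare word n-grams with lexicon terms using difflib's similarity ratio."""
    words = re.findall(r"\w+", text.lower())
    terms = list(lexicon)
    max_words = max(len(t.split()) for t in terms)
    hits = set()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            candidate = " ".join(words[i:i + n])
            # Keep only the single best lexicon term above the cutoff, if any.
            for term in get_close_matches(candidate, terms, n=1, cutoff=cutoff):
                hits.add((term, lexicon[term]))
    return sorted(hits)


sentence = (
    "Primary pulmonary hypertension is a progressive disease in which "
    "widespread occlusion of the smallest pulmonary arteries leads to "
    "increased pulmonary vascular resistance, and subsequently right "
    "ventricular failure."
)
for start, end, term, identifier in find_exact(sentence, LEXICON):
    print(start, end, term, identifier)
```

Exact matching is fast and precise but misses spelling variants; the approximate variant tolerates small typos at the cost of more comparisons, and the `cutoff` value of 0.85 is an arbitrary starting point that you would tune on your own data.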

score 0 · Answer 4 · answered Apr 06 '18 at 08:37

Open Targets has a module for this as part of LINK. It's not meant to be used directly, so it might require some hacking and tinkering, but it's the most complete medical NER (named entity recognition) tool I've found for Python. For more info, read their blog post.

score 0 · Answer 5 · answered Apr 28 '18 at 16:25

MER is a bash script whose example uses a lexicon generated from the Disease Ontology: https://github.com/lasigeBioTM/MER

FCouto

Links are fantastic, but they should never be the only piece of information in your answer. https://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers/8259#8259 – sɐunıɔןɐqɐp Apr 28 '18 at 16:48
