6

I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here:

Primary pulmonary hypertension is a progressive disease in which widespread occlusion of the smallest pulmonary arteries leads to increased pulmonary vascular resistance, and subsequently right ventricular failure.

What I need is a tool that finds all disease terms (e.g. "pulmonary hypertension" in this case) in the sentences and maps them to a controlled vocabulary like MeSH.

Thanks in advance for your answers!

Gibron
alex

5 Answers

6

Here are two pipelines that are specifically designed for medical document parsing:

Both use UMLS (the Unified Medical Language System) and therefore require a (free) license. Both are written in Java and are more or less easy to set up.

Pascal
  • 4
    I'm not sure I'd classify them as "easy to set up" but they do work rather well. A new version of MetaMap was released late last year as well. – Brian Dolan Jun 25 '15 at 23:05
2

See http://www.ebi.ac.uk/webservices/whatizit/info.jsf

Whatizit is a text processing system that allows you to do textmining tasks on text. The tasks come defined by the pipelines in the drop down list of the above window and the text can be pasted in the text area.

You could also ask biostars: http://www.biostars.org/show/questions/

Pierre
2

There are many tools to do that. Some popular ones:

Most of them come with predefined models, i.e. they have already been trained on general datasets (news articles, etc.). However, your texts are quite specific, so you might first want to build a corpus and retrain one of those tools to adapt it to your data.

More simply, as a first test, you can try a dictionary-based approach: compile a list of entity names and perform exact or approximate matching against the text. For instance, this operation is described in LingPipe's tutorial.
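To make the dictionary-based idea concrete, here is a minimal Python sketch (not LingPipe's implementation). It slides a window over the tokens of a sentence and compares each window against every lexicon entry using `difflib.SequenceMatcher` from the standard library. The `LEXICON` contents, the `find_terms` helper and the `0.9` threshold are assumptions made for illustration, and the two MeSH IDs are quoted from memory and should be verified against MeSH before use.

```python
import re
from difflib import SequenceMatcher

# Hypothetical mini-lexicon: surface form -> MeSH descriptor ID.
# Illustrative only; build the real one from a MeSH download and verify the IDs.
LEXICON = {
    "pulmonary hypertension": "D006976",
    "heart failure": "D006333",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def find_terms(text, lexicon=LEXICON, threshold=0.9):
    """Return (matched_text, lexicon_term, mesh_id, score) for each hit.

    A threshold of 1.0 gives exact matching; lower values tolerate
    small spelling differences (approximate matching).
    """
    tokens = tokenize(text)
    hits = []
    for term, mesh_id in lexicon.items():
        n = len(term.split())  # window size = number of words in the term
        for i in range(len(tokens) - n + 1):
            window = " ".join(tokens[i:i + n])
            score = SequenceMatcher(None, window, term).ratio()
            if score >= threshold:
                hits.append((window, term, mesh_id, round(score, 2)))
    return hits


if __name__ == "__main__":
    sentence = (
        "Primary pulmonary hypertension is a progressive disease in which "
        "widespread occlusion of the smallest pulmonary arteries leads to "
        "increased pulmonary vascular resistance."
    )
    for hit in find_terms(sentence):
        print(hit)
```

This brute-force comparison is fine for a short lexicon and single-sentence documents; for a full vocabulary you would likely want an index or a dedicated matcher instead of comparing every window against every entry.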

Vincent Labatut
0

Open Targets has a module for this as part of LINK. It's not meant to be used directly, so it might require some hacking and tinkering, but it's the most complete medical NER (named entity recognition) tool I've found for Python. For more info, read their blog post.

Syncrossus
0

MER is a bash script that includes, as an example, a lexicon generated from the Disease Ontology: https://github.com/lasigeBioTM/MER

FCouto
    Links are fantastic, but they should never be the only piece of information in your answer. https://meta.stackexchange.com/questions/8231/are-answers-that-just-contain-links-elsewhere-really-good-answers/8259#8259 – sɐunıɔןɐqɐp Apr 28 '18 at 16:48