I want to extract a taxonomy from a large raw text corpus that contains many abbreviations.

There is an R package called taxize, which lets users search many taxonomic data sources for species names.

```r
library('taxize')

# Get immediate children of Salmo
children("Salmo", db = 'ncbi')

#> $Salmo
#>    childtaxa_id                   childtaxa_name childtaxa_rank
#> 1       1509524  Salmo marmoratus x Salmo trutta        species
#> 2       1484545 Salmo cf. cenerinus BOLD:AAB3872        species

# Get synonyms
synonyms("Acer drummondii", db = "itis")
```

My question: is it possible to use taxize (or an alternative package) to extract a taxonomy from text data that contains many abbreviations? For example, how can I find the immediate children of a specific abbreviation or concept that occurs frequently in my text data but is not listed in taxonomic data sources such as "ncbi" and "itis"?

I would appreciate your comments and answers.

Thanks, Sam

Sam S.
  • Hi, `taxize` author here. Are you referring to taxonomy in the sense of species names, or in the more general sense (cf. https://en.wikipedia.org/wiki/Taxonomy)? – sckott Jun 11 '18 at 02:15
  • Thanks sckott for replying. I am wondering whether it is possible to use a custom database, created from my own unstructured text data, in taxize. I have my own vocabulary (including lots of abbreviations such as "rtw", "sol", "ac"); some of these terms may not be in any existing database, and those that are have different meanings there. I can find relations between words using word embeddings such as word2vec. Now I want to see whether taxize can be used to get the immediate children of "rtw" or to get its synonyms. – Sam S. Jun 12 '18 at 02:21
  • `taxize` functions for synonyms/children/etc. are mostly just calls to various web APIs that hold the logic, and that logic is specific to species names. There is some discussion in an underlying package, `taxa` (https://github.com/ropensci/taxa), about making it more general. That is, we use it for species names, but you can use it for any hierarchical data. Open an issue there if you check it out and have questions or problems. – sckott Jun 12 '18 at 17:22
  • Thanks so much Sckott for the information. – Sam S. Jun 14 '18 at 00:46
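
To illustrate the point from the comments that children and synonym lookups only need some hierarchical data, here is a minimal base R sketch. It does not use `taxize` or `taxa`, and it is not how either package works internally. The abbreviations "rtw" and "ac" come from the question, but the parent terms, the child terms, the synonym strings, and the helper names `custom_children()` and `custom_synonyms()` are all hypothetical and invented for illustration. In practice the edge and synonym tables would have to be built from the corpus (for example from word2vec neighbours that have been reviewed by hand).

```r
# Hypothetical custom vocabulary: each row links a term to its parent concept.
# All parents and children below are invented for illustration only.
edges <- data.frame(
  child  = c("rtw", "sol", "ac", "rtw_full", "rtw_part"),
  parent = c("status", "status", "equipment", "rtw", "rtw"),
  stringsAsFactors = FALSE
)

# Hypothetical synonym table keyed by term (expansions are made up).
synonym_table <- list(
  rtw = c("return to work", "back to work"),
  ac  = c("air conditioner")
)

# Immediate children of a term: the rows whose parent equals that term.
custom_children <- function(term, edges) {
  edges$child[edges$parent == term]
}

# Synonyms of a term; returns character(0) when the term is not in the table.
custom_synonyms <- function(term, synonym_table) {
  hit <- synonym_table[[term]]
  if (is.null(hit)) character(0) else hit
}

custom_children("rtw", edges)
custom_synonyms("rtw", synonym_table)
```

A flat parent/child table like this is probably the simplest structure that supports both lookups, and it may be a reasonable starting point for feeding the same data into a more general hierarchy package later.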
