I want to use GermaNet to lemmatize a list of words (corresponding to getLemma() in WordNet). The words are actually DTM terms, and the goal is to improve text classification performance. However, I couldn't find any pointers or an R package for GermaNet. Is it still possible to use it from R?

Thomas
alex
  • According to Prof. Ingo Feinerer (tm-package co-developer), there is no actual GermaNet version for R (RWeka). – alex Mar 26 '14 at 16:49

1 Answer

I assume you have access to the raw files where the wordnet data is stored (GermaNet seems to offer a free license). You could parse them (simply using some nifty regular expressions) and extract the information you need (I don't know exactly what a DTM is, but I suppose it has something to do with synsets or the links between them). A wordnet (not a German one) I worked on was organized in multiple files, some containing the links and some containing information in a form like

0 @1@ WORD_MEANING
  1 PART_OF_SPEECH "v"
  1 VARIANTS
    2 LITERAL "someverb"
      3 SENSE 7
      3 DEFINITION "adefinition"
      3 EXAMPLES
        4 EXAMPLE "anexample"
      3 EXTERNAL_INFO
...

That shouldn't be too hard to parse.
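
As a rough illustration of that idea, here is a minimal sketch that parses the level-prefixed layout shown above with one regular expression and a stack. It is written in Python for brevity; the same regex approach should carry over to R. The function names (`parse_records`, `find_all`) are made up for this example, and the layout is the one from the other wordnet I mentioned, so GermaNet's own files may well look different and need a different pattern.

```python
import re

# One record line: nesting level, optional @id@, upper-case tag, optional value.
LINE_RE = re.compile(r'^\s*(\d+)\s+(?:@(\d+)@\s+)?([A-Z_]+)(?:\s+(.*?))?\s*$')


def parse_records(text):
    """Parse level-prefixed lines into a list of nested dict nodes.

    Assumes each line is at most one level deeper than the line before it,
    as in the sample layout.
    """
    roots = []
    stack = []  # stack[i] is the most recently seen node at level i
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match:
            continue  # skip blank lines and anything that does not fit the pattern
        level = int(match.group(1))
        node = {
            "tag": match.group(3),
            "id": match.group(2),  # None unless the line carries an @id@
            "value": (match.group(4) or "").strip('"') or None,
            "children": [],
        }
        del stack[level:]  # forget nodes at this depth or deeper
        if stack:
            stack[-1]["children"].append(node)
        else:
            roots.append(node)
        stack.append(node)
    return roots


def find_all(node, tag):
    """Yield the value of every descendant of `node` that has the given tag."""
    for child in node["children"]:
        if child["tag"] == tag:
            yield child["value"]
        yield from find_all(child, tag)


SAMPLE = '''\
0 @1@ WORD_MEANING
  1 PART_OF_SPEECH "v"
  1 VARIANTS
    2 LITERAL "someverb"
      3 SENSE 7
      3 DEFINITION "adefinition"
      3 EXAMPLES
        4 EXAMPLE "anexample"
'''

records = parse_records(SAMPLE)
literals = list(find_all(records[0], "LITERAL"))
print(literals)  # ['someverb']
```

Values are kept as strings (so the sense number comes back as "7"), and lines that do not match the pattern, such as a trailing "...", are silently skipped.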

user3554004
  • DTM means [Document-Term Matrix](https://en.wikipedia.org/wiki/Document-term_matrix). Briefly, it is a big matrix in which the (many) text documents are arranged in rows, and each column holds the frequency of one specific word (out of all words used across the documents) in the row's document. Access to GermaNet is [here](http://www.sfs.uni-tuebingen.de/GermaNet/data_format.shtml). I'm rather weak at regular expressions and no longer in the field. However, [Ingo Feinerer](http://tiny.cc/10wg0x) told me (2014) that it would be nice to have an equivalent for R. I suppose it would also be popular. – alex Jul 11 '15 at 06:53
  • Now I see. I know what a doc-term matrix is, I just didn't make the connection in this context. But then what you need is a proper lemmatizer (parse the text before constructing the DTM). Looking up a stem/lemma for a text word from WN (particularly using just regular expressions) indeed wouldn't work very well; German morphology is too complex for that. 'German lemmatizer' gives plenty of hits on Google, so try some out. Good luck! – user3554004 Jul 12 '15 at 21:50
  • Thanks a lot :) Do you have any suggestion as to which lemmatiser (possibly a free one) would be most suitable? – alex Jul 19 '15 at 14:44
  • @alex I'm not personally familiar with the state of the art of NLP tools for German, but I'm sure that if you read a few papers you'll soon get an idea of which is the best (when somebody builds new software, they usually compare it to previous tools that do the same thing). Or just choose the one you find most compatible with your project (i.e., one that is free, is not online-only (some could be), and is easy to integrate with what you are doing). – user3554004 Jul 20 '15 at 12:46
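
To make the suggestion from the comments concrete (lemmatize first, then build the DTM), here is a small Python sketch. The lemma table is a toy stand-in invented for this example and is not taken from GermaNet or any real lemmatizer; `lemmatize` and `build_dtm` are likewise hypothetical names. In practice the lookup would be replaced by whichever German lemmatizer you settle on.

```python
from collections import Counter

# Toy lemma table, purely illustrative; a real lemmatizer would replace it.
LEMMAS = {"kinder": "kind", "kindes": "kind", "ging": "gehen", "geht": "gehen"}


def lemmatize(token):
    """Return the lemma for a lower-cased token, or the token itself if unknown."""
    return LEMMAS.get(token, token)


def build_dtm(documents):
    """Build a document-term matrix over lemmas.

    Rows are documents, columns are the sorted lemma vocabulary,
    and each cell is the lemma's frequency in that document.
    """
    counts = [
        Counter(lemmatize(token) for token in doc.lower().split())
        for doc in documents
    ]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix


docs = ["Kinder Kindes", "ging geht Kind"]
vocab, dtm = build_dtm(docs)
print(vocab)  # ['gehen', 'kind']
print(dtm)    # [[0, 2], [2, 1]]
```

Because the inflected forms are mapped to one lemma before counting, they share a single column instead of being spread over several sparse ones.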