3

I'm working on a project that already has a C++ base. I would like to have a plug-in for some natural language processing. I really like GATE, but I'm not sure if it's worth launching the JVM and splitting the project into C++ and Java portions. I noticed UIMA has a C++ framework. I have not tried it, but it seems to have fewer features than GATE.

Does anyone know of a better option than trying to wrap GATE somehow in C++ (e.g. a better NLP library in C++)? If I do wrap GATE in C++, what is the best way? SOA?

Thanks

User1

3 Answers

5

A list of resources for NLP (POS Taggers, NP chunking, Sequence models, Parsers...) in C++ and other languages by Christopher Manning. Another one in Wikipedia.

There's also the Boost page for string and text processing.

anno
1

Of course, it depends on exactly what you want to do.

GATE and UIMA are both frameworks for NLP, mostly designed around the idea of information management and extraction. It's not really fair to say GATE has more features than UIMA, since strictly speaking they are both only frameworks. However, GATE is bundled with ANNIE, which does have a lot of nice features that may be useful to you (again, depending on what you want to do). UIMA is bundled with the OpenNLP libraries, which mirror some, but not all, of these features, but they are written in Java and so would require loading the JVM.

You could find similar features to GATE/ANNIE or UIMA/OpenNLP using C++ libraries, but the nice thing about the two frameworks is that they are coherent and don't require a lot of 'glue code' to make individual libraries talk to each other.

What's the reason behind not wanting to wrap GATE in C++ code? I can appreciate that it would add to the complexity of the project, but if your worries are about performance/memory then the JVM may be the least of your worries. NLP tools tend to be very memory hungry: expect to give up half a gig for NER models, and more for a statistical parser.
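If you want to keep the decision reversible, one option is to hide the NLP backend behind a small C++ interface, so the rest of the project doesn't care whether the work is done by a JVM-hosted GATE pipeline, a remote service, or a native library. Below is a rough sketch of that idea, not anything taken from GATE or UIMA: `Annotation`, `Annotator`, `WhitespaceTokenizer` and `makeAnnotator` are all made-up names, and the whitespace tokenizer is only a stand-in for a real backend.

```cpp
#include <cctype>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// A labelled span of text, loosely modelled on the kind of annotation
// that frameworks such as GATE and UIMA produce. Hypothetical type.
struct Annotation {
    std::size_t begin;  // offset of the first character of the span
    std::size_t end;    // offset one past the last character
    std::string type;   // label, e.g. "Token"
};

// Hypothetical plug-in interface. The C++ core only ever sees this class;
// a JVM-backed or service-backed implementation would derive from it too.
class Annotator {
public:
    virtual ~Annotator() = default;
    virtual std::vector<Annotation> annotate(const std::string& text) = 0;
};

// Stand-in native backend: marks every run of non-whitespace as a "Token".
class WhitespaceTokenizer : public Annotator {
public:
    std::vector<Annotation> annotate(const std::string& text) override {
        std::vector<Annotation> out;
        std::size_t i = 0;
        while (i < text.size()) {
            // Skip any whitespace before the next token.
            while (i < text.size() &&
                   std::isspace(static_cast<unsigned char>(text[i]))) {
                ++i;
            }
            const std::size_t start = i;
            // Consume the token itself.
            while (i < text.size() &&
                   !std::isspace(static_cast<unsigned char>(text[i]))) {
                ++i;
            }
            if (i > start) {
                out.push_back({start, i, "Token"});
            }
        }
        return out;
    }
};

// Single place where the backend is chosen. Swapping in a different
// implementation later would only touch this function.
std::unique_ptr<Annotator> makeAnnotator() {
    return std::unique_ptr<Annotator>(new WhitespaceTokenizer());
}
```

Calling `makeAnnotator()->annotate("some text")` returns one `Token` annotation per word. If you later decide the JVM is worth it, you would add another `Annotator` subclass that forwards to GATE and return that from `makeAnnotator` instead.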

Stompchicken
  • I'm an NLP newbie so I appreciate your insights! My concerns about Java are half memory/speed and half complexity of the project by adding more languages/compilers/etc. Do you know if UIMA in C++ is less of a resource hog than GATE? Is there a noticeable difference (20% or more in CPU time or RAM consumption)? – User1 Oct 21 '09 at 13:37
  • Sorry, I've never used the C++ version. Most of the best NLP libraries are written in Java for some reason. – Stompchicken Oct 21 '09 at 14:00
1

Maybe you would like to take a look at NLP++, a programming language tailored for Natural Language Processing and Text Analytics.

I recommend starting here:

Getting Started Package for NLP++

This package contains everything you need to get started with NLP++. Yes, you have to learn a new programming language, but it is similar to C++ and you don't have to use a black-box API. Furthermore, a compiled text analyzer in VisualText creates a Visual Studio solution which you can include in your other C++ projects.

You can use VisualText and NLP++ for free for non-commercial projects.

Join the NLP++ Community to ask questions, discuss your analyzers, and learn more about NLP++:

NLP++ Community

Kind regards,

Dominik Holenstein

NLP++ Community Manager