10

I am looking for a C/C++ alternative to the Apache Tika framework, which is Java-based. Specifically, I need file metadata and structured text extraction under one framework. After some searching online, the closest things I have found are GNU libextractor and a collection of individual file filters that parse documents to extract text (pdftotext, xls2csv, etc.).

Can anyone please recommend a good library comparable to Apache Tika?

Thanks

Nik

2 Answers

2

KDE provides a library called KFileMetaData which they internally use for their file indexer.

It is written in C++ with Qt5 and supports most of the common formats, such as MS Office 2007, ODF, PDF, images, video, audio and ebooks.

Vishesh Handa
1

Tika has a network server mode, so you could always start Tika in that mode and then send it requests from your C++ code.

Alternately, Tika has a CLI mode, so you could fire off a new Tika process each time and read the data from the pipe.
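
As a rough illustration of the CLI approach, the sketch below starts a process with POSIX `popen()` and reads its standard output from the pipe. It is only a sketch under stated assumptions: the helper names (`run_and_capture`, `shell_quote`, `tika_extract_text`) are made up for this example, the path to the `tika-app` jar is whatever your installation uses, and it assumes `java` is on the `PATH` and that the `--text` option of the Tika app prints plain text to stdout.

```cpp
#include <array>
#include <cstdio>
#include <stdexcept>
#include <string>

// Run a shell command and return everything it writes to stdout.
// Relies on POSIX popen()/pclose(); throws if the process cannot be
// started or finishes with a non-zero status.
std::string run_and_capture(const std::string& command) {
    FILE* pipe = popen(command.c_str(), "r");
    if (!pipe) {
        throw std::runtime_error("popen() failed for: " + command);
    }
    std::string output;
    std::array<char, 4096> buffer;
    std::size_t n = 0;
    while ((n = std::fread(buffer.data(), 1, buffer.size(), pipe)) > 0) {
        output.append(buffer.data(), n);
    }
    if (pclose(pipe) != 0) {
        throw std::runtime_error("command failed: " + command);
    }
    return output;
}

// Wrap an argument in single quotes so the shell passes it through
// unchanged (embedded single quotes become '\'').
std::string shell_quote(const std::string& arg) {
    std::string quoted = "'";
    for (char c : arg) {
        if (c == '\'') {
            quoted += "'\\''";
        } else {
            quoted += c;
        }
    }
    quoted += "'";
    return quoted;
}

// Hypothetical wrapper: jar_path is an assumption about where the
// tika-app jar lives on your machine.
std::string tika_extract_text(const std::string& jar_path,
                              const std::string& document) {
    return run_and_capture("java -jar " + shell_quote(jar_path) +
                           " --text " + shell_quote(document));
}
```

A call such as `tika_extract_text("/path/to/tika-app.jar", "report.pdf")` would then return the extracted text as a string. Starting a JVM per document is slow, so this is probably only reasonable for small batches; for larger volumes the server mode is likely the better fit.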

Gagravarr
  • This is a nice idea in theory, but has it ever been documented? Understanding the server mode may require some digging through code and discussion groups. Documentation seems to be a bit of a problem on the Tika project, which is unfortunate, because it looks to be a comprehensive tool. – Jason Jun 29 '12 at 23:10
  • Probably only documented in code for now, as it's under active development. If you're interested, best bet is to ask on the mailing list, that might prod one of the committers who look after it to write up some docs :) – Gagravarr Jun 29 '12 at 23:17
  • For anyone coming to this in future, the question [has now been asked on the Tika users list](http://mail-archives.apache.org/mod_mbox/tika-user/201206.mbox/%3C4FEF52DA.7070908%40consil.co.uk%3E) - long term that thread will hopefully contain the right answer! – Gagravarr Jul 01 '12 at 00:26
  • That was me - I'll follow it through, and if I need to write up some docs, will link it back to here also. Thanks for linking. It makes sense that questions asked in lots of places ultimately lead to the answer *somewhere*. – Jason Jul 01 '12 at 08:21