Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika can identify more than 1,400 file types from the Internet Assigned Numbers Authority (IANA) taxonomy of MIME types.

For most of the common formats, Tika also provides content extraction, metadata extraction, and language identification.

While Tika is written in Java, it is widely used from other languages. Its RESTful server and command-line tool let non-Java programs access Tika's functionality.
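
As a rough illustration of the RESTful route, the sketch below sends a document to a running Tika server and prints the extracted plain text. It assumes the server listens on its default port 9998 on localhost and exposes text extraction at `/tika` via `PUT`; the class name `TikaServerRequest` and the helper `buildExtractRequest` are made up for this example. It is written in Java only for consistency with the rest of this page; the same request can be issued from any language with an HTTP client.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

public class TikaServerRequest {

    // Hypothetical helper: builds a PUT request to the server's /tika endpoint
    // and asks for the extracted content as plain text.
    static HttpRequest buildExtractRequest(URI serverBase, byte[] document) {
        return HttpRequest.newBuilder(serverBase.resolve("/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length != 1) {
            System.err.println("usage: java TikaServerRequest <file>");
            return;
        }
        // Assumption: a Tika server is already running locally on port 9998.
        URI server = URI.create("http://localhost:9998");
        byte[] document = Files.readAllBytes(Path.of(args[0]));

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildExtractRequest(server, document),
                      HttpResponse.BodyHandlers.ofString());

        // The response body holds the text the server extracted from the document.
        System.out.println(response.body());
    }
}
```

Keeping the request construction in its own method makes it possible to inspect the method, URI, and headers without a server running.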

1283 questions
5
votes
1 answer

Apache Tika: remove extra line breaks in result string

I have an HTML file:
Test message.
 
More content here...
 
Best regards,
Mr.…
hard-code
5
votes
2 answers

Is Apache Tika able to extract foreign languages like Chinese, Japanese?

Is Apache Tika able to extract foreign languages like Chinese, Japanese? I have the following code: Detector detector = new DefaultDetector(); Parser parser = new AutoDetectParser(detector); InputStream stream = new…
user2182833
5
votes
4 answers

How to parse HTML with Nutch and index a specific tag to Solr?

I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with Nutch's parse meta tags plugin (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know whether there is any way…
Amir
5
votes
5 answers

Textual content without metadata from Tika via SolrCell

Using Solr 3.6 and the ExtractingRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field, minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text…
Peaeater
4
votes
2 answers

Get filename from byte array

We can extract the MIME type from a byte array, e.g., by using Apache Tika. Is it possible to get the filename from a byte array?
Tapas Bose
4
votes
0 answers

Apache Tika: Parsing only metadata without content extraction

I'm using Apache Tika for extracting metadata from documents. I'm mostly interested in setting up a basic Dublin Core record, like Author, Title, Date, etc. I'm not interested in the content of the documents at all. Currently I'm simply doing the usual…
pokita
4
votes
3 answers

Tika: retrieve main content from docs

The GUI utility of Apache Tika provides an option for getting the main content (apart from formatted text and structured text) of a given document or URL. I just want to know which method is responsible for extracting the main content of the docs/URL.…
CrazyCoder
4
votes
2 answers

ExtractingRequestHandler - how do you post multi-valued literal fields?

I'm trying to post a literal, multi-valued field along with a PDF extract. Only one of the field values seems to be added to the index. Does this need to be passed in a different way? Currently sending equivalent of (via POST…
paulusm
4
votes
1 answer

Why is my Tika Metadata object not being populated when using ForkParser?

ForkParser is a new Tika parser that was introduced in Tika version 0.9, located in org.apache.tika.fork. The new parser forks off a new JVM process to analyze the passed file stream. I figured this may be a good way to constrain how much memory…
4
votes
1 answer

Getting the ExtractingRequestHandler to work in Solr

I am attempting to get Solr to work with Tika so I can index Word and PDF documents in my Drupal web site. I've looked at the Wiki page and this page and they indicate adding a requestHandler in solrconfig.xml. I did that and now Solr throws an…
John81
4
votes
2 answers

Searching attachments from a Rails app (Word, PDF, Excel etc)

My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is that there is a search engine, which will be indexing roughly 2,000 documents which are a mixture…
Mike
4
votes
1 answer

Custom XPath expression with Tika

I am trying to build a custom XPath ContentHandler for Tika that recognizes complex XPath expressions, using code from org/apache/tika/sax/BodyContentHandler.java (because I am using Tika for other stuff). This XPath…
surajz
4
votes
1 answer

java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.read with Tika (detect method)

Here is my method: public String retrieveMimeType(InputStream stream, String filename) throws Exception { TikaConfig config = TikaConfig.getDefaultConfig(); Detector detector = config.getDetector(); TikaInputStream…
Zahreddine Laidi
4
votes
1 answer

What is the formatting of Solr Cell/Tika output? And how to fix it?

I am using Solr to index DOC, DOCX and PDF files. I enabled "stored" for the text field and checked the result. Here's the output from a sample DOC file: , a mobile user interface (UI) software development company, based in Cambridge, UK. After…
Jesvin Jose
4
votes
1 answer

Apache Tika: Parsing a text file omits last part?

I am trying to parse a plain text file using Tika but getting inconsistent behavior. More specifically, I have defined a simple handler as follows: public class MyHandler extends DefaultHandler { @Override public void characters(char ch[],…
PNS