Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.
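
As a rough illustration of what identifying a file type from its content involves, the sketch below checks a few well-known leading byte signatures using only the JDK. It is a simplified stand-in, not Tika's implementation or API: the class `MagicSniffer` and its methods are hypothetical names, and real detection typically combines far more signatures with other hints such as the file name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class MagicSniffer {
    // Well-known leading byte signatures ("magic numbers").
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG = {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};
    private static final byte[] ZIP = {'P', 'K', 0x03, 0x04};

    // True if data begins with exactly the bytes in magic.
    static boolean startsWith(byte[] data, byte[] magic) {
        return data.length >= magic.length
                && Arrays.equals(Arrays.copyOf(data, magic.length), magic);
    }

    // Returns a MIME type for the recognized signatures, or a generic fallback.
    static String sniff(byte[] data) {
        if (startsWith(data, PDF)) return "application/pdf";
        if (startsWith(data, PNG)) return "image/png";
        if (startsWith(data, ZIP)) return "application/zip";
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.7\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(sample));
    }
}
```

The fallback to `application/octet-stream` keeps the sketch total: unknown or empty input yields a generic type rather than an error.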

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI tool let non-Java programs access Tika's functionality.
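
A minimal sketch of what talking to the RESTful server can look like, using only the JDK's HTTP client types (Java 11+). It assumes a Tika server is running locally on its default port 9998 and that plain-text extraction is requested with a PUT to `/tika`; the class and method names are hypothetical, and the example only builds the request so that it runs without a server.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Apache Tika.
final class TikaServerRequest {
    // Assumption: a Tika server listening locally on its default port.
    static final String BASE_URL = "http://localhost:9998";

    // Builds (but does not send) a request asking the server for plain text.
    static HttpRequest textExtractionRequest(byte[] document) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        byte[] document = "Test message.".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = textExtractionRequest(document);
        System.out.println(request.method() + " " + request.uri());
        // Sending it requires a running server, for example:
        // HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    }
}
```

Because the exchange is plain HTTP, the same request can be issued from any language with an HTTP client, which is what makes the server usable from non-Java programs.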

1283 questions

1 answer

Apache Tika: remove extra line breaks in result string

I have an HTML file:

Test message.

Is Apache Tika able to extract foreign languages like Chinese, Japanese?

Is Apache Tika able to extract foreign languages like Chinese, Japanese? I have the following code: Detector detector = new DefaultDetector(); Parser parser = new AutoDetectParser(detector); InputStream stream = new…

apache apache-tika

asked Mar 26 '13 at 13:58

user2182833

4 answers

How to parse HTML with Nutch and index a specific tag to Solr?

I have installed Nutch and Solr to crawl a website and search it. As you know, we can index the meta tags of web pages into Solr with the parse meta tags plugin of Nutch (http://wiki.apache.org/nutch/IndexMetatags). Now I want to know, is there any way…

solr nutch apache-tika

asked Sep 09 '12 at 12:15

Amir

5 answers

Textual content without metadata from Tika via SolrCell

Using Solr 3.6 and the ExtractingRequestHandler (aka Tika), is it possible to map just the textual content (of a PDF) to a field, minus the metadata? The "content" field produced by Tika unfortunately contains all the metadata munged in with the text…

solr apache-tika solr-cell

asked Jun 04 '12 at 21:43

Peaeater

2 answers

Get Filename from Byte Array

We can extract the MIME type from a byte array, e.g., by using Apache Tika. Is it possible to get the filename from a byte array?

java arrays filenames apache-tika

asked Mar 27 '12 at 07:14

Tapas Bose

0 answers

Apache Tika: Parsing only metadata without content extraction

I'm using Apache Tika to extract metadata from documents. I'm mostly interested in setting up basic Dublin Core fields, like Author, Title, Date, etc. I'm not interested in the content of the documents at all. Currently I'm simply doing the usual…

java metadata apache-tika

asked Feb 08 '12 at 10:43

pokita

3 answers

Tika - retrieve main content from docs

The GUI utility of Apache Tika provides an option for getting the main content (apart from formatted text and structured text) of a given document or URL. I just want to know which method is responsible for extracting the main content of the docs/URL.…

java apache-tika

asked Feb 07 '12 at 08:26

CrazyCoder

2 answers

ExtractingRequestHandler - how do you post multi-valued literal fields?

I'm trying to post a literal, multi-valued field along with a PDF extract. Only one of the field values seems to be added to the index. Does this need to be passed in a different way? Currently sending equivalent of (via POST…

solr apache-tika solr-cell

asked Dec 15 '11 at 17:07

paulusm

1 answer

Why is my Tika Metadata object not being populated when using ForkParser?

ForkParser is a new Tika parser that was introduced in Tika version 0.9, located in org.apache.tika.fork. The new parser forks off a new JVM process to analyze the passed file stream. I figured this may be a good way to constrain how much memory…

java memory-management metadata content-type apache-tika

asked Dec 01 '11 at 23:35

anchovie

1 answer

Getting the ExtractingRequestHandler to work in Solr

I am attempting to get Solr to work with Tika so I can index Word and PDF documents in my Drupal web site. I've looked at the Wiki page and this page and they indicate adding a requestHandler in solrconfig.xml. I did that and now Solr throws an…

drupal solr apache-tika solr-cell

asked Oct 27 '11 at 15:56

John81

2 answers

Searching attachments from a Rails app (Word, PDF, Excel etc)

My first post to Stack Overflow so be gentle please! I am about to start a new Ruby on Rails (3.1) project for a client. One of their requirements is that there is a search engine, which will be indexing roughly 2,000 documents which are a mixture…

ruby-on-rails search attachment apache-tika

asked Oct 12 '11 at 11:14

Mike

1 answer

Custom XPath expression with Tika

I am trying to build a custom XPath ContentHandler for Tika that recognizes complex XPath expressions, using code from org/apache/tika/sax/BodyContentHandler.java (because I am using Tika for other stuff). This XPath…

apache-tika

asked Aug 23 '11 at 20:15

surajz

1 answer

java.lang.NoSuchMethodError: org.apache.commons.io.IOUtils.read with Tika (detect method)

Here is my method: public String retrieveMimeType(InputStream stream, String filename) throws Exception { TikaConfig config = TikaConfig.getDefaultConfig(); Detector detector = config.getDetector(); TikaInputStream…

java apache-tika

asked Dec 01 '21 at 14:28

Zahreddine Laidi

1 answer

What is the formatting of Solr Cell/Tika output? And how to fix it?

I am using Solr to index DOC, DOCX and PDF files. I had enabled stored for the text and I checked it out. Here's the result from a sample DOC file: , a mobile user interface (UI) software development company, based in Cambridge, UK. After…

lucene solr apache-tika

asked Jul 20 '11 at 17:21

Jesvin Jose

1 answer

Apache Tika: Parsing a text file omits last part?

I am trying to parse a plain text file using Tika but getting inconsistent behavior. More specifically, I have defined a simple handler as follows: public class MyHandler extends DefaultHandler { @Override public void characters(char ch[],…

java apache apache-tika

asked Jul 07 '11 at 20:25

PNS
