Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
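As a rough illustration of the REST route mentioned above, here is a minimal sketch that only builds (and does not send) an HTTP request for a tika-server text-extraction call. It assumes a server listening on `http://localhost:9998` (the server's default port) and uses the `/tika` endpoint with an `Accept: text/plain` header; the class and method names are hypothetical, and the same request could be issued from any language with an HTTP client.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

public class TikaServerRequestSketch {

    // Hypothetical helper: builds a PUT request that uploads the raw document
    // bytes to the /tika endpoint and asks for plain text in the response.
    // Assumption: baseUrl points at a running tika-server instance.
    public static HttpRequest buildExtractTextRequest(String baseUrl, byte[] document) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/tika"))
                .header("Accept", "text/plain")
                .PUT(HttpRequest.BodyPublishers.ofByteArray(document))
                .build();
    }

    public static void main(String[] args) {
        // Placeholder payload; a real call would read the bytes of a file.
        byte[] sample = "hello".getBytes(StandardCharsets.UTF_8);
        HttpRequest request = buildExtractTextRequest("http://localhost:9998", sample);
        // Only the request line is printed here; nothing is sent over the network.
        System.out.println(request.method() + " " + request.uri());
    }
}
```

To actually send the request, pass it to `java.net.http.HttpClient#send` with `HttpResponse.BodyHandlers.ofString()` while a server is running.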

1283 questions

0 answers

How to limit amount of extracted text with Tika server?

In my scenario, I have some large PDF files and would like to limit the amount of text extracted and returned by tika-server. I know it's possible using the Java library directly. However, how can I do this when making HTTP requests to tika-server /tika…

apache-tika tika-server

asked Jan 05 '17 at 23:24

Eugene Shvets

2 answers

Tika detects docx file as Zip

I have the following test code to detect docx content type: @Test public void testContentTypeOfaWordDOCXFileIsReturnedCorrectlyByTheServer() throws IOException, TikaException { File docxFile = new File(FILE_COMPLETE_PATH); …

java apache-tika

asked Aug 23 '16 at 16:31

qartal

0 answers

Apache Tika - Parsing and extracting only metadata without reading content

Is there a way to configure Apache Tika so that it extracts only the metadata properties from a file and does not access the file's content? We need this to avoid reading the entire content of larger files. The code…

metadata apache-tika

asked Jun 15 '16 at 02:55

Venki

1 answer

How to access all the PDF metadata using pdfbox

I have some simple Java code that uses the Tika library to get the metadata of a PDF file, and it lists the metadata below. Tika code: Metadata metadata = new Metadata(); tika.parse(file, metadata); String[] metadataNames = metadata.names(); for (String…

java pdfbox apache-tika

asked May 03 '16 at 23:36

Learner

0 answers

Unexpected RuntimeException from tika

I'm trying to extract the content of a large dataset that contains a mix of files (pdf, doc, ppt). I'm using tika-app-1.12.jar. When I run my code everything works at first, and then I get this error: Exception in thread "main"…

java apache parsing apache-tika

asked Mar 15 '16 at 13:36

Abeer zaroor

0 answers

How to configure Tesseract language for TikaEntityProcessor in Solr

I have a Solr core, and I use TikaEntityProcessor in my DataImportHandler. I have Tesseract installed, and Tika can extract text from images, but the default language is English. Here is the Tika part of my data-import-handler.xml file

solr tesseract apache-tika

asked Feb 26 '16 at 13:11

Veysel Ozdemir

1 answer

How to extract hyperlinks from office documents using tika

I'm using Apache Tika to extract raw text from various document formats, including Office. When extracting text from Word documents that include hyperlinks, only the text is extracted and the information about the hyperlink is lost. Is there a…

hyperlink ms-office extract apache-tika

asked Nov 11 '15 at 13:37

Matthias

0 answers

Apache Tika : How to use XPath queries

I am parsing an XML file using Apache Tika. I would like to extract certain tags with their content from the XML and store them in a HashMap. Right now, I can extract the entire content of the XML, but the tags are lost. //detecting the file type …

java xml xpath apache-tika

asked Nov 09 '15 at 15:21

AbtPst

0 answers

Lucene 4 - How to discard numeric terms in index?

I'm using Apache Tika to parse an XML document before indexing it with Apache Lucene. This is the Tika part: BodyContentHandler handler = new BodyContentHandler(10*1024*1024); Metadata metadata = new Metadata(); FileInputStream inputstream = new…

java lucene apache-tika standardanalyzer

asked Feb 10 '15 at 12:09

tommy

2 answers

Extract Images from PDF with Apache Tika

Apache Tika 1.6 has the ability to extract inline images from PDF documents. However, I've been struggling to get it to work. My use case is that I want some code that will extract the content and separately the images from any documents (not…

image pdf apache-tika

asked Sep 11 '14 at 08:58

James Baker

4 answers

How to parse large text file with Apache Tika 1.5?

Problem: For my test, I want to extract text data from a 335 MB text file, Wikipedia's "pagecounts-20140701-060000.txt", with Apache Tika. My solution: I tried to use TikaInputStream since it provides buffering, then I tried to use…

java out-of-memory apache-tika

asked Jul 03 '14 at 08:42

minerals

1 answer

Get embedded resources in doc files using Apache Tika

I have MS Word documents containing text and images. I want to parse them into an XML structure. After some research I ended up using Apache Tika to convert my documents. I can parse my doc to XML. Here is my code: AutoDetectParser…

java apache-tika

asked Nov 24 '13 at 08:08

Mohamad Ghafourian

0 answers

How to parse an RTF document using Apache Tika in Java

I am parsing a document that contains RTF content using Apache Tika, but it throws an exception and does not return the contents of the document. Here is a piece of code: public String contentEx(File f) throws IOException, SAXException, …

java apache parsing rtf apache-tika

asked Aug 02 '13 at 05:05

Rahul Kulhari

2 answers

How can I specify encoding when parsing text with Apache TIKA?

The question is pretty self-explanatory. The problem I am facing is that every Tika example I found online uses a StringWriter, as shown below. If I could somehow make this use an OutputStreamWriter, I could specify the encoding no problem...…

java parsing apache-tika

asked Jun 28 '13 at 00:32

superdemongob

1 answer

HTML Formatted Cell value from Excel using Apache POI

I am using Apache POI to read an Excel document. So far it has served my purpose, but one thing where I am getting stuck is extracting the value of a cell as HTML. I have one cell in which the user will enter some string and…

java html excel apache-poi apache-tika

asked May 17 '13 at 13:57

AngelsandDemons
