Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI Tool permit non-Java programs to access the Tika functionality.
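As a rough illustration of the non-Java route, the sketch below shows how a Python program might ask a running Tika server for the plain text of a document. It assumes a server is already listening on localhost at port 9998 and that the `/tika` endpoint accepts a PUT with an `Accept: text/plain` header; the function names `build_text_request` and `extract_text` are hypothetical helpers, not part of Tika, and decoding the response as UTF-8 is also an assumption.

```python
import urllib.request

# Assumption: a Tika server is listening locally on port 9998.
TIKA_URL = "http://localhost:9998"


def build_text_request(document: bytes, base_url: str = TIKA_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request asking the server for plain text.

    Hypothetical helper: only the /tika path, the PUT method and the
    Accept header are meant to mirror the Tika server interface.
    """
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=document,
        method="PUT",
        headers={"Accept": "text/plain"},
    )


def extract_text(document: bytes, base_url: str = TIKA_URL) -> str:
    """Send the document to the server and return the extracted text.

    Assumption: the response body is UTF-8 encoded.
    """
    request = build_text_request(document, base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a server running, reading a file in binary mode and passing its bytes to `extract_text` would return the extracted text. Building the request separately from sending it keeps the HTTP details easy to inspect without a live server.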

1283 questions

1 answer

Spring & Tika integration: is my approach thread-safe?

I'm interested in Spring & Apache Tika integration. Is this approach thread-safe? Can I safely call the detect() method from different threads? Are there any Spring-Tika integration patterns? Thanks in…

spring thread-safety apache-tika

asked Apr 17 '12 at 12:11

Maciej Ziarko

3 answers

Is it possible to extract text by page for word/pdf files using Apache Tika?

All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?

text apache-tika

asked Apr 28 '11 at 20:53

Asif Sheikh

1 answer

Apache Tika maxStringLength reached

I have thousands of PDF documents that are 11-15 MB each. My program says that my document contains more than 100k characters. Error output: Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your…

java apache parsing apache-tika

asked Feb 21 '16 at 22:17

Alican Balik

0 answers

How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

I am attempting to parse tens of millions of office documents with Tika: PDFs, Word documents, Excel spreadsheets, XML files, etc., a wide assortment of types. Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but at the same time,…

java kubernetes apache-tika tika-server

asked Nov 22 '20 at 05:27

Nicholas DiPiazza

2 answers

C/C++ alternative to Apache Tika

I am looking for a C/C++ alternative to the Apache Tika framework, which is Java-based. Specifically, I am searching for file metadata and structured text extraction all under one framework. After some online searching and browsing the closest thing I…

java c++ full-text-search metadata apache-tika

asked Jun 03 '11 at 22:11

Nik

2 answers

How to fix "Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed"

I am setting up a Java project where I use PDFBox to get images out of PDFs. Since I am using tika-app for my other functions, I decided to go with the PDFBox bundled inside tika-app-1.20.jar. I have tried including the jai-imageio-core-1.3.1.jar…

java pdfbox apache-tika jai

asked Aug 29 '19 at 10:01

Santhosh

2 answers

Stopping a Tika server properly

To start a Tika server that can be accessed from hosts other than localhost, we know that the way to go is (say I have version 1.7 and want to run on port 9998): java -jar tika-server-1.7-SNAPSHOT.jar -host 0.0.0.0 My question is: Is there a…

java apache-tika

asked Sep 02 '14 at 21:59

pebox11

1 answer

Tika AutoDetectParser returning empty string?

I'm attempting to use Tika's AutoDetectParser to pull a file's content. I originally thought this was a dependency issue but cannot fathom how that could still be true now that I'm including all of tika-app in my jar. AutoDetectParser returns…

java ant apache-tika

asked Dec 21 '15 at 20:04

Pat

1 answer

Font issue on Ubuntu machine in parsing PDF File

I have an application on my Ubuntu 14.04.x machine that does text mining on PDF files. I suspect that it uses Apache Tika or something similar. The problem is that, while it reads files, I get the following warning: 2015-09-10 14:15:35…

java ubuntu-14.04 text-mining apache-tika

asked Sep 10 '15 at 18:24

MaatDeamon

1 answer

Apache Tika extract scanned PDF files

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned pieces of paper, which means each page is just an image. My goal is to extract the text of the PDF files anyway. My Tesseract is set up correctly…

java pdf ocr tesseract apache-tika

asked Sep 02 '15 at 13:13

LorisBachert

4 answers

How to add a new MIME type to Apache Tika

This is my class for reading MIME types. I am trying to add a new MIME type (properties file) and read it. This is my class file: /* * To change this license header, choose License Headers in Project Properties. * To change this template file,…

java apache-tika

asked Jun 17 '15 at 15:19

kittu

4 answers

Is it possible to extract table information using Apache Tika?

I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I saw Apache Tika. I am able to extract full text from any of these file formats. But my…

java apache-tika

asked Nov 22 '12 at 16:48

rajesh

4 answers

How do I configure the pom.xml of Tika to stop getting all the license dependency warnings?

I am getting all these warnings from Tika when I try to use it: Feb 24, 2018 9:24:35 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored See …

java maven pdfbox apache-tika

asked Feb 25 '18 at 04:16

jnbdz

2 answers

Paragraph Segmentation using Machine Learning

I have a large repository of documents in PDF format. The documents come from different sources, and have no one single style. I use Tika to extract the text from the documents, and now I'd like to segment the text into paragraphs. I can't use…

python machine-learning nlp apache-tika text-segmentation

asked Jan 23 '17 at 08:16

Gino

2 answers

Convert .docx to HTML using Java

I tried converting .doc to HTML by using WordToHtmlConverter and it worked perfectly. But when I tried to convert .docx to HTML, I got stuck. What I tried: I used the code below to convert .docx to HTML. The code I tried is from: How to…

java apache-tika

asked Jul 09 '14 at 11:51

Vignesh Paramasivam
