Questions tagged [apache-tika]

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries.

Tika can identify more than 1,400 file types from the Internet Assigned Numbers Authority (IANA) taxonomy of MIME types.

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities.

While Tika is written in Java, it is widely used from other languages. The RESTful server and CLI tool let non-Java programs access Tika's functionality.
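
As a rough illustration of reaching Tika from a non-Java program, here is a minimal Python sketch that talks to the RESTful server using only the standard library. It assumes a Tika server is already running on localhost at port 9998 and uses the server's `/tika` (text extraction) and `/detect/stream` (type detection) endpoints. The helper names `build_tika_request`, `extract_text` and `detect_type` are made up for this example and are not part of Tika.

```python
import urllib.request

# Assumption: a Tika server is running locally on port 9998.
TIKA_URL = "http://localhost:9998"


def build_tika_request(endpoint, payload, accept, base_url=TIKA_URL):
    """Build (but do not send) a PUT request for a Tika server endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + endpoint,
        data=payload,
        method="PUT",
        headers={
            "Accept": accept,
            # Generic content type, so the server inspects the bytes themselves.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(payload, base_url=TIKA_URL):
    """Send raw document bytes to /tika and return the extracted plain text."""
    request = build_tika_request("/tika", payload, "text/plain", base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


def detect_type(payload, base_url=TIKA_URL):
    """Send raw document bytes to /detect/stream and return the MIME type."""
    request = build_tika_request("/detect/stream", payload, "text/plain", base_url)
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8").strip()
```

With a server running, something like `extract_text(open("report.pdf", "rb").read())` should return the document's text (the file name is only a placeholder). Building the request separately from sending it keeps the HTTP details in one place and lets that part be checked without a live server.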

1283 questions
13 votes, 1 answer

Spring & Tika integration: is my approach thread-safe?

I'm interested in Spring & Apache Tika integration. Is this approach thread-safe? Can I safely call detect() method from different threads? Are there any Spring-Tika integration patterns? Thanks in…
Maciej Ziarko
11 votes, 3 answers

Is it possible to extract text by page for word/pdf files using Apache Tika?

All the documentation I can find seems to suggest I can only extract the entire file's content. But I need to extract pages individually. Do I need to write my own parser for that? Is there some obvious method that I am missing?
Asif Sheikh
11 votes, 1 answer

Apache Tika maxStringLength reached

I have thousands of PDF documents that are 11-15 MB each. My program says that my document contains more than 100k characters. Error output: Exception in thread "main" org.apache.tika.sax.WriteOutContentHandler$WriteLimitReachedException: Your…
Alican Balik
10 votes, 0 answers

How to configure Apache Tika in a kube environment to obtain maximum throughput when parsing a massive number of documents?

I am attempting to parse tens of millions of office documents with Tika: PDFs, Word documents, Excel spreadsheets, XML files, etc., a wide assortment of types. Throughput is very important. I need to be able to parse these files in a reasonable amount of time, but at the same time,…
Nicholas DiPiazza
10 votes, 2 answers

C/C++ alternative to Apache Tika

I am looking for a C/C++ alternative to the Apache Tika framework, which is Java-based. Specifically, I am searching for file metadata and structured text extraction, all under one framework. After some online searching and browsing, the closest thing I…
Nik
10 votes, 2 answers

How to fix "Cannot read JPEG2000 image: Java Advanced Imaging (JAI) Image I/O Tools are not installed"

I am setting up a Java project where I use PDFBox to get images out of PDFs. Since I am using tika-app for my other functions, I decided to go with the PDFBox present inside tika-app-1.20.jar. I have tried including the jai-imageio-core-1.3.1.jar…
Santhosh
10 votes, 2 answers

Stopping a Tika server properly

In order to start a Tika server that can be accessed from hosts other than localhost, we know that the way to go is (say I have version 1.7 and want to run on port 9998): java -jar tika-server-1.7-SNAPSHOT.jar -host 0.0.0.0. My question is: Is there a…
pebox11
9 votes, 1 answer

Tika AutoDetectParser returning empty string?

I'm attempting to use Tika's AutoDetectParser to pull a file's content. I originally thought this was a dependency issue, but I cannot fathom how that could still be true now that I'm including all of tika-app in my jar. AutoDetectParser returns…
Pat
9 votes, 1 answer

Font issue on Ubuntu machine in parsing PDF File

I have an application on my Ubuntu 14.04.x machine that does text mining on PDF files. I suspect that it is using Apache Tika, etc. The problem is that, during its reading process, I get the following warning: 2015-09-10 14:15:35…
MaatDeamon
9 votes, 1 answer

Apache Tika extract scanned PDF files

I'm having some trouble using Apache Tika (version 1.10). I have some PDF files that are just scanned pieces of paper, which means each page is just an image. My goal is to extract the text of the PDF files anyway. My Tesseract is set up correctly…
LorisBachert
9 votes, 4 answers

How to add new mime type to apache tika

This is my class for reading MIME types. I am trying to add a new MIME type (properties file) and read it. This is my class file: /* * To change this license header, choose License Headers in Project Properties. * To change this template file,…
kittu
9 votes, 4 answers

Is it possible to extract table information using Apache Tika?

I am looking for a parser for PDF and MS Office document formats to extract tabular information from files. I was thinking of writing separate implementations when I came across Apache Tika. I am able to extract full text from any of these file formats. But my…
rajesh
8 votes, 4 answers

How do I configure the pom.xml of Tika to stop getting all the license dependency warnings?

I am getting all these warnings from Tika when I try to use it: Feb 24, 2018 9:24:35 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem WARNING: JBIG2ImageReader not loaded. jbig2 files will be ignored See …
jnbdz
8 votes, 2 answers

Paragraph Segmentation using Machine Learning

I have a large repository of documents in PDF format. The documents come from different sources, and have no one single style. I use Tika to extract the text from the documents, and now I'd like to segment the text into paragraphs. I can't use…
Gino
8 votes, 2 answers

Convert .docx to HTML using JAVA

I tried converting .doc to HTML by using WordToHtmlConverter and it worked perfectly. But when I tried to convert .docx to HTML, I got stuck. What I tried: I used the code below to convert .docx to HTML. The code I tried is from: How to…
Vignesh Paramasivam