Questions tagged [tika-server]
90 questions
1
vote
1 answer
Tika with Grobid throwing error when parsing pdf document
I am trying to extract both document metadata and journal header metadata from a pdf document. I verified that Tika Server (v1.21 / v1.24) and Grobid (v0.6.0) are independently able to extract metadata from the pdf document. However, when I run…

Subramanyam Avlur
1
vote
1 answer
Tika Server - Parse without bookmark and image tags
I am extracting text with tika server v1.20.
Tika adds [bookmark: xx] and [image: xx] in the text. I don't want them.
Sample output:
How the Gifted Brain Learns
David A. Sousa
[image: How the Gifted Brain Learns]
Welcome to our Third Annual…
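One workaround that does not depend on any Tika configuration is to strip the markers from the text after extraction. This is a minimal sketch, assuming each marker sits on a single line in the `[bookmark: …]` / `[image: …]` form shown in the sample output; `strip_markers` is a hypothetical helper name, not part of Tika:

```python
import re

# Matches "[bookmark: ...]" or "[image: ...]" within one line (assumed format).
MARKER = re.compile(r"\[(?:bookmark|image):[^\]\n]*\]")


def strip_markers(text: str) -> str:
    """Remove marker tokens and drop lines that contained only markers."""
    kept = []
    for line in text.splitlines():
        cleaned = MARKER.sub("", line)
        # Keep lines that still have content, and lines that were blank to
        # begin with; drop lines that became empty because of the removal.
        if cleaned.strip() or not line.strip():
            kept.append(cleaned.rstrip())
    return "\n".join(kept)
```

Applied to the sample above, the `[image: How the Gifted Brain Learns]` line disappears and the surrounding lines are left unchanged. Text that legitimately contains the same bracketed pattern would also be removed, so this only suits documents where that cannot occur.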

Montoya
1
vote
0 answers
Run tika-python with Django in Docker
I have a Django site that parses PDFs using tika-python and stores the parsed content in an Elasticsearch index. It works fine on my local machine. I want to run this setup using Docker. However, tika-python does not work as it requires Java 8 to run…

Irfan Harun
1
vote
1 answer
Python Tika cannot read PDF - fail to download Tika Server
I am using Tika to read PDFs and my code was working until yesterday. Now when I run the same code I get errors, and apparently Tika can't find the Tika server jar file. I am using the following code to read the PDF:
import tika
from tika import…

Ali
1
vote
1 answer
Python tika parser error - Failed to receive startup confirmation from startServer
I am trying to use Tika in Python to parse PDF files. I am using Python 2.7 on a Mac and cannot get it to work. I installed it, then ran:
from tika import parser
raw = parser.from_file('...file')
I get this error (edited for brevity):
Retrieving…

bill999
1
vote
0 answers
Apache Tika REST-Server // Code 422 (Unprocessable Entity) for different states? -> How to distinguish?
The Apache Tika REST server returns status code 422 (Unprocessable Entity) for a password-protected PDF document. If the file format is unsupported, 422 is sent as well.
Unfortunately, it is not possible to distinguish whether the metadata of a file…

Oliver
1
vote
0 answers
TikaJAXRS PUT from Python client
Apache Tika should be accessible from a Python program via HTTP, but I can't get it to work.
I am using this command to run the server (with and without the two options at the end):
java -jar tika-server-1.17.jar --port 5677 -enableUnsecureFeatures…
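For reference, a PUT against the server's `/tika` endpoint can be sketched with the standard library alone. This assumes the server started by the command above is reachable at `http://localhost:5677`; the helper names `build_tika_request` and `extract_text` are hypothetical:

```python
import urllib.request


def build_tika_request(data: bytes,
                       base_url: str = "http://localhost:5677") -> urllib.request.Request:
    """Build a PUT request for the /tika endpoint, asking for plain text back."""
    return urllib.request.Request(
        url=f"{base_url}/tika",
        data=data,                         # raw document bytes as the request body
        method="PUT",
        headers={"Accept": "text/plain"},  # request extracted text rather than XHTML
    )


def extract_text(path: str, base_url: str = "http://localhost:5677") -> str:
    """Send the file at `path` to the server and return the response body."""
    with open(path, "rb") as fh:
        req = build_tika_request(fh.read(), base_url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(req, timeout=60) as resp:
        return resp.read().decode("utf-8")
```

Keeping request construction separate from sending makes it easy to inspect the method, URL, and headers without a running server.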

Roman Susi
1
vote
1 answer
What is the difference between Tika App, Tika Server, and the Java wrapper? Which one is used and when?
I want to use Apache Tika to process a large volume of big documents at enterprise scale. Which one should I use: Tika Server, Tika App, or direct Java calls? Can you suggest a system architecture (e.g. 3-4 load-balanced Tika servers on physically separate machines)?

ismail josh
1
vote
0 answers
Tell Tika not to parse XML
I would like to configure a Tika server that does not parse XML files.
I wrote the following config file:
…

mbl
0
votes
0 answers
Tika Docx Scanning for 2 MB file (Pure text docx file) taking more than 30 seconds
I am using Tika 2.6.x with the Java options -XX:MaxMetaspaceSize=200M -Xss512K -XX:MaxDirectMemorySize=64M for the code below. Processing time seems very high (around a minute) for a text-only .docx file of 2 MB or more.…

DeadPool
0
votes
0 answers
Tika Parser is treating .pptx text content as embedded image
I am using the Tika parser to validate the content of various file types such as .docx, .txt, .pptx, and many others. Even for a .pptx file containing only plain text, the Tika parser reports an embedded image in the file.…

DeadPool
0
votes
0 answers
Apache Tika returns 200 on broken PDFs
I'm using Apache Tika server in a Docker container to parse all sorts of files. I noticed that when sending a broken PDF to parse, Tika returned 200 and empty text. I added this line to my config.xml:
true
Which…

user20395797
0
votes
0 answers
Issue with apache Tika Extraction for Tabular Column Data in PDF
I extracted a PDF that has tabular column data using Apache Tika. In the result, the row data from different columns is merged together.
Before Extracting
| Column A | Column B |
| -------- | -------- |
| 1 | saikiran |
| 2 | pavan …
0
votes
0 answers
How to read images with Tika without a Tesseract installation
I am using the Tika parser and Tika core 2.x dependencies and want to read the characters inside images. Is there any way to achieve this with Tika without installing Tesseract?

DeadPool
0
votes
0 answers
How to extract all of the binary files recursively
I have a .docx file that contains a .pptx file that contains images.
I'm trying to figure out how to extract all of the binary files recursively, so
I will be able to get the .pptx but most importantly its images.
I saw there's "/unpack" endpoint but…
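As a point of comparison that does not use Tika at all: .docx and .pptx files are ZIP containers, so nested members can be walked with the standard library. This is a minimal sketch, assuming the embedded .pptx is stored as a ZIP member (typically under `word/embeddings/`) rather than wrapped in an OLE `.bin` object; `iter_binaries` is a hypothetical helper name:

```python
import io
import zipfile


def iter_binaries(data: bytes, prefix: str = ""):
    """Yield (path, bytes) for every file in a ZIP container, descending into
    members that are themselves ZIP containers (e.g. an embedded .pptx)."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            payload = archive.read(info)
            path = f"{prefix}{info.filename}"
            yield path, payload
            # Recurse when the member looks like a ZIP file itself.
            if zipfile.is_zipfile(io.BytesIO(payload)):
                yield from iter_binaries(payload, prefix=f"{path}!/")
```

Nested paths are joined with `!/`, so an image inside an embedded presentation would show up as something like `word/embeddings/<name>.pptx!/ppt/media/<image>`. Objects embedded in OLE format are yielded as opaque bytes and not opened by this sketch.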

Shay Barak