Questions tagged [tika-server]

90 questions
1
vote
1 answer

Tika with Grobid throwing error when parsing pdf document

I am trying to extract both document metadata and journal header metadata from a pdf document. I verified that Tika Server (v1.21 / v1.24) and Grobid (v0.6.0) are independently able to extract metadata from the pdf document. However, when I run…
1
vote
1 answer

Tika Server - Parse without bookmark and image tags

I am extracting text with tika server v1.20. Tika adds [bookmark: xx] and [image: xx] in the text. I don't want them. Sample output: How the Gifted Brain Learns David A. Sousa [image: How the Gifted Brain Learns] Welcome to our Third Annual…
Montoya
  • 2,819
  • 3
  • 37
  • 65
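One possible workaround for the question above, sketched as client-side post-processing rather than as a Tika setting: strip the markers from the returned text with a regular expression. The `strip_markers` helper and the pattern are hypothetical and assume the markers always take the `[bookmark: xx]` / `[image: xx]` shape shown in the sample output.

```python
import re

# Assumption: markers look like "[bookmark: ...]" or "[image: ...]" and never nest.
MARKER = re.compile(r"\[(?:bookmark|image):[^\]]*\]")


def strip_markers(text: str) -> str:
    """Remove the markers, then collapse the runs of spaces they leave behind."""
    without_markers = MARKER.sub("", text)
    return re.sub(r"[ \t]{2,}", " ", without_markers).strip()


sample = (
    "How the Gifted Brain Learns David A. Sousa "
    "[image: How the Gifted Brain Learns] Welcome to our Third Annual"
)
print(strip_markers(sample))
```

Text that legitimately contains a bracketed phrase starting with `bookmark:` or `image:` would also be removed, so this is only suitable when that cannot occur in the documents being parsed.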
1
vote
0 answers

Run tika-python with Django in Docker

I have a Django site that parses PDFs using tika-python and stores the parsed PDF content in an Elasticsearch index. It works fine on my local machine. I want to run this setup using Docker. However, tika-python does not work as it requires Java 8 to run…
Irfan Harun
  • 979
  • 2
  • 16
  • 37
1
vote
1 answer

Python Tika cannot read PDF - fail to download Tika Server

I am using Tika to read PDFs and my code was working until yesterday. Now when I run the same code I get errors, and apparently Tika can't find the Tika server jar file. I am using the following code to read the PDF: import tika from tika import…
Ali
  • 7,810
  • 12
  • 42
  • 65
1
vote
1 answer

Python tika parser error - Failed to receive startup confirmation from startServer

I am trying to use Tika in python to parse PDF files. I am using python 2.7 and a Mac. I cannot get it to work. I have installed it, then: from tika import parser raw = parser.from_file('...file') I get this error (edited for brevity): Retrieving…
bill999
  • 2,147
  • 8
  • 51
  • 103
1
vote
0 answers

Apache Tika REST-Server // Code 422 (Unprocessable Entity) for different states? -> How to distinguish?

The Apache Tika REST server returns status code 422 (Unprocessable Entity) for a password-protected PDF document. If the file format is unsupported, 422 is sent as well. Unfortunately, it is not possible to distinguish whether the metadata of a file…
Oliver
  • 11
  • 2
1
vote
0 answers

TikaJAXRS PUT from Python client

Apache Tika should be accessible from Python program via HTTP, but I can't get it to work. I am using this command to run the server (with and without the two options at the end): java -jar tika-server-1.17.jar --port 5677 -enableUnsecureFeatures…
Roman Susi
  • 4,135
  • 2
  • 32
  • 47
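For reference, a minimal sketch of what such a PUT could look like using only the Python standard library. It assumes the server from the question is listening on port 5677 of `localhost` and that the document is sent to the `/tika` resource; the `build_tika_request` and `extract_text` helpers are hypothetical names, not part of any Tika client.

```python
import urllib.request


def build_tika_request(payload: bytes, host: str = "localhost",
                       port: int = 5677) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the raw document bytes."""
    return urllib.request.Request(
        f"http://{host}:{port}/tika",
        data=payload,
        method="PUT",
        headers={
            "Accept": "text/plain",
            # Set explicitly so urllib does not fall back to its
            # form-encoded default content type for request bodies.
            "Content-Type": "application/octet-stream",
        },
    )


def extract_text(payload: bytes, timeout: float = 30.0) -> str:
    """Send the request; this needs a running server, so it is not called here."""
    with urllib.request.urlopen(build_tika_request(payload), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


req = build_tika_request(b"example bytes")
print(req.get_method(), req.full_url)
```

Building the request separately from sending it makes it possible to inspect the method, URL, and headers when the server rejects the call.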
1
vote
1 answer

What is the difference between Tika App, Tika Server and the Java wrapper? Which one is used, and when?

I want to use Apache Tika at enterprise scale, for a large number of very large documents. Which one should I use: Tika Server, Tika App, or Java calls? Can you suggest a system architecture (e.g. 3-4 load-balanced, physically separate Tika servers)?
1
vote
0 answers

Tell Tika not to parse XML

I would like to configure a Tika server that does not parse XML files. I wrote the following config file:
mbl
  • 101
  • 9
0
votes
0 answers

Tika Docx Scanning for 2 MB file (Pure text docx file) taking more than 30 seconds

I am using Tika 2.6.x with Java opts -XX:MaxMetaspaceSize=200M -Xss512K -XX:MaxDirectMemorySize=64M for the code below. Processing time seems very high (around a minute) for a pure-text docx file of 2 MB or larger.…
DeadPool
  • 40
  • 8
0
votes
0 answers

Tika Parser is treating .pptx text content as embedded image

I am using the Tika parser to validate the content of various file types such as .docx, .txt, .pptx and many others. Even for a pptx file containing only normal text, running the Tika parser on it reports an embedded image in the file.…
DeadPool
  • 40
  • 8
0
votes
0 answers

Apache Tika returns 200 on broken PDFs

I'm using Apache Tika server in a Docker container to parse all sorts of files. I noticed that when sending a broken PDF to parse, Tika returned 200 and empty text. I added this line to my config.xml: true Which…
0
votes
0 answers

Issue with apache Tika Extraction for Tabular Column Data in PDF

I extracted a PDF that has tabular column data using Apache Tika. In the result, the row data from different columns are getting merged.

Before Extracting
| Column A | Column B |
| -------- | -------- |
| 1 | saikiran |
| 2 | pavan …
0
votes
0 answers

How to read the images with Tika without using Tesseract Installation

I am using the tika-parser and tika-core 2.x dependencies and want to read the characters inside images. Is there any way to achieve this with Tika without installing Tesseract?
DeadPool
  • 40
  • 8
0
votes
0 answers

How to extract all of the binary files recursively

I have a .docx file that contains .pptx file that contains images. I'm trying to figure out how to extract all of the binary files recursively, so I will be able to get the .pptx but most importantly its images. I saw there's "/unpack" endpoint but…
Shay Barak
  • 61
  • 1
  • 4