Questions tagged [tika-server]

90 questions
0 votes, 2 answers

Tika Server not reading embedded images in PDFs

Hi, Tika Server is set up with Tesseract, but it still is not reading embedded images in PDFs. I tried using the two available headers, but that did not help. This happens for PDF files only, while OCR works for other file types/images. Using customized…
S. Das
0 votes, 0 answers

Tika server returned status: 404

I'm trying to set up Tika for text extraction using Python. I've installed the Java runtime (JRE 1.8.0), installed tika with pip install tika==1.23, downloaded the Tika server jar file from this link, and, as mentioned in this page, I've added variable…
0 votes, 0 answers

Python Tika error: URLError:

I've been using python tika a lot to extract text from some PDFs. Suddenly Tika no longer works with the following code and similar: from tika import parser document = parser.from_file("prova.pdf")['content'] or from tika import…
0 votes, 1 answer

Empty Parser and Tika Server mode

I am having trouble understanding how parsers are loaded into Tika. From their documentation it appears that Tika-app comes prepackaged with the parsers (https://tika.apache.org/1.17/gettingstarted.html). When I run this command to start the server…
mlanier
0 votes, 1 answer

Tika extra space between letters - is there any way to use setEnableAutoSpace via Web API?

I'm running the stock Apache Tika 1.24.1 Server (tika-server-1.24.1.jar). My ASP.NET MVC web app then gets the parsed documents back from Tika using this VB.net code: httpWebRequest =…
Taraz
0 votes, 0 answers

org.apache.tika.utils.XMLReaderUtils acquireSAXParser WARNING: Contention waiting for a SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE

When running Nutch jobs, the following is shown: Oct 13, 2020 8:46:18 AM org.apache.tika.utils.XMLReaderUtils acquireSAXParser WARNING: Contention waiting for a SAXParser. Consider increasing the XMLReaderUtils.POOL_SIZE. May I know what it means? I am using…
Ravi Kiran
0 votes, 1 answer

Apache Tika: How to save console log to a file? Use log4j?

Apache Tika 1.24.1. I read that there is a logging facility called log4j, but didn't find a quick example to copy. Does Tika have a command-line argument to save console logs to a file? Thanks.
freeAR
0 votes, 1 answer

Apache Tika - MediaDataBox ISO files

It seems that Apache Tika 1.24.1 is creating lots of /tmp/MediaDataBox ISO files, and my /tmp partition fills up. What is the MediaDataBox ISO file used for? Can we somehow tell Tika to save it in another directory? Tika runs in server mode as…
freeAR
0 votes, 0 answers

Unable to parse MP4 files - MemoryAllocationException: Tried to allocate X bytes, but the limit for this record type is: Y

I am using Tika server to fetch metadata and contents of various file formats. I am using the server with fileUrl enabled. When parsing .mov files created with QuickTime screen recording, it gives me the following error. Text extraction failed…
Balu
0 votes, 1 answer

How to ignore scanned images in Tika

I'm trying to parse PDF files in Tika. For some handwritten scanned documents, Tika parses the file and returns garbage text that does not make sense. I'm using the python tika wrapper from here. Is there some way to ignore PDFs that contain…
pramesh
0 votes, 0 answers

How to export paragraphs in one line using Apache Tika

I am passing a PDF document to the Apache Tika software, which is in this format: PDF document with paragraphs like these: (iii) 50% of Text Text Text Text Text Text Text Text Text Text Text Text Text. Text Text Text Text Text Text Text 1 Text…
mshikher
0 votes, 2 answers

NiFi Parse PDF using Python Tika error: ExecuteStreamCommand

I'm trying to do the following, but I'm getting errors on my ExecuteStreamCommand: Cannot run program "C:\Python36\pythonscript.py" error=193 not a valid Win32 application. This is being run on my home Windows workstation. GetFile (Get my…
0 votes, 1 answer

Tika parser python with Docker giving RuntimeError: Details: Unable to start Tika server

Without Docker, the scripts are able to parse the PDF files using Tika. However, when I try with Docker, I get the following error for the Tika server not running: After some reading I tried the following, but the error persists. Can some…
Space X
0 votes, 1 answer

TIKA Server extract embedded resources

I'm running some tests with the Tika app (v 1.23) to extract embedded resources from the input file, which works great when specifying the -z parameter on the command line. This parameter enables embedded resource extraction and writes…
TVA van Hesteren
0 votes, 1 answer

How to set TIKA_SERVER_ENDPOINT from tika-python lib

The documentation of the excellent tika-python library at https://github.com/chrismattmann/tika-python shows that it is possible to set the tika_server.jar file to avoid downloading it with each use of the algorithm. Has anyone done this and can post the…
erfelipe