Tika server with Python returns None for large file but works fine with small PDFs

Question

I have some small and large PDFs that I'm trying to convert to text using the Python tika package. I run the Tika server locally, and the conversion works fine for files of around 200 MB, but now I have a 1.3 GB PDF. When I try to convert it, parser.from_file(large.pdf) returns None. My guess is that this is a memory issue with the large file.

So my basic question is: why does the large PDF return None, and how can I overcome it?

Partial Code Snippets:

import os
import sys
import glob
from tika import tika, parser
from helpers.helper import file_paths

# Set the required path(s)
paths = file_paths()
pdf_path = paths.get('PDF_FILE_PATH')
text_path = paths.get('TEXT_FILE_PATH')
abs_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "server")

# Point tika-python at the local server jar and log locations
tika.log_path = os.getenv('TIKA_LOG_PATH', abs_path)
tika.TikaJarPath = os.getenv('TIKA_PATH', abs_path)
tika.TikaFilesPath = os.path.join(abs_path, "logs")

def get_pdf_string(filename):
    """
    Extract the text of a PDF with Tika and write it to a .txt file
    """
    raw = parser.from_file(os.path.join(pdf_path, filename))
    content = raw.get('content') if raw else None
    if content is None:
        # Without this check, write() fails on the large PDF
        raise ValueError('Tika returned no content for ' + filename)
    new_file = os.path.splitext(filename)[0] + '.txt'
    with open(os.path.join(text_path, new_file), 'w') as write_encode:
        write_encode.write(content)

I'm also seeing the following messages, but only when converting the large PDF. What do they mean?

Terminal Log: while running python file

[MainThread ] [WARNI] Tika server returned status: 500

Server Log:

WARN /rmeta/text java.lang.OutOfMemoryError: Java heap space

@KlausD. Yes sir, I added the tika-server.log output showing the Java heap space memory issue. – A l w a y s S u n n y, Jan 07 '21 at 01:24
Yep, the server ran out of memory. Your problem is unrelated to your code. – Klaus D., Jan 07 '21 at 01:27
I’m voting to close this question because the problem is not caused by the code but by a remote server error. – Klaus D., Jan 07 '21 at 01:29
No, the server is on my local PC, which has 16 GB of RAM. Can I tweak the Java heap space to get rid of the OutOfMemoryError? – A l w a y s S u n n y, Jan 07 '21 at 01:29
Setting up the remote server is where we leave the scope of this question about Python code and the topics handled by SO. – Klaus D., Jan 07 '21 at 01:32

score 0 · Answer 1 · answered Jan 08 '21 at 12:18

You can try splitting the PDF into pages, for instance with PDFBox, and then sending it to Tika page by page.
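
The page-by-page idea could be sketched in Python roughly as follows. This swaps PDFBox for the third-party pypdf package (an assumption, not the tool named above) and sends batches of pages to Tika with parser.from_buffer. The helper names page_batches and extract_in_batches and the batch size of 50 are hypothetical. pypdf still has to open the whole file, and the server heap must still be large enough for the heaviest batch.

```python
import io


def page_batches(total_pages, batch_size):
    """Yield (start, stop) page index ranges that cover total_pages."""
    for start in range(0, total_pages, batch_size):
        yield start, min(start + batch_size, total_pages)


def extract_in_batches(pdf_file, batch_size=50):
    """Send a large PDF to Tika in batches of pages and join the text."""
    # Imported lazily so the batching helper works without these packages.
    from pypdf import PdfReader, PdfWriter  # assumption: pypdf is installed
    from tika import parser

    reader = PdfReader(pdf_file)
    chunks = []
    for start, stop in page_batches(len(reader.pages), batch_size):
        # Build a small in-memory PDF that holds only this batch of pages.
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        buffer = io.BytesIO()
        writer.write(buffer)
        parsed = parser.from_buffer(buffer.getvalue())
        # 'content' can be missing or None when the server fails.
        chunks.append((parsed or {}).get('content') or '')
    return ''.join(chunks)
```

Smaller batches lower the peak memory per request at the cost of more round trips to the server.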

marek.kapowicki

Thanks for the suggestion sir, I was also thinking of splitting the PDF into multiple parts, but I thought there should be a way to increase the Java heap space so that Tika can parse a large PDF. – A l w a y s S u n n y Jan 09 '21 at 01:20
You can either run Tika as a Docker image: https://medium.com/@masreis/text-extraction-and-ocr-with-apache-tika-302464895e5f - then you set the memory through the Docker parameters. Or Tika is a 3rd-party library (jar) used by your Java app - then you need to tune the JVM of your app. – marek.kapowicki Jan 09 '21 at 10:08
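
When the server is started by the tika Python package itself, as in the question, there is a third route: the package's README lists a TIKA_JAVA_ARGS environment variable for JVM options, with -Xmx4g as its example. Below is a minimal sketch, assuming the installed version honors that variable. The helper names set_tika_heap and parse_large_pdf and the 8g value are hypothetical. As far as I know the variable is read when tika is imported, so it has to be set before the import, and a server that is already running keeps its old heap until it is restarted.

```python
import os


def set_tika_heap(max_heap="8g", env=None):
    """Ask tika-python to start the server JVM with a larger maximum heap."""
    env = os.environ if env is None else env
    # -Xmx is the standard JVM flag for the maximum heap size.
    env["TIKA_JAVA_ARGS"] = "-Xmx" + max_heap
    return env["TIKA_JAVA_ARGS"]


def parse_large_pdf(pdf_file):
    """Parse a PDF after raising the heap limit; returns the text or None."""
    set_tika_heap()
    # Import only after the variable is set (assumed to be read at import).
    from tika import parser
    parsed = parser.from_file(pdf_file)
    return (parsed or {}).get("content")
```

On a machine with 16 GB of RAM, a value such as 8g leaves room for the operating system and the Python process; the right number depends on the PDF.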
