I have some small and large PDFs that I'm trying to parse into strings using Python Tika. I run the Tika server locally, and the conversion works fine for files of around 200 MB, but now I have a 1.3 GB PDF. When I try to convert it, parser.from_file('large.pdf') returns None. My guess is that this is a memory issue caused by the large file.
So my basic question is: why does the large PDF return None, and how can I overcome it?
Partial Code Snippets:
import os
import sys
import glob
from tika import tika, parser
from helpers.helper import file_paths
# Set the required path(s)
paths = file_paths()
pdf_path = paths.get('PDF_FILE_PATH')
text_path = paths.get('TEXT_FILE_PATH')
abs_path = os.path.dirname(os.path.join(os.getcwd(), __file__)) + "/server"
# Update the required variables
tika.log_path = os.getenv('TIKA_LOG_PATH', abs_path)
tika.TikaJarPath = os.getenv('TIKA_PATH', abs_path)
tika.TikaFilesPath = abs_path + "/logs"
def get_pdf_string(filename):
    """
    Write string to file
    """
    raw = parser.from_file(pdf_path + filename)
    new_file = filename.split('.')[0] + '.txt'
    with open(text_path + new_file, 'w') as write_encode:
        write_encode.write(raw['content'])
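Because the parser can come back without usable text (as it does for the 1.3 GB file), the write step could check the result before indexing into it. This is only a minimal sketch: extract_content is a hypothetical helper of my own, not part of Tika, and it assumes the result is either None or a dict that may carry 'content' and 'status' keys.

```python
def extract_content(raw):
    """Return the text from a parser result, or raise a descriptive error.

    `raw` is whatever parser.from_file returned. This helper is hypothetical
    and assumes `raw` is either None or a dict-like object.
    """
    if raw is None:
        # Nothing came back at all, e.g. the server failed on the request.
        raise ValueError("Tika returned no result; check the server log")
    content = raw.get('content')
    if content is None:
        # A result came back, but without any extracted text.
        raise ValueError(
            "Tika result has no content (status: %s)" % raw.get('status')
        )
    return content
```

Inside get_pdf_string, write_encode.write(extract_content(raw)) would then fail with a readable message instead of a TypeError on None.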
I also see the following messages, but only when converting the large PDF. What do they mean?
Terminal log (while running the Python file):
[MainThread ] [WARNI] Tika server returned status: 500
Server Log:
WARN /rmeta/text java.lang.OutOfMemoryError: Java heap space
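The server log line says the JVM running Tika ran out of heap space, so one thing I could try is starting the local server with a larger maximum heap through the standard JVM -Xmx option. A rough sketch follows. The jar path and the 4g value are placeholders I made up, both function names are hypothetical, and I don't know how much heap a 1.3 GB PDF actually needs.

```python
import subprocess


def tika_server_command(jar_path, max_heap="4g"):
    """Build the java command line for a Tika server with a larger heap.

    `jar_path` and `max_heap` are assumptions; -Xmx is the standard JVM
    option for the maximum heap size.
    """
    return ["java", "-Xmx" + max_heap, "-jar", jar_path]


def start_tika_server(jar_path, max_heap="4g"):
    """Launch the server as a background process and return its handle."""
    return subprocess.Popen(tika_server_command(jar_path, max_heap))
```

With the server started this way before the script runs, the Python side would stay unchanged as long as it points at that running server.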