I have some small and large PDFs that I'm trying to convert to strings using Python Tika. I run a local Tika server, and the conversion works fine for files of around 200 MB, but now I have a 1.3 GB PDF. When I try to convert it, parser.from_file(large.pdf) returns None. My guess is that this is a memory issue caused by the large file.

So my basic question is: why does parsing the large PDF return None, and how can I overcome it?

Partial Code Snippets:

import os
import sys
import glob
from tika import tika, parser
from helpers.helper import file_paths

# Set the required path(s)
paths = file_paths()
pdf_path = paths.get('PDF_FILE_PATH')
text_path = paths.get('TEXT_FILE_PATH')
abs_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "server")

# Point tika-python at the local server jar and log directories
tika.log_path = os.getenv('TIKA_LOG_PATH', abs_path)
tika.TikaJarPath = os.getenv('TIKA_PATH', abs_path)
tika.TikaFilesPath = os.path.join(abs_path, "logs")

def get_pdf_string(filename):
    """
    Extract the text of a PDF with Tika and write it to a .txt file
    """
    raw = parser.from_file(os.path.join(pdf_path, filename))
    # For the 1.3 GB file the server answers with status 500 and no content
    if not raw or raw.get('content') is None:
        raise RuntimeError('Tika returned no content for ' + filename)
    new_file = os.path.splitext(filename)[0] + '.txt'
    with open(os.path.join(text_path, new_file), 'w') as write_encode:
        write_encode.write(raw['content'])

I'm also seeing the following messages, but only when converting the large PDF. What do they mean?

Terminal Log: while running python file

[MainThread ] [WARNI] Tika server returned status: 500

Server Log:

WARN /rmeta/text java.lang.OutOfMemoryError: Java heap space

A l w a y s S u n n y

1 Answer

You can try splitting the PDF into pages, for instance with PDFBox, and then sending it to Tika page by page.
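A rough Python sketch of the same idea follows. It swaps PDFBox for the pypdf package (an assumption, any PDF splitter would do) and sends the pages in small ranges rather than one by one, so each request to the Tika server stays small. The names page_ranges, extract_text_in_chunks and the chunk size of 50 are made up for illustration. Note that pypdf still has to open the whole file on the Python side, so this only relieves the Java heap.

```python
import io


def page_ranges(page_count, chunk_size):
    """Yield (start, stop) page index pairs that cover range(page_count)."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, page_count, chunk_size):
        yield start, min(start + chunk_size, page_count)


def extract_text_in_chunks(pdf_file, chunk_size=50):
    """Return the text of pdf_file, parsed by Tika one page range at a time."""
    # Imported here so page_ranges stays usable without these packages.
    # Assumption: pypdf and tika-python are installed.
    from pypdf import PdfReader, PdfWriter
    from tika import parser

    reader = PdfReader(pdf_file)
    pieces = []
    for start, stop in page_ranges(len(reader.pages), chunk_size):
        # Copy the pages of this range into a small in-memory PDF.
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        buffer = io.BytesIO()
        writer.write(buffer)

        parsed = parser.from_buffer(buffer.getvalue())
        # 'content' can be missing or None when a chunk yields no text.
        pieces.append((parsed or {}).get('content') or '')
    return ''.join(pieces)
```

Something like extract_text_in_chunks('large.pdf') would then stand in for the single parser.from_file call. If a chunk still triggers the OutOfMemoryError, a smaller chunk_size is the first thing to try.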

marek.kapowicki
  • Thanks for the suggestion, sir. I was also thinking of splitting the PDF into multiple parts, but I thought there should be a way to increase the Java heap space so that Tika can parse a large PDF. – A l w a y s S u n n y Jan 09 '21 at 01:20
  • You can either run Tika as a Docker image (https://medium.com/@masreis/text-extraction-and-ocr-with-apache-tika-302464895e5f), in which case you define the Docker parameters, or, if Tika is a third-party library (JAR) used by your Java app, you need to tune the JVM of your app. – marek.kapowicki Jan 09 '21 at 10:08
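For a Tika server that is started locally from the jar, tuning the JVM mostly means passing a larger -Xmx value. A minimal sketch, assuming java is on the PATH; the helper names, the 4g heap and the jar path are placeholders, not values from the question:

```python
import subprocess


def tika_server_command(jar_path, max_heap="4g", port=9998):
    """Build the command line that starts a Tika server with a larger heap."""
    # -Xmx must come before -jar so that the JVM, not Tika, receives it.
    return ["java", f"-Xmx{max_heap}", "-jar", jar_path, "--port", str(port)]


def start_tika_server(jar_path, max_heap="4g", port=9998):
    """Start the server in the background and return the process handle."""
    return subprocess.Popen(tika_server_command(jar_path, max_heap, port))
```

The Python client can then be pointed at that server, for example through the TIKA_SERVER_ENDPOINT and TIKA_CLIENT_ONLY environment variables described in the tika-python README. As far as I know, recent tika-python releases also read a TIKA_JAVA_ARGS environment variable (for example -Xmx4g) when they start the server themselves, but check that your installed version supports it.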