I just started using Spark for the first time, for an OCR task: I have a folder of PDF files containing scanned text documents and I want to convert them to plain text. I first create a parallelized dataset of all the PDFs in the folder and perform a map operation to create the images; I use Wand for this. Finally, with a foreach, I do the OCR using pytesseract, which is a wrapper for Tesseract.
The problem I have with this approach is that memory use increases with each new document until I finally get an "os cannot allocate memory" error. I have the feeling it keeps the complete Img object in memory, but all I need is a list of the locations of the temporary files. It works when I run this on a few PDF files, but with more than 5 files the system crashes...
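To be explicit about what I want each map task to hand back: just a small list of (temp-file path, document name) tuples, nothing else. For a three-page scan it should look like this (paths made up):

[("/data/tmp/report/report-0.jpg", "report"),
 ("/data/tmp/report/report-1.jpg", "report"),
 ("/data/tmp/report/report-2.jpg", "report")]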
import os
import sys

import pytesseract
from PIL import Image
from pyspark import SparkContext
from wand.image import Image as WandImage

# note: `path` (the PDF folder, with a trailing slash) is defined elsewhere in my script

def toImage(f):
    documentName = f[:-4]

    def imageList(imgObject):
        # build the list of (image path, document name) tuples for the generated pages
        imagePrefix = "{}tmp/{}/{}".format(path, documentName, documentName)
        if len(imgObject.sequence) > 1:
            # multi-page PDFs are saved as name-0.jpg, name-1.jpg, ...
            images = [("{}-{}.jpg".format(imagePrefix, x.index), documentName)
                      for x in imgObject.sequence]
        else:
            images = [("{}.jpg".format(imagePrefix), documentName)]
        return images

    # render the PDF pages to images in a tmp directory
    with WandImage(filename=path + f, resolution=300) as img:
        # create the tmp directory for this document
        if not os.path.exists(path + "tmp/" + documentName):
            os.makedirs(path + "tmp/" + documentName)
        # save the page images in the tmp directory
        img.format = 'jpeg'
        img.save(filename=path + "tmp/" + documentName + '/' + documentName + '.jpg')
        imageL = imageList(img)
    return imageL
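Since I suspect the Wand Image is what keeps memory growing, one variant I'm considering is destroying the per-frame wrappers explicitly and forcing a garbage-collection pass per task (destroy() is inherited from wand.resource.Resource; whether this actually returns ImageMagick's buffers to the OS is exactly what I'm unsure about). A sketch, with the tmp-directory creation omitted for brevity and toImageEager a hypothetical name:

import gc
from wand.image import Image as WandImage

def toImageEager(f):
    # hypothetical variant of toImage: free each frame before the task returns
    documentName = f[:-4]
    imagePrefix = "{}tmp/{}/{}".format(path, documentName, documentName)
    with WandImage(filename=path + f, resolution=300) as img:
        img.format = 'jpeg'
        img.save(filename=path + "tmp/" + documentName + '/' + documentName + '.jpg')
        if len(img.sequence) > 1:
            pages = [("{}-{}.jpg".format(imagePrefix, fr.index), documentName)
                     for fr in img.sequence]
        else:
            pages = [("{}.jpg".format(imagePrefix), documentName)]
        for frame in img.sequence:
            frame.destroy()   # explicitly drop each page's ImageMagick frame
    # the `with` block destroys `img` itself on exit
    gc.collect()              # nudge CPython to release the freed memory
    return pages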
def doOcr(imageList):
    print(imageList[0][1])
    # OCR every page image and join the pages with a marker
    content = "\n\n***NEWPAGE***\n\n".join(
        [pytesseract.image_to_string(Image.open(fullPath), lang='nld')
         for fullPath, documentName in imageList])
    with open(path + "/txt/" + imageList[0][1] + ".txt", "w") as text_file:
        text_file.write(content)
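One detail I already wondered about here: Image.open keeps a file handle around until the object is garbage-collected, so a variant that closes each page right after OCR would look like this (Pillow images support the context-manager protocol; ocrPages is a hypothetical name):

def ocrPages(imageList):
    # hypothetical variant of doOcr that closes each page image right after OCR
    texts = []
    for fullPath, documentName in imageList:
        with Image.open(fullPath) as page:  # released as soon as the block exits
            texts.append(pytesseract.image_to_string(page, lang='nld'))
    return "\n\n***NEWPAGE***\n\n".join(texts)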
sc = SparkContext(appName="OCR")
pdfFiles = sc.parallelize([f for f in os.listdir(sys.argv[1]) if f.endswith(".pdf")])
pdfFiles.map(toImage).foreach(doOcr)
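Since the memory grows across documents, I also wonder whether the reused Python worker processes are what accumulate the leak; if so, disabling worker reuse (the spark.python.worker.reuse setting, which defaults to true) should give every task a fresh interpreter. A sketch of how I would configure that, assuming that is the cause:

from pyspark import SparkConf, SparkContext

# hypothetical tweak: one fresh Python worker per task instead of reusing them
conf = (SparkConf()
        .setAppName("OCR")
        .set("spark.python.worker.reuse", "false"))
sc = SparkContext(conf=conf)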
I'm using Ubuntu with 8 GB of memory, Java 7, and Python 3.5.