
I just started using Spark for the first time, for an OCR task. I have a folder of PDF files containing scanned text documents, and I want to convert them to plain text. I first create a parallelized dataset of all the PDFs in the folder and perform a map operation to create the images, using Wand for this step. Finally, in a foreach, I do the OCR using pytesseract, which is a wrapper for Tesseract.

The problem with this approach is that memory use increases with each new document, and eventually I get the error "os cannot allocate memory". I have the feeling it keeps the complete Img object in memory, but all I need is a list of the locations of the temporary files. If I run this with a few PDF files it works, but with more than 5 files the system crashes.

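import os
import sys

import pytesseract
from PIL import Image
from pyspark import SparkContext
from wand.image import Image as WandImage

# NOTE: `path` (the input folder, ending in a path separator) is used below
# but its definition is not shown in this post.

def toImage(f):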
    documentName = f[:-4]

    def imageList(imgObject):       
        #get list of generated images
        imagePrefix = "{}tmp/{}/{}".format(path,documentName,documentName)

        if len(img.sequence) > 1:   
            images = [ ("{}-{}.jpg".format(imagePrefix, x.index), documentName) for x in img.sequence]
        else:
            images = [("{}.jpg".format(imagePrefix), documentName)]
        return images

    #store images for each file in tmp directory
    with WandImage(filename=path + f, resolution=300) as img:
        #create tmp directory
        if not os.path.exists(path + "tmp/" +  documentName):
            os.makedirs(path + "tmp/" +  documentName)

        #save images in tmp directory
        img.format = 'jpeg'
        img.save(filename=path + "tmp/" +  documentName + '/' + documentName + '.jpg')  
        imageL =  imageList(img)
        return imageL


def doOcr(imageList):
    print(imageList[0][1])
    content = "\n\n***NEWPAGE***\n\n".join([pytesseract.image_to_string(Image.open(fullPath), lang='nld') for fullPath, documentName in imageList])
    with open(path + "/txt/" + imageList[0][1] + ".txt", "w") as text_file:
        text_file.write(content)

sc = SparkContext(appName="OCR")
pdfFiles = sc.parallelize([f for f in os.listdir(sys.argv[1]) if f.endswith(".pdf")])
# foreach returns None, so there is nothing to assign
pdfFiles.map(toImage).foreach(doOcr)

I'm using Ubuntu with 8 GB of memory, Java 7, and Python 3.5.

Chris p
  • 2
    What makes you think there is a memory leak? If you read files in parallel, there will be multiple files loaded at the same time. Also, processing large objects is usually not the best use case for Spark, especially with restricted resources like here. Finally, how do you set `spark.python.worker.memory`? – zero323 Mar 11 '16 at 09:33
  • I have set the memory in the conf/spark-defaults.conf file. I think there is a memory leak because I can see that files are finished with the OCR part and there is no decrease in used memory. – Chris p Mar 11 '16 at 10:27
  • What version of ImageMagick & Wand are you using? Several memory leak issues have been addressed in recent years. – emcconville Mar 13 '16 at 16:51
  • I have been using the latest version of both. I finally found a solution; I will update my question with the solution/fix. – Chris p Mar 17 '16 at 09:55
  • 1
    @Chrisp if you have answered your own question, don't update the question; post the solution as an answer. This will help future users. – Blake Lockley Mar 17 '16 at 10:00

1 Answer


Update

I found a solution. The problem appears to be in the part where I create the image list. Using:

# requires: from natsort import natsorted
def imageList(imgObject):
    #get list of generated images
    # imagePrefix = "{}tmp/{}/{}".format(path,documentName,documentName)

    # if len(img.sequence) > 1:
    #   images = [ ("{}-{}.jpg".format(imagePrefix, x.index), documentName) for x in img.sequence]
    # else:
    #   images = [("{}.jpg".format(imagePrefix), documentName)]

    # list the saved page images from disk instead of iterating img.sequence
    fullPath = "{}tmp/{}/".format(path, documentName)
    images = [(fullPath + f, documentName) for f in os.listdir(fullPath) if f.endswith(".jpg")]

    return natsorted(images, key=lambda y: y[0])

works perfectly, but I'm not sure why. Everything gets closed, but it still remains in memory.
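
One possible explanation, offered as an assumption and not confirmed anywhere in this thread: iterating `img.sequence` makes Wand create a separate per-page object for every frame of the PDF, and those objects may hold on to memory longer than the surrounding `with` block suggests. Listing the saved files from disk avoids touching the sequence at all. Below is a minimal, standard-library-only sketch of that idea. The names `natural_key` and `list_page_images` are hypothetical helpers for illustration; `natural_key` stands in for `natsorted` so the example runs without extra packages.

```python
import os
import re


def natural_key(name):
    # Split "doc-10.jpg" into ["doc-", 10, ".jpg"] so that numeric parts
    # are compared as integers rather than as strings.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def list_page_images(page_dir, document_name):
    # Hypothetical helper: build (full_path, document_name) tuples from the
    # .jpg files already written to page_dir, in natural page order.
    # No image object is opened or iterated here.
    names = [n for n in os.listdir(page_dir) if n.endswith(".jpg")]
    names.sort(key=natural_key)
    return [(os.path.join(page_dir, n), document_name) for n in names]
```

Called right after `img.save(...)`, something like `list_page_images(path + "tmp/" + documentName, documentName)` would return the same kind of list that `doOcr` expects, with page 10 sorted after page 2 instead of after page 1.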

Chris p