I have a script that uses the MTCNN face detection library and iterates through a fair number of directories, totaling thousands of images. The issue I've been running into is excessive memory usage when processing all of these images, which eventually causes my MacBook (16 GB of RAM) to run out of memory. What I'm looking to do is implement batching on a folder-by-folder basis rather than with a fixed batch size, because no single folder contains enough images to make the system run out of memory on its own.

import csv
import os

from matplotlib import pyplot
from mtcnn import MTCNN

# open up the csv file and write the header
with open(csv_path, 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(['Index', 'Threshold', 'Path'])

inc = 0

for path, subdirs, files in os.walk(path):
    for name in files:
        if name == '.DS_Store':
            print("Skipping .DS_Store")
            continue

        try:
            image = os.path.join(path, name)
            pixels = pyplot.imread(image)

            print("Processing " + image)
            print("Count: " + str(inc))

            # calculate the area of the image
            total_height = pixels.shape[0]
            total_width = pixels.shape[1]
            total_area = total_height * total_width

            # create the detector, using default weights
            detector = MTCNN()
            faces = detector.detect_faces(pixels)

            ax = pyplot.gca()

            face_total_area = 0

            if not faces:
                print("No faces detected.")
                # pass in 0 for the threshold because there are no faces
                #write_to_csv(inc, 0, image)
                print()
            else:
                for face in faces:
                    # get dimensions from the face
                    x, y, width, height = face['box']

                    # calculate the area of the face
                    face_area = width * height
                    face_total_area += face_area

                threshold = face_total_area / total_area

                # write to csv only if the threshold is less than the limit
                # change back to this eventually ^^^^^^^^^
                if threshold > threshhold_limit:
                    print("Facial area is over the threshold - writing file path to csv.")
                    write_to_csv(inc, threshold, image)
                else:
                    print("Image threshold is under the limit - good")

                print(threshold)
                print()

            inc += 1
        except Exception as exc:
            print("Processing error - skipping image: " + str(exc))

Is something like this possible, or should it be done a different way? The idea is that batching per folder will let MTCNN release the memory it's holding once it's done processing that folder.

Matt Neis

1 Answer

Memory usage should not increase with this program, because it does not accumulate data from one image to the next. So what you are asking for will have no effect. Have you tried running this same code outside of a Python notebook, as a standalone program? It may be that the notebook is keeping references to all the images it has read. Either that, or find a call that really resets pyplot's internal state inside the innermost loop (maybe pyplot.clf()).
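One way to apply that inside the innermost loop is to close any figure state explicitly after each image is read. This is only a sketch, assuming pyplot is what's holding the references; `read_pixels` is a hypothetical helper name, not part of your code:

```python
from matplotlib import pyplot

def read_pixels(image_path):
    """Read an image, then drop any figure state pyplot may be holding."""
    pixels = pyplot.imread(image_path)
    # close('all') is the heavier-handed sibling of clf(): it discards
    # every open figure rather than just clearing the current one
    pyplot.close('all')
    return pixels
```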

"Batching" as you say is what takes place inside the first for loop, which will run once for each folder in your tree. The only bennefit you could possibly have would be to reset the internal state inside the first loop, but outside the second for (for name in ...), you'd have to find the exactly same call to reset the internal state.

(Also, on a side note: the csv writer you create in the with block is invalidated at the end of that block. You should refactor this code so it doesn't reopen the CSV file for each new line, which presumably happens in the not-shown write_to_csv function.)
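Concretely: open the file once and keep the writer alive for the whole run, instead of reopening inside write_to_csv. A minimal sketch of the shape, where `rows` stands in for whatever your loop produces:

```python
import csv

def write_results(csv_path, rows):
    # one open/close for the whole run: the writer stays valid for
    # every row instead of being invalidated when the with block ends
    with open(csv_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Index', 'Threshold', 'Path'])
        for index, threshold, image in rows:
            writer.writerow([index, threshold, image])
```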

jsbueno