I have a dataset containing a huge number of pictures (2 million). A lot of pre-processing has been done, and each picture is identified by an id. The ids are generated (FYI, that was easier to code), so some of them do not correspond to an existing file. Because of this, I wrap every attempt to open an image in a try/except block. If the picture does not exist, I write to a log file and add that image's path to a list.

I might try to open the same file more than once (this is actually needed for files that exist). My reasoning was that once a missing picture's identifier is in the list, I no longer need to catch the exception, so the code should run faster: I can check whether the file name is in the list and, if it is, just return None.

Here is the relevant part of the code:

    def __init__(self, real_frames_dataframe, fake_frames_dataframe,
                 augmentations, image_size=224):

        # Intended to speed up training: from the second epoch on, known-missing
        # files are skipped without raising and catching an exception
        self.non_existent_files = []

    def __getitem__(self, index):
        row_real = self.real_df.iloc[index]
        row_fake = self.fake_df.iloc[index]

        real_image_name = row_real["image_path"]
        fake_image_name = row_fake["image_path"]

        # From the second epoch on, known-missing files should be caught here
        if real_image_name in self.non_existent_files or fake_image_name in self.non_existent_files:
            return None

        try:
            img_real = Image.open(real_image_name).convert("RGB")
        except FileNotFoundError:
            log.info("Real Image not found: {}".format(real_image_name))
            self.non_existent_files.append(real_image_name)
            return None
        try:
            img_fake = Image.open(fake_image_name).convert("RGB")
        except FileNotFoundError:
            log.info("Fake Image not found: {}".format(fake_image_name))
            self.non_existent_files.append(fake_image_name)
            return None

The problem is that the same identifier appears in the log file multiple times. For example:

Line 3201: 20:56:27, training.DeepfakeDataset, INFO Real Image not found: nvptcoxzah\nvptcoxzah_260.png
Line 3322: 21:23:13, training.DeepfakeDataset, INFO Real Image not found: nvptcoxzah\nvptcoxzah_260.png

I expected the identifier to be appended to `non_existent_files`, so that the next time I would not even try to open this file. However, that is not what happens. Can anyone explain why?

MichiganMagician
  • Are you using this Dataset through a PyTorch DataLoader? If so, is `num_workers` > 0? `non_existent_files` wouldn't be shared between workers and the multiple log entries might be due to different workers. – adeelh Jan 06 '21 at 02:02
  • That is correct, thanks @adeelh for the answer. – MichiganMagician Jan 06 '21 at 09:05
  • Great. As an aside, I would also recommend using a set instead of a list for `non_existent_files` for faster look-ups. – adeelh Jan 06 '21 at 09:52
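
To expand on the comments above: with `num_workers` > 0, each DataLoader worker process works on its own copy of the dataset object, so an append made in one worker is not visible to the other workers or to the main process. One way around this is to check for missing files once, in the main process, before the DataLoader is created. Below is a minimal sketch of that idea, assuming the paths can be checked up front with `os.path.isfile`. `filter_existing_pairs` is a hypothetical helper name, not part of PyTorch or of the code in the question. It keeps the missing paths in a set, as suggested above, so membership checks are fast.

```python
import os
from typing import Iterable, List, Set, Tuple


def filter_existing_pairs(
    real_paths: Iterable[str],
    fake_paths: Iterable[str],
) -> Tuple[List[Tuple[str, str]], Set[str]]:
    """Hypothetical helper: return (pairs whose files both exist, missing paths).

    Meant to run once in the main process, before any DataLoader workers
    are started, so the result does not depend on per-worker state.
    """
    kept: List[Tuple[str, str]] = []
    missing: Set[str] = set()  # set gives O(1) average membership checks

    for real_path, fake_path in zip(real_paths, fake_paths):
        pair_ok = True
        for path in (real_path, fake_path):
            if path in missing:
                # Already known to be absent: skip the filesystem check.
                pair_ok = False
            elif not os.path.isfile(path):
                missing.add(path)
                pair_ok = False
        if pair_ok:
            kept.append((real_path, fake_path))

    return kept, missing
```

The inputs could be, for example, `real_df["image_path"]` and `fake_df["image_path"]`. If the dataset is then built from the returned pairs, `__getitem__` should never hit a missing file, and each missing path only needs to be logged once from the returned set.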
