
I have photos spread across folders and subfolders in a local directory. I also have a dataframe that contains the names and paths of those photos. I would like to cross-check (1) whether a path/name exists in the df but the photo doesn't exist on disk, and (2) whether a photo exists on disk but its path/name isn't in the df.

This is what I've done so far:

(1) the path/name exists in df and the photo doesn't exist

import os
import pandas as pd

missing_general_images = []
for index, row in bq_df.iterrows():
    path_download = os.path.join('/home/jupyter/Downloads/multimedia',
                                 row['form_id'], row['general_image_name'])
    if not os.path.exists(path_download):
        missing_general_images.append(row)

missing_general_images_df = pd.DataFrame(missing_general_images)
missing_general_images_df.to_csv(r'/home/jupyter/missing_general_images.csv', index=False, header=True)
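
For reference, this first check could probably also be done without iterrows, by building the candidate paths as a Series and mapping os.path.exists over it. A minimal sketch, assuming form_id and general_image_name are plain string columns:

paths = ('/home/jupyter/Downloads/multimedia/'
         + bq_df['form_id'].astype(str) + '/'
         + bq_df['general_image_name'])
# keep the rows whose constructed path does not exist on disk
missing_general_images_df = bq_df[~paths.map(os.path.exists)]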

(2) whether the photo exists and the path/name doesn't exist in df

rootdir = '/home/jupyter/Downloads/multimedia'

missing_table_values = []
for subdir, dirs, files in os.walk(rootdir):
    # prune the two folders to skip in place (directory names are strings)
    dirs[:] = [d for d in dirs if d not in ('567196', '493841')]
    for file in files:
        # linear search through the dataframe for every single file
        found = False
        for index, row in bq_df.iterrows():
            if file == row['image_name']:
                found = True
                break
        if not found:
            missing_table_values.append(file)

missing_table_values_df = pd.DataFrame(missing_table_values)
missing_table_values_df.to_csv(r'/home/jupyter/missing_table_values.csv', index=False, header=True)

The problem is the second part of the code: it takes forever to build the list of missing values from the dataframe. I guess that's because it has to iterate over every folder, subfolder and file, and there are about 40,000 files (roughly 20 GB).

Can you recommend a faster way, or how can I speed up the process with the current code? Thanks a lot!

EDIT:

I made a list out of the folders, subfolders and photo names and intersected it with the dataframe names/paths.

rootdir = '/home/jupyter/Downloads/multimedia'

list_of_photos = []
for path, subdirs, files in os.walk(rootdir):
    for name in files:
        list_of_photos.append(os.path.join(path, name))

# build the lookup once; membership tests on a set are O(1) on average
paths_in_df = set(bq_df['image_path'])

missing_table_values = []
for name in list_of_photos:
    if name not in paths_in_df:
        missing_table_values.append(name)
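
Once both sides are sets, the two cross-checks reduce to set differences. A minimal sketch, assuming bq_df['image_path'] holds the same absolute paths that os.walk produces:

photos_on_disk = set(list_of_photos)
paths_in_df = set(bq_df['image_path'])

in_df_but_missing_on_disk = paths_in_df - photos_on_disk   # check (1)
on_disk_but_missing_in_df = photos_on_disk - paths_in_df   # check (2)
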
  • You check `row['image_name'] is not None` twice, even though it can't have changed in between. You do a linear search through the dataframe for every file. Build a lookup table once, or even just a set of filenames from the dataframe, which will have faster checking. For each directory you could build a set of files, and calculate the intersection of sets. – Peter Wood Jun 14 '21 at 06:48
  • Thanks a lot. Yes, I've already corrected it, but as I said, the problem is in the second part. I'll try to do what you've suggested. – KayEss Jun 14 '21 at 06:53
  • Most of what I've suggested is for the second part. – Peter Wood Jun 14 '21 at 06:55
  • It's not really clear what you're asking. I don't know what a dataframe is. What does `bq_df['general_image_path'].index` mean? Why would it be a filepath? – Peter Wood Jun 14 '21 at 07:54
  • Also, if you're only checking for existence in a set of file paths, a list is the wrong data type as it would have to do a linear search. Use a `set` which is likely to support binary search, or even a hash table lookup which is O(1). See https://wiki.python.org/moin/TimeComplexity – Peter Wood Jun 14 '21 at 07:55
  • Thanks, I've done an edit; still don't know if it's OK... – KayEss Jun 14 '21 at 08:13
  • What do you imagine `index.isin` does? – Peter Wood Jun 14 '21 at 14:19
  • I removed index. I guess this is it now? Thanks for your guidance. – KayEss Jun 15 '21 at 05:26
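
A minimal sketch of what the comments suggest, working at the filename level: build the set of known names from the dataframe once, then make a single pass over the tree. The two skipped folder names are taken from the question and assumed to be directory-name strings:

known_names = set(bq_df['image_name'])

missing_table_values = []
for subdir, dirs, files in os.walk(rootdir):
    # prune the two folders in place so os.walk doesn't descend into them
    dirs[:] = [d for d in dirs if d not in ('567196', '493841')]
    # any file whose name is not in the dataframe is missing from the table
    missing_table_values.extend(f for f in files if f not in known_names)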
