I have photos spread across folders and subfolders in a local directory. I also have a dataframe that contains the names and paths of those photos. I would like to cross-check (1) whether a path/name exists in the df but the photo doesn't exist on disk, and (2) whether a photo exists on disk but its path/name doesn't exist in the df.
This is what I've done so far:
(1) the path/name exists in df and the photo doesn't exist
import os
import pandas as pd

missing_general_images = []
for index, row in bq_df.iterrows():
    # Expected location of the photo on disk for this row
    path_download = os.path.join('/home/jupyter/Downloads/multimedia', row['form_id'], row['general_image_name'])
    if os.path.exists(path_download):
        pass
    else:
        missing_general_images.append(row)
missing_general_images_df = pd.DataFrame(missing_general_images)
missing_general_images_df.to_csv(r'/home/jupyter/missing_general_images.csv', index=False, header=True)
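For comparison, here is a minimal sketch of the same check without `iterrows()`. The helper name `missing_file_mask` is hypothetical. It assumes the folder and file name values can be turned into strings with `str()`, which matters if `form_id` is stored as an integer.

```python
import os

def missing_file_mask(base_dir, folders, names):
    """Return a list of booleans: True where base_dir/folder/name is NOT on disk.

    Hypothetical helper; folders and names are two equally long iterables,
    e.g. two dataframe columns.
    """
    mask = []
    for folder, name in zip(folders, names):
        # str() guards against numeric folder ids (an assumption about the data)
        path = os.path.join(base_dir, str(folder), str(name))
        mask.append(not os.path.exists(path))
    return mask
```

With the dataframe above this could be used as `bq_df[missing_file_mask('/home/jupyter/Downloads/multimedia', bq_df['form_id'], bq_df['general_image_name'])]`, since a boolean list of the same length as the dataframe selects the matching rows.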
(2) whether the photo exists and the path/name doesn't exist in df
rootdir = '/home/jupyter/Downloads/multimedia'
missing_table_values = []
for subdir, dirs, files in os.walk(rootdir):
    if dirs==[567196, 493841]:
        continue
    else:
        for file in files:
            for index, row in bq_df.iterrows():
                if file == row['image_name']:
                    continue
                else:
                    missing_table_values.append(file)
missing_table_values_df = pd.DataFrame(missing_table_values)
missing_table_values_df.to_csv(r'/home/jupyter/missing_table_values.csv', index=False, header=True)
The problem is the second part of the code: it takes extremely long to build the list of files that are missing from the dataframe. I guess that's because it has to iterate over every folder, subfolder and file, and for each file over every row of the dataframe. There are about 40,000 files (roughly 20 GB).
Can you recommend a faster approach, or a way to speed up the current code? Thanks a lot!
EDIT:
I built a list of full file paths from the folders and subfolders, and tried to intersect it with the names/paths in the dataframe:
rootdir = '/home/jupyter/Downloads/multimedia'
list_of_photos = []
for path, subdirs, files in os.walk(rootdir):
    for name in files:
        list_of_photos.append(os.path.join(path, name))
missing_table_values = []
for name in list_of_photos:
    if bq_df['image_path'].isin(list_of_photos) is False:
        missing_table_values.append(name)
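One thing to note about the attempt above: `Series.isin(...)` returns a boolean Series, and a Series is never the object `False`, so the `is False` test never passes and nothing gets appended. A possible alternative is a plain set difference. The sketch below uses a hypothetical helper name, `find_untracked_photos`. It assumes `image_path` holds full paths built the same way as `os.path.join(path, name)`.

```python
import os

def find_untracked_photos(rootdir, known_paths):
    """Return sorted paths of files under rootdir that are not in known_paths.

    Hypothetical helper; known_paths is any iterable of full file paths,
    e.g. a dataframe column.
    """
    known = set(known_paths)  # set membership checks are O(1) on average
    on_disk = set()
    for path, subdirs, files in os.walk(rootdir):
        for name in files:
            on_disk.add(os.path.join(path, name))
    # Files present on disk but absent from the known paths
    return sorted(on_disk - known)
```

It could be called as `missing_table_values = find_untracked_photos(rootdir, bq_df['image_path'])`. This walks the directory tree once and never opens the files, so the work grows with the number of files rather than with files times dataframe rows.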