I have a process that collects reports generated throughout the week and consolidates the collection to eliminate identical reports.
I've written a function that identifies identical reports by finding those with identical indices, keeps one report from each group of duplicates, and discards the rest. It works fine for 5,000-10,000 reports, but it takes a serious amount of time to get through, say, 50,000+ reports, which will become more and more common as time goes on.
It would be nice if I could eliminate the duplicates pre-emptively and skip this step entirely, but the process that generates the reports doesn't allow for that. So I want to find a way to make this function, or a similar one, more efficient.
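To make "identical" concrete, here is a minimal pair-wise version of the check my function performs. The data here is made up for illustration; in my pipeline each report is a pandas DataFrame, and two reports count as duplicates when their indices contain exactly the same labels:

import pandas as pd

# Hypothetical toy reports; only the index matters for the comparison.
report_a = pd.DataFrame({'val': [1, 2]}, index=['2019-01-01', '2019-01-02'])
report_b = pd.DataFrame({'val': [1, 2]}, index=['2019-01-01', '2019-01-02'])

# The test my function runs for each pair of reports:
same = len(set(report_a.index).symmetric_difference(report_b.index)) == 0
print(same)  # True, so one of the two would be dropped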
The code is below:
def report_diff_index(self, dnc_data, folders):
    master_report_dict, master_val_dict = self.report_orderer(folders)
    sorts = self.report_sorter(dnc_data, master_report_dict)

    keys = list(sorts)
    # Copy rather than alias: removing entries from consolidated_sorts
    # must not mutate `keys` while we are still iterating over it.
    consolidated_sorts = list(keys)
    print('Original Report Size: ', len(consolidated_sorts))

    for k in keys:
        if k in consolidated_sorts:
            # Compare k against every report that comes after it.
            for j in keys[keys.index(k) + 1:]:
                if j in consolidated_sorts:
                    # Reports are duplicates when their indices hold
                    # exactly the same labels.
                    if len(set(sorts[k].index).symmetric_difference(sorts[j].index)) == 0:
                        consolidated_sorts.remove(j)

    print('Consolidated Report Size: ', len(consolidated_sorts))

    consolidated_report = {}
    consolidated_val = {}
    for s in consolidated_sorts:
        consolidated_report[s] = master_report_dict[s]
        consolidated_val[s] = master_val_dict[s]
    return consolidated_report, consolidated_val
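For what it's worth, here is a rough sketch of the direction I imagine an answer might take (untested against my real data): instead of comparing every pair, key each report by a hashable fingerprint of its index so duplicates collapse via dictionary lookups in a single pass. `dedupe_by_index` is a hypothetical stand-in for my nested loops, and it assumes the index labels are themselves hashable:

def dedupe_by_index(sorts):
    # Map each distinct index fingerprint to the first report key seen.
    seen = {}
    for key, report in sorts.items():
        index_key = tuple(report.index)  # hashable fingerprint of the index
        if index_key not in seen:
            seen[index_key] = key  # first report with this index wins
    return list(seen.values())

On my dict of sorts, something like this would replace the O(n^2) pairwise comparisons with one pass, but I'd welcome corrections or better approaches.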