
I have a process that collects reports generated throughout the week and consolidates the collection to eliminate identical reports.

I've written a function that identifies identical reports by finding those with identical indices, then excludes all but one of each identical group and moves on. While it works fine for 5,000-10,000 reports, it starts to take a serious amount of time to process, say, 50,000+ reports, which as time goes on will become more and more common.

It would be nice if I could pre-emptively eliminate the duplicate reports and so avoid this step, but the process that generates the reports doesn't allow for that. So, I want to find a way to make this or a similar function more efficient.

The code is below:

def report_diff_index(self,dnc_data,folders):
    master_report_dict, master_val_dict = self.report_orderer(folders)
    sorts = self.report_sorter(dnc_data,master_report_dict)
    keys = [k for k in sorts.keys()]
    consolidated_sorts = keys
    print('Original Report Size: ', len(consolidated_sorts))
    for k in keys:
        if k in consolidated_sorts:
            for j in keys[keys.index(k)+1:]:
                if j in consolidated_sorts:
                    if len(list(set(sorts[k].index).symmetric_difference(sorts[j].index))) == 0:
                        consolidated_sorts.remove(j)
    print('Consolidated Report Size: ', len(consolidated_sorts))
    consolidated_report = {}
    consolidated_val = {}
    for s in consolidated_sorts:
        consolidated_report[s] = master_report_dict[s]
        consolidated_val[s] = master_val_dict[s]
    return consolidated_report, consolidated_val

2 Answers


I do not know if I understand your problem correctly, and even if I do, I do not know whether this is faster, but couldn't you create a dict that uses the unique report index as the key (e.g. via `frozenset`) and the report key as the value? It feels like a quicker way to build the unique list, but I may be off:

def report_diff_index(self, dnc_data, folders):
    master_report_dict, master_val_dict = self.report_orderer(folders)
    sorts = self.report_sorter(dnc_data, master_report_dict)
    print('Original Report Size: ', len(sorts))
    unique_reports = dict()
    for report_key, report in sorts.items():
        key = frozenset(report.index)
        # Alt 1. Replace with the new (identical) report:
        # unique_reports[key] = report_key
        # Alt 2. Keep the first report seen:
        if key not in unique_reports:
            unique_reports[key] = report_key
    consolidated_sorts = unique_reports.values()
    print('Consolidated Report Size: ', len(consolidated_sorts))
    consolidated_report = {}
    consolidated_val = {}
    for s in consolidated_sorts:
        consolidated_report[s] = master_report_dict[s]
        consolidated_val[s] = master_val_dict[s]
    return consolidated_report, consolidated_val

As you can see, there are two options in the dict update, depending on whether you want to keep the first report found or whether it does not matter.

Insertion into a dict should approach O(1), so I would imagine this to be rather quick.
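A minimal standalone sketch of this idea, with plain lists standing in for the DataFrame indices (report names and data here are made up for illustration):

```python
# Toy stand-in for `sorts`: report key -> list of index labels.
# In the real code each value is a pandas DataFrame and `report.index` is used.
sorts = {
    'rep_a': [1, 2, 3],
    'rep_b': [3, 2, 1],  # same labels as rep_a, different order
    'rep_c': [4, 5],
}

unique_reports = {}
for report_key, index_labels in sorts.items():
    key = frozenset(index_labels)    # hashable and order-insensitive
    if key not in unique_reports:    # Alt 2: keep the first report seen
        unique_reports[key] = report_key

print(sorted(unique_reports.values()))  # ['rep_a', 'rep_c']
```

Each report is looked up once by its hashed index, so the whole pass is a single loop instead of the original pairwise comparison.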

  • You cannot use it directly, but any immutable, i.e. hashable, data can be used as a key. `set`s and `list`s are mutable, thus not usable as dict `key`s, however, `frozenset`s and `tuple`s are immutable, and can thus be used. Thus, turning your indices to frozen sets make them usable as keys. – JohanL Jun 27 '17 at 21:34
  • Thanks a lot for the suggestion! I implemented it using tuples as the keys, as brief testing showed their instantiation to be quicker than that of frozen sets, and it reduced the processing time of 100,000+ reports to 4.5 seconds!! Amazing. Thanks a lot! – Jed Jul 04 '17 at 10:56
  • @Dorian821 Glad I could help. As you mention, `tuple`s are quicker to generate than `set`s. Keep in mind, however, that for tuples `(1, 2)` is not the same as `(2, 1)`. Thus the order of the elements is important. That may not be a concern for your current use case, but in some cases it makes the `frozenset` a better option. – JohanL Jul 04 '17 at 16:12
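The order-sensitivity caveat from that last comment can be shown in a couple of lines:

```python
a = (1, 2)
b = (2, 1)
print(a == b)                        # False: tuples compare by position
print(frozenset(a) == frozenset(b))  # True: frozensets ignore order
```

So tuple keys only deduplicate reports whose indices appear in the same order; frozenset keys deduplicate regardless of order.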

Correct me if I am wrong, but it looks like everything in:

consolidated_sorts = keys
print('Original Report Size: ', len(consolidated_sorts))
for k in keys:
    if k in consolidated_sorts:
        for j in keys[keys.index(k)+1:]:
            if j in consolidated_sorts:
                if len(list(set(sorts[k].index).symmetric_difference(sorts[j].index))) == 0:
                    consolidated_sorts.remove(j)

is just about finding unique reports. In fact, much of the iteration is redundant: `consolidated_sorts = keys` does not copy the list, it just creates a second name for the same list object, so you are iterating `keys` while asking whether each element is in the very list it came from.

If you simply want unique keys, you may try something like this:

def report_diff_index(self,dnc_data,folders):
    master_report_dict, master_val_dict = self.report_orderer(folders)
    sorts = self.report_sorter(dnc_data, master_report_dict)

    # New code to create unique set of keys
    unique_keys = set(sorts.keys())

    consolidated_report = {}
    consolidated_val = {}

    for key in unique_keys:
        consolidated_report[key] = master_report_dict[key]
        consolidated_val[key] = master_val_dict[key]

    return consolidated_report, consolidated_val
  • Thanks, but the issue isn't with finding unique keys, but with finding unique indices for the values in the dictionary. This is why I'm comparing values against each other. – Jed Jun 27 '17 at 18:38
  • Are you using an `ordered dict` that you think your dict has indices? Or how is your `dict` structured? Also, I am surprised that this code currently works for you. What do you understand this line to be doing: `if len(list(set(sorts[k].index).symmetric_difference(sorts[j].index))) == 0:`? – jack6e Jun 27 '17 at 19:09
  • @jack6e Not that I am Dorian821 but that line will, however somewhat contrived, compare the two lists of report keys and check for equality, independently of the key order in the lists. I would imagine a simple `set(sorts[k].index) == set(sorts[j].index)` would make more sense, but the result should be the same. – JohanL Jun 27 '17 at 19:48
  • I suppose he is trying to do that, but trying to run that same code returns a TypeError for me, that the `.index` method, as a built-in, is not iterable. I wonder if he is actually trying to do something along the lines of `keys.index(k)` and `keys.index(j)`. Or he could be trying `consolidated_sorts.index(k/j)`, etc. I still cannot get past the idea that OP is trying to find "unique indices for the values in the dictionary" but not "unique keys", as if `dict`s had indices and not keys. Maybe the OP's confusion between lists and dicts is what makes this confusing. – jack6e Jun 27 '17 at 20:04
  • Thanks for the comments, guys. The confusion here seems to derive from the fact that the values of the dict are DataFrames, not lists; consequently, `.index` is called on a pandas DataFrame, as stated in the original post. Granted, this might not be the best approach overall, just what I've got going at the moment. – Jed Jun 27 '17 at 20:59
  • @JohanL comparing the sets is obviously simpler and more intuitive. thanks. I'll try that. but I'm curious if there is a different approach overall. The answer below suggests using the indices as dict keys, but is this possible? – Jed Jun 27 '17 at 21:01
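For reference, the roundabout check in the original code and the simpler set comparison suggested in the comments give the same result (toy lists for illustration):

```python
a = [1, 2, 3]
b = [3, 1, 2]
# Original, roundabout check: an empty symmetric difference
print(len(set(a).symmetric_difference(set(b))) == 0)  # True
# Equivalent, simpler check suggested in the comments
print(set(a) == set(b))  # True
```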