Python: Removing list duplicates based on first 2 inner list values

Question:

I have a list in the following format:

x = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]

The algorithm:

Combine all inner lists that share the same first 2 values; the third value doesn't have to match for them to be combined
- e.g. "hello",0,5 is combined with "hello",0,8
- But not combined with "hello",1,1
The 3rd value becomes the average of the third values: sum(all 3rd vals) / len(all 3rd vals)
- Note: by "all 3rd vals" I mean the 3rd value of each inner list in a group of duplicates
- e.g. "hello",0,5 and "hello",0,8 become "hello",0,6.5

Desired output: (Order of list doesn't matter)

x = [["hello",0,6.5], ["hi",0,6], ["hello",1,1]]

Question:

How can I implement this algorithm in Python?

Ideally it would be efficient as this will be used on very large lists.

If anything is unclear let me know and I will explain.

Edit: I have tried to change the list to a set to remove duplicates, however this doesn't account for the third variable in the inner lists and therefore doesn't work.

Solution Performance:

Thanks to everyone who has provided a solution to this problem! Here are the results based on a speed test of all the functions:

wjandrea · Accepted Answer · 2019-12-07T23:17:04.237

Update using running sum and count

I figured out how to improve my previous code (see original below). You can keep running totals and counts, then compute the averages at the end, which avoids recording all the individual numbers.

from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def calculate(self):
        return self.total / self.count

def func(lst):
    thirds = defaultdict(RunningAverage)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].add(sub[2])
    lst_out = [[*k, v.calculate()] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]

Original answer

This probably won't be very efficient since it has to accumulate all the values to average them. I think you could get around that by having a running average with a weighting factored in, but I'm not quite sure how to do that.

from collections import defaultdict

def avg(nums):
    return sum(nums) / len(nums)

def func(lst):
    thirds = defaultdict(list)
    for sub in lst:
        k = tuple(sub[:2])
        thirds[k].append(sub[2])
    lst_out = [[*k, avg(v)] for k, v in thirds.items()]
    return lst_out

print(func(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
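The "running average with a weighting factored in" mentioned above can be sketched as an incremental mean, where each new value nudges the stored mean by (value - mean) / count. This is only an illustration: the name func_incremental and the [mean, count] layout are assumptions, not part of the original answer, and floating-point rounding may differ slightly from computing sum / count at the end.

```python
from collections import defaultdict

def func_incremental(lst):
    """Average the third values per (first, second) key with an incremental mean."""
    # key -> [current_mean, count]; hypothetical layout chosen for this sketch
    stats = defaultdict(lambda: [0.0, 0])
    for first, second, third in lst:
        entry = stats[(first, second)]
        entry[1] += 1
        # new_mean = old_mean + (value - old_mean) / n
        entry[0] += (third - entry[0]) / entry[1]
    return [[*key, mean] for key, (mean, _count) in stats.items()]

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
print(func_incremental(x))  # -> [['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]
```

Like the list-based version, this makes a single pass over the input, but it stores only two numbers per key instead of every value.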

edited Dec 07 '19 at 23:17

answered Dec 07 '19 at 20:02

wjandrea

Awesome, thanks! I will accept it soon if it remains the most efficient solution :) – Rlz Dec 07 '19 at 20:05
I just profiled both your new and original code, and the original still seems slightly faster? (Compared to the others, the original was the fastest :) – Rlz Dec 08 '19 at 00:15
@Ruler Huh, I guess there's a lot of lookups in the new one, but I'm surprised it was slower. How big was the dataset? – wjandrea Dec 08 '19 at 00:26
@Ruler I guess it also matters what the dataset looks like cause the old one is probably better for broader sets (more keys, fewer values) but the newer one is better for narrower sets (fewer keys, more values) – wjandrea Dec 08 '19 at 00:29
I'll try it with some different datasets and see what I get, I'm trying between 1-100000 values in the dataset – Rlz Dec 08 '19 at 00:33
Just tried two datasets: one with a merge every 1000 items and one with a merge every 2 items. Your original was faster in both scenarios – Rlz Dec 08 '19 at 00:39
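For anyone who wants to repeat this kind of comparison, here is a rough timeit sketch. The function names, the make_data helper, and the dataset sizes are assumptions made up for illustration, not the setup used in the comments above; timings will vary with the machine and with how many values share each key.

```python
import random
import timeit
from collections import defaultdict

class RunningAverage:
    def __init__(self):
        self.total = 0
        self.count = 0

    def add(self, value):
        self.total += value
        self.count += 1

    def calculate(self):
        return self.total / self.count

def avg_from_lists(lst):
    # Original approach: collect every third value, average at the end
    thirds = defaultdict(list)
    for sub in lst:
        thirds[tuple(sub[:2])].append(sub[2])
    return [[*k, sum(v) / len(v)] for k, v in thirds.items()]

def avg_from_running_totals(lst):
    # Updated approach: keep only a running total and count per key
    thirds = defaultdict(RunningAverage)
    for sub in lst:
        thirds[tuple(sub[:2])].add(sub[2])
    return [[*k, v.calculate()] for k, v in thirds.items()]

def make_data(n_items, n_keys, seed=0):
    # Hypothetical generator: fewer keys means more merges per key
    rng = random.Random(seed)
    return [["k%d" % rng.randrange(n_keys), 0, rng.randrange(100)]
            for _ in range(n_items)]

data = make_data(100_000, 1_000)
for fn in (avg_from_lists, avg_from_running_totals):
    seconds = timeit.timeit(lambda: fn(data), number=5)
    print(f"{fn.__name__}: {seconds / 5:.4f} s per call")
```

Changing n_keys relative to n_items is one way to test the "broad vs. narrow dataset" guess from the comments.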

score 2 · Answer 2 · answered Dec 07 '19 at 20:05

You can try using groupby.

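from itertools import groupby

m = [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]

# groupby only merges adjacent items, so sort on the first two values first.
# A tuple key avoids collisions that string concatenation could cause
# (e.g. "a1" + "0" and "a" + "10" would both give "a10").
m.sort(key=lambda x: (x[0], x[1]))

for key, group in groupby(m, lambda x: (x[0], x[1])):
    ss = 0  # running sum of the third values
    c = 0   # number of items in this group
    for k in group:
        ss += k[2]
        c += 1
    print([key[0], key[1], ss / c])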

answered Dec 07 '19 at 20:05

vks

Would this work better by changing print to yield and making a function? The list can then be obtained by `x = list(func(x))` – Rlz Dec 07 '19 at 20:10
@RulerOfTheWorld I will leave that to you :) You at least have 2 algos now :) – vks Dec 07 '19 at 20:11
Instead of lambdas as the sort keys, you could use `operator.itemgetter(slice(2))` – wjandrea Dec 07 '19 at 20:31
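The two suggestions in these comments (yield instead of print, and operator.itemgetter(slice(2)) as the key) could be combined roughly as follows. The function name averaged is made up for this sketch, and it assumes the first two values of every inner list are mutually comparable so the rows can be sorted.

```python
from itertools import groupby
from operator import itemgetter

def averaged(rows):
    """Yield [first, second, average_of_thirds] for each distinct first-two pair."""
    first_two = itemgetter(slice(2))  # equivalent to lambda row: row[:2]
    # sorted() leaves the caller's list untouched; groupby needs sorted input
    for key, group in groupby(sorted(rows, key=first_two), key=first_two):
        thirds = [row[2] for row in group]
        yield [*key, sum(thirds) / len(thirds)]

x = [["hello", 0, 5], ["hi", 0, 6], ["hello", 0, 8], ["hello", 1, 1]]
print(list(averaged(x)))  # -> [['hello', 0, 6.5], ['hello', 1, 1.0], ['hi', 0, 6.0]]
```

Because of the sort, the output comes back ordered by key rather than by first appearance, which is fine here since the question says order doesn't matter.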

score 2 · Answer 3 · answered Dec 07 '19 at 20:29

This should be O(N), someone correct me if I'm wrong:

def my_algorithm(input_list):
    """
    :param input_list: list of lists in format [string, int, int]
    :return: list
    """

    # Dict in format (string, int): [sum_of_ints, count]
    # Our input list looks like this, for example:
    # [["hello",0,5], ["hi",0,6], ["hello",0,8], ["hello",1,1]]
    # The dict keys are tuples of the first 2 values of each sublist (since those need to be unique),
    # while each value is a list holding the running sum of the third elements plus a counter
    # (incremented every time the key is seen, so we can divide by it to get the average).
    my_dict = {}
    for element in input_list:
        # key is a tuple of the first 2 values of each sublist
        key = (element[0], element[1])
        if key not in my_dict:
            # If the key does not exist yet, add it.
            # The value is the third element of the sublist plus a counter. Since this is the first value, set the counter to 1
            my_dict[key] = [element[2], 1]
        else:
            # If the key does exist, add to the running sum and increment the counter by 1
            my_dict[key][0] += element[2]
            my_dict[key][1] += 1

    # we have a dict so we will need to convert it to list (and on the way calculate averages)
    return _convert_my_dict_to_list(my_dict)


def _convert_my_dict_to_list(my_dict):
    """
    :param my_dict: dict, key is in form of tuple (string, int) and values are in form of list [int, int_counter]
    :return: list
    """
    my_list = []
    for key, value in my_dict.items():
        sublist = [key[0], key[1], value[0]/value[1]]
        my_list.append(sublist)
    return my_list

my_algorithm(x)

This will return:

[['hello', 0, 6.5], ['hi', 0, 6.0], ['hello', 1, 1.0]]

While your expected return is:

[["hello", 0, 6.5], ["hi", 0, 6], ["hello", 1, 1]]

If you really need ints, then you can modify the _convert_my_dict_to_list function.
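One possible modification, as a sketch only: turn an average back into an int when it is a whole number, using float.is_integer(). The name convert_with_ints is hypothetical, and it assumes the dict has the same (string, int): [sum, count] layout as above.

```python
def convert_with_ints(my_dict):
    """Like _convert_my_dict_to_list, but whole-number averages become ints."""
    my_list = []
    for (name, number), (total, count) in my_dict.items():
        average = total / count  # true division always gives a float
        if average.is_integer():
            average = int(average)
        my_list.append([name, number, average])
    return my_list

# Same shape as the dict built inside my_algorithm for the example input
totals = {("hello", 0): [13, 2], ("hi", 0): [6, 1], ("hello", 1): [1, 1]}
print(convert_with_ints(totals))  # -> [['hello', 0, 6.5], ['hi', 0, 6], ['hello', 1, 1]]
```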

cdlane · Answer 4 · 2019-12-07T21:04:01.817

Here's my variation on this theme: a groupby sans the expensive sort. I also changed the problem to make the input and output a list of tuples as these are fixed-size records:

from itertools import groupby
from operator import itemgetter
from collections import defaultdict

data = [("hello", 0, 5), ("hi", 0, 6), ("hello", 0, 8), ("hello", 1, 1)]

dictionary = defaultdict(complex)

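# Without sorting, groupby only merges adjacent records, so the same key can
# show up in several groups; the dictionary accumulates across all of them.
for key, group in groupby(data, itemgetter(slice(2))):
    values = [value for (string, number, value) in group]
    # real part tracks the running total, imaginary part tracks the count
    dictionary[key] += complex(sum(values), len(values))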

array = [(*key, value.real / value.imag) for key, value in dictionary.items()]

print(array)

OUTPUT

> python3 test.py
[('hello', 0, 6.5), ('hi', 0, 6.0), ('hello', 1, 1.0)]
>

Thanks to @wjandrea for the itemgetter replacement for lambda. (And yes, I am using complex numbers in passing for the average to track the total and count.)
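To make the complex-number trick concrete, here is a tiny standalone illustration (not taken from the answer itself): the real part accumulates the total, the imaginary part accumulates the count. It assumes the values being averaged are ordinary real numbers.

```python
acc = complex()  # starts at 0j: total 0, count 0
for value in (5, 8):
    acc += value + 1j  # add the value to the total and 1 to the count

print(acc)                  # -> (13+2j)
print(acc.real / acc.imag)  # -> 6.5
```

It is compact, but a small class or a [total, count] list is arguably easier to read, and both parts of a complex number are floats, so very large integer totals can lose precision.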
