2

I have a dataframe that looks like this:

    start   stop   duration
0   1       2      1
1   3       4      2
2   2       1      2
3   4       3      0

I'm trying to build a dictionary with key= (start, stop) pairs and the value= avg of their duration, regardless of the order. In other words, (1,2) and (2,1) would both count as an occurrence of the pair (1,2).

Desired output: dict_avg= {(1,2):1.5, (3,4):1}

What's the best way to achieve this?

Caerus
  • 674
  • 1
  • 8
  • 19

4 Answers4

3

Here's one possible approach by taking a tuple of the sorted pairs:

>>> grp = df[['start', 'stop']].apply(lambda x: tuple(sorted(x)), axis=1)
>>> df.groupby(grp)['duration'].mean().to_dict()
{(1, 2): 1.5, (3, 4): 1.0}

As a disclaimer, I can almost guarantee this will be significantly slower than the NumPy-sort given here, as using a lambda within .apply() (and needing to use sorted() + tuple() constructor) takes each call in the Python space, rather than letting it be done in Cython/C as you can ideally do via Pandas/NumPy.

Brad Solomon
  • 38,521
  • 31
  • 149
  • 235
3

Using frozenset not fast but neat , if you looking for efficient way check link

df.groupby(df[['start','stop']].apply(frozenset,1).map(tuple)).duration.mean().to_dict()
Out[1048]: {(1, 2): 1.5, (3, 4): 1.0}
BENY
  • 317,841
  • 20
  • 164
  • 234
3

This is also one way:

# sort data based on first two columns
df.iloc[:,:2].values.sort()

# create the dict of mean
df.groupby(['start','stop'])['duration'].mean().to_dict()

{(1, 2): 1.5, (3, 4): 1.0}
YOLO
  • 20,181
  • 5
  • 20
  • 40
3

defaultdict

from collections import defaultdict

m = defaultdict(list)

for *t, d in zip(*map(df.get, df)):
    m[tuple({*t})].append(d)

{k: sum(v) / len(v) for k, v in m.items()}

{(1, 2): 1.5, (3, 4): 1.0}
piRSquared
  • 285,575
  • 57
  • 475
  • 624