
This question is not exactly the greedy set cover problem, but it shares the same idea.

Given a pandas DataFrame df with one column df['s'], where each element is a set of keys into df2:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.array([set([1,3,5]), set([1,3,5,6]), set([2,3,4,12]), set([1,3,7]), set([1,15,11]), set([1,16]), set([16])]),columns=['s'])
>>> df
                    s
0      set([1, 3, 5])
1   set([1, 3, 5, 6])
2  set([12, 2, 3, 4])
3      set([1, 3, 7])
4    set([1, 11, 15])
5        set([1, 16])
6           set([16])
        ...

>>> df2 = pd.DataFrame(np.array([[1,2,3,3,3,6,4,8,9,10,11,12,13,14,15,16,5,7],[2.,1.,3.,2.,1.,2.,3.,1.,1.,1.,1.,1.,1.,1.,1.,16.,1.,1.]]).T,columns=['key', 'value'])
>>> df2
    key  value
0     1      2
1     2      1
2     3      3
3     3      2
4     3      1
5     6      2
6     4      3
7     8      1
8     9      1
9    10      1
10   11      1
11   12      1
12   13      1
13   14      1
14   15      1
15   16     16
16    5      1
17    7      1

    ...

The dataframe df2 above can contain duplicate keys; we keep the last occurrence. For example, the key 3 appears three times above, and we pick its last value, 1.0.
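One way to enforce this rule before any lookups (a sketch, assuming a reasonably recent pandas; the same trick appears in the answer below) is:

>>> weights = df2.drop_duplicates('key', keep='last').set_index('key')['value']
>>> weights.loc[3]
1.0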

I want to find the top six rows of df['s'] that maximize the summation of the values of their corresponding keys, and to sort the rows of the resulting dataframe by their value contribution. What is the fastest way to do this?

For the given data set above, the first two rows of the result dataframe should be

df3:
    set([1,16])
    set([12,2,3,4])
    ...

The second row is not set([16]): the key 16 is already contained in set([1, 16]), so set([16]) adds zero value. The rows are sorted by the summation of the values of the keys that each set contributes.
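Since set([16]) is dropped for adding nothing new, this is really a greedy selection on marginal value, in the spirit of greedy set cover. Below is a minimal sketch of that loop, written for illustration only (it is not from any answer here, and the helper names gains, covered, and picks are mine):

import numpy as np
import pandas as pd

sets = [set([1, 3, 5]), set([1, 3, 5, 6]), set([2, 3, 4, 12]),
        set([1, 3, 7]), set([1, 15, 11]), set([1, 16]), set([16])]
df = pd.DataFrame({'s': sets})
df2 = pd.DataFrame({'key':   [1, 2, 3, 3, 3, 6, 4, 8, 9, 10, 11, 12, 13, 14, 15, 16, 5, 7],
                    'value': [2., 1., 3., 2., 1., 2., 3., 1., 1., 1., 1., 1., 1., 1., 1., 16., 1., 1.]})

# Deduplicate keys, keeping the last value as described above.
weights = df2.drop_duplicates('key', keep='last').set_index('key')['value']

covered = set()   # keys already claimed by earlier picks
picks = []        # row indices chosen so far
for _ in range(6):
    # Marginal gain of each set: sum of values over keys not yet covered.
    gains = df['s'].apply(lambda s: weights.reindex(list(s - covered)).fillna(0).sum())
    gains[df.index.isin(picks)] = -np.inf   # never pick a row twice
    best = gains.idxmax()
    if gains[best] <= 0:                    # nothing left to gain
        break
    picks.append(best)
    covered |= df.loc[best, 's']

print(df.loc[picks])   # first two picks: set([1, 16]), then set([2, 3, 4, 12])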

Update:

To keep this problem simple, let's assume df2 contains only unique keys; duplicate keys can easily be handled with Andrew's drop_duplicates trick.

Rex
  • Do you have a reasonable bound on the key values, e.g. 1..n? Since then this would seem to reduce to some basic linear algebra, which knowing pandas/numpy might be the fastest way to do this. You could have a len(df1['s']) x n matrix to represent the sets in df1['s'] and then an n-length vector to represent df2. Constructing the sets matrix might be annoying but for the df2 'weights' vector you want something like df2.drop_duplicates('key', take_last=True). – Andrew Rosenfeld Sep 13 '15 at 22:58
  • The keys are some unknown digits. It should treat them as string, since a key can be "0001". – Rex Sep 13 '15 at 23:03
  • Okay, do you have a bound on the number of distinct keys? What do you expect the rough dimensions to be of df1 and df2? – Andrew Rosenfeld Sep 15 '15 at 07:46
  • The rough dimension of df1 and df2 are about 10k rows. – Rex Sep 15 '15 at 18:23

1 Answer


Assuming you don't have too many keys, you could represent your list of sets as a sparse matrix, with a column for each key.

In [29]: df = pd.DataFrame([{1:1,3:1,5:1}, {1:1,3:1,5:1,6:1}, {2:1,3:1,4:1,12:1}, {1:1,3:1,7:1}, {1:1,15:1,11:1}, {9:1}, {16:1}]).fillna(0)

In [30]: df
Out[30]: 
   1   2   3   4   5   6   7   9   11  12  15  16
0   1   0   1   0   1   0   0   0   0   0   0   0
1   1   0   1   0   1   1   0   0   0   0   0   0
2   0   1   1   1   0   0   0   0   0   1   0   0
3   1   0   1   0   0   0   1   0   0   0   0   0
4   1   0   0   0   0   0   0   0   1   0   1   0
5   0   0   0   0   0   0   0   1   0   0   0   0
6   0   0   0   0   0   0   0   0   0   0   0   1

And then represent your weights as a Series, indexed by key:

In [37]: weights = df2.drop_duplicates('key', keep='last').set_index('key')['value']

Then weight and sum your sets:

In [40]: totals = (df * weights).sum(axis=1)

In [41]: totals
Out[41]: 
0     4
1     6
2     6
3     4
4     4
5     1
6    16
dtype: float64

And then just find the top 6 rows (Series.order has since been removed from pandas; sort_values is the modern spelling, and totals.nlargest(6) is equivalent):

In [55]: top6 = totals.sort_values(ascending=False).head(6)

In [56]: top6
Out[56]: 
6    16
2     6
1     6
4     4
3     4
0     4
dtype: float64

You can use the indices back into the sparse matrix to recover which sets these were (.ix has been removed from pandas; .loc does the job):

In [58]: df.loc[top6.index]
Out[58]: 
   1   2   3   4   5   6   7   9   11  12  15  16
6   0   0   0   0   0   0   0   0   0   0   0   1
2   0   1   1   1   0   0   0   0   0   1   0   0
1   1   0   1   0   1   1   0   0   0   0   0   0
4   1   0   0   0   0   0   0   0   1   0   1   0
3   1   0   1   0   0   0   1   0   0   0   0   0
0   1   0   1   0   1   0   0   0   0   0   0   0

You might not like this approach, but I would point out that dataframes whose elements are data structures like sets, rather than primitives, aren't particularly pandas-ish, so some translation of the problem is advised.
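If your data really does arrive as a column of sets, one way to build the indicator matrix is a dict comprehension per row. This is a sketch of my own (sets_df stands in for the question's original frame of sets):

# sets_df is hypothetical: a frame whose column 's' holds Python sets
mat = pd.DataFrame([{k: 1 for k in s} for s in sets_df['s']]).fillna(0).astype(int)
totals = (mat * weights).sum(axis=1)   # then proceed exactly as above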

Andrew Rosenfeld