
I am trying to find subsets (of any length) of attributes (columns) whose value combinations are unique in a given dataset. To the best of my knowledge, the most efficient way to find them is to run many groupby operations in pandas and count the corresponding group sizes. Since the loop can become quite large, what is the most efficient way to speed up these many groupby tasks on the same dataset?

import numpy as np
import pandas as pd

groups = [["a","b"],["a","b","c"],["c","a"]] # this can be really large
df     = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
for g in groups:
    r = df.groupby(g, sort=False).size().reset_index().rename(columns={0:'count'})
    if r.loc[r['count']==1]['count'].count() > 0:
        pass  # do something
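
For comparison within pandas itself, the same per-subset check can be written without materializing the full count table, using duplicated to flag value combinations that occur exactly once. A minimal sketch of that variant, assuming only the existence of at least one unique combination matters per subset:

import numpy as np
import pandas as pd

groups = [["a","b"],["a","b","c"],["c","a"]] # this can be really large
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]), columns=['a', 'b', 'c'])
for g in groups:
    # keep=False marks every row of a repeated combination, so the negation
    # flags combinations that appear exactly once in the dataset
    if (~df.duplicated(subset=g, keep=False)).any():
        pass  # do something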
Reacher234

2 Answers


You can use RAPIDS cudf for fast processing of groupbys on large datasets. I'm working on a more nuanced answer to your real question about efficiency, but I am spending more time trying to scale out your example than solving for it. I wouldn't mind your help on that, by sending me a larger sample. By the way, how large is your dataset?

However, as a quick intermediate answer, simply swapping in cudf for pandas produced a 13x speedup for a financial institution with a similar problem, at scale. Here's how you do it with your code:

import cudf

groups = [["a","b"],["a","b","c"],["c","a"]] # this can be really large
df     = cudf.DataFrame({'a':[1, 2, 3], 'b': [4, 5, 6], 'c':[7, 8, 9]}) # needs to be a dict or one of the cudf-compatible data formats
for g in groups:
    r = df.groupby(g, sort=False).size().reset_index().rename(columns={0:'count'})
    if r.loc[r['count']==1]['count'].count() > 0:
        print(r)
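
To see what the swap buys you on your own data, one option is to time the same loop under both libraries. Below is a minimal timing sketch on synthetic data; the helper name count_subsets_with_unique_rows and the random frame size are illustrative placeholders, not from the original question, and it assumes both pandas and cudf are installed:

import time

import numpy as np
import pandas as pd
import cudf

groups = [["a","b"],["a","b","c"],["c","a"]]

def count_subsets_with_unique_rows(df):
    # illustrative helper: same check as the loop above, returning how many
    # column subsets contain at least one unique value combination
    hits = 0
    for g in groups:
        r = df.groupby(g, sort=False).size().reset_index().rename(columns={0:'count'})
        if r.loc[r['count']==1]['count'].count() > 0:
            hits += 1
    return hits

# synthetic data just for timing; size is arbitrary
pdf = pd.DataFrame(np.random.randint(0, 100, size=(1_000_000, 3)), columns=['a', 'b', 'c'])
gdf = cudf.DataFrame.from_pandas(pdf)

for name, frame in [("pandas", pdf), ("cudf", gdf)]:
    start = time.perf_counter()
    count_subsets_with_unique_rows(frame)
    print(f"{name}: {time.perf_counter() - start:.3f}s")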

I'll update my answer once I get that additional information.

TaureanDyerNV

Depending on the size of your dataset, it might be better to use the dask-cudf library instead of the cudf library. Please note that for small datasets, dask-cudf will take more time for the above process. Here is the dask-cudf implementation of the above snippet of code:

import cudf
import dask_cudf

groups = [["a","b"],["a","b","c"],["c","a"]] # this can be really large
df = cudf.DataFrame({'a':[1, 2, 1], 'b': [4, 5, 4], 'c':[7, 8, 9]})
ddf = dask_cudf.from_cudf(df, npartitions=2) # value for npartitions can be changed to see speed up
for g in groups:
    r = ddf.groupby(g, sort=False).size().reset_index().rename(columns={0:'count'})
    num_non_repeated_values = r.loc[r['count']==1]['count'].count().compute()
    if num_non_repeated_values > 0:
        print(num_non_repeated_values)
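
If the list of column subsets is very large, another lever besides npartitions is to avoid calling compute() once per subset: build all the lazy results first and evaluate them with a single dask.compute call, so the independent groupbys can be scheduled together. A minimal sketch of that pattern, assuming the per-subset counts fit comfortably in host memory:

import cudf
import dask
import dask_cudf

groups = [["a","b"],["a","b","c"],["c","a"]] # this can be really large
df = cudf.DataFrame({'a':[1, 2, 1], 'b': [4, 5, 4], 'c':[7, 8, 9]})
ddf = dask_cudf.from_cudf(df, npartitions=2)

# build one lazy scalar per column subset without triggering any computation yet
lazy_counts = []
for g in groups:
    r = ddf.groupby(g, sort=False).size().reset_index().rename(columns={0:'count'})
    lazy_counts.append(r.loc[r['count']==1]['count'].count())

# evaluate all pending groupbys in a single call
for g, num_non_repeated_values in zip(groups, dask.compute(*lazy_counts)):
    if num_non_repeated_values > 0:
        print(g, num_non_repeated_values)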

It would really help me find a more efficient algorithm for the above code if you could provide more information about what you want to do with the resulting r or num_non_repeated_values.

saloni