
A previous post covered modifying a function to return a DataFrame with specified values, and I would like to modify the output further. The current function, and its vectorized version, subtracts every combination of columns from each other and returns the relevant data accordingly.

Example and test data:

import pandas as pd
import numpy as np
from itertools import combinations

df2 = pd.DataFrame(
       {'AAA' : [80,5,6], 
        'BBB' : [85,20,30],
        'CCC' : [100,50,25],
        'DDD' : [98,50,25],
        'EEE' : [103,50,25],
        'FFF' : [105,50,25],
        'GGG' : [109,50,25]});

df2

   AAA  BBB  CCC  DDD  EEE  FFF  GGG
0   80   85  100   98  103  105  109
1    5   20   50   50   50   50   50
2    6   30   25   25   25   25   25

v = df2.values
df3 = df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)

df3

    AAA   BBB  CCC  DDD  EEE  FFF  GGG
0  80.0  85.0  100   98  103  105  109
1   NaN   NaN   50   50   50   50   50
2   NaN  30.0   25   25   25   25   25

All values within the threshold (5 here) of at least one other value in their row are kept, on a per-row basis, via np.abs(...) <= 5.
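To make the masking logic visible, here is a small sketch (same df2 as above; `diff` and `counts` are names introduced here) that exposes the intermediate neighbour counts:

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame(
    {'AAA': [80, 5, 6],
     'BBB': [85, 20, 30],
     'CCC': [100, 50, 25],
     'DDD': [98, 50, 25],
     'EEE': [103, 50, 25],
     'FFF': [105, 50, 25],
     'GGG': [109, 50, 25]})

v = df2.values
# shape (3, 7, 7): |difference| of every pair of values within each row
diff = np.abs(v[:, :, None] - v[:, None])
# per value: how many values in its row (itself included) lie within 5
counts = (diff <= 5).sum(-1)
```

A value survives the mask exactly when its count exceeds 1, i.e. when at least one *other* value in the same row is within the threshold.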

What needs to change?

On the first row of df3 there are two clusters of values within the threshold: (80, 85) and (98, 100, 103, 105, 109). They are all valid, but they form two separate groups that are not within the threshold of each other. I would like to be able to separate these values based on another threshold value.

I have attempted to demonstrate what I am looking for with the following (flawed) code, and I include it only to show that I'm attempting to progress this myself:

thresh = 5
df3.mask(df3.apply(lambda x : x >= df3.T.max() \
                   - (thresh * 3))).dropna(thresh=2).dropna(axis=1)


    AAA   BBB
0  80.0  85.0

df3.mask(~df3.apply(lambda x : x >= df3.T.max() - (thresh * 3))).dropna(axis=1)


    CCC DDD EEE FFF GGG
0   100 98  103 105 109
1   50  50  50  50  50
2   25  25  25  25  25

So the output is close to the desired result, but the way I got there is not so nice...

Desired output:

I have used multiple rows to demonstrate, but when I use this code only a single row will need to be output and split. So the desired output is the separate column groups, as shown in this example for row 0.

   CCC  DDD  EEE  FFF  GGG
0  100   98  103  105  109

and

    AAA   BBB
0  80.0  85.0
3 Answers


Well, I think you can try to solve your problem differently. The idea is to find the 'gaps and islands' within each row and label each group.

So, first, put your columns into rows and sort the values within each initial row index (sort_index replaces the now-deprecated sortlevel):

>>> df = df2.stack().sort_values().sort_index(level=0, sort_remaining=False)
>>> df
0  AAA     80
   BBB     85
   DDD     98
   CCC    100
   EEE    103
   FFF    105
   GGG    109
1  AAA      5
   BBB     20
   GGG     50
   FFF     50
   DDD     50
   CCC     50
   EEE     50
2  AAA      6
   GGG     25
   EEE     25
   DDD     25
   CCC     25
   FFF     25
   BBB     30

Next, create a new DataFrame holding the 'prev' values alongside the current values:

>>> df = df2.stack().sort_values().sort_index(level=0, sort_remaining=False)
>>> df = pd.concat([df, df.groupby(level=0).shift(1)], axis=1)
>>> df.columns = ['cur', 'prev']
>>> df
       cur   prev
0 AAA   80    NaN
  BBB   85   80.0
  DDD   98   85.0
  CCC  100   98.0
  EEE  103  100.0
  FFF  105  103.0
  GGG  109  105.0
1 AAA    5    NaN
  BBB   20    5.0
  GGG   50   20.0
  FFF   50   50.0
  DDD   50   50.0
  CCC   50   50.0
  EEE   50   50.0
2 AAA    6    NaN
  GGG   25    6.0
  EEE   25   25.0
  DDD   25   25.0
  CCC   25   25.0
  FFF   25   25.0
  BBB   30   25.0

And now, create the island labels (thresh is 5, as in the question):

>>> thresh = 5
>>> df = (df['cur'] - df['prev'] > thresh).astype('int')
>>> df
0  AAA    0
   BBB    0
   DDD    1
   CCC    0
   EEE    0
   FFF    0
   GGG    0
1  AAA    0
   BBB    1
   GGG    1
   FFF    0
   DDD    0
   CCC    0
   EEE    0
2  AAA    0
   GGG    1
   EEE    0
   DDD    0
   CCC    0
   FFF    0
   BBB    0

>>> df.groupby(level=0).cumsum().unstack()
   AAA  BBB  CCC  DDD  EEE  FFF  GGG
0    0    0    1    1    1    1    1
1    0    1    2    2    2    2    2
2    0    1    1    1    1    1    1

Now you can filter out groups which have only one member and you're done :)

>>> dfm = df.groupby(level=0).cumsum().unstack()
>>> dfm
   AAA  BBB  CCC  DDD  EEE  FFF  GGG
0    0    0    1    1    1    1    1
1    0    1    2    2    2    2    2
2    0    1    1    1    1    1    1

>>> df2[dfm == 0].loc[0:0].dropna(axis=1)
   AAA   BBB
0   80  85.0
>>> df2[dfm == 1].loc[0:0].dropna(axis=1)
     CCC   DDD    EEE    FFF    GGG
0  100.0  98.0  103.0  105.0  109.0
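For an arbitrary number of islands, the same labels can be looped over rather than tested one by one. A self-contained sketch under the same idea (the pipeline is rebuilt here with sort_index, the modern spelling of sortlevel; `pieces` is a name introduced for illustration):

```python
import pandas as pd

df2 = pd.DataFrame(
    {'AAA': [80, 5, 6],
     'BBB': [85, 20, 30],
     'CCC': [100, 50, 25],
     'DDD': [98, 50, 25],
     'EEE': [103, 50, 25],
     'FFF': [105, 50, 25],
     'GGG': [109, 50, 25]})
thresh = 5

# stack, sort within each original row, and label islands by cumulative jumps
s = df2.stack().sort_values().sort_index(level=0, sort_remaining=False)
prev = s.groupby(level=0).shift(1)
dfm = (s - prev > thresh).astype(int).groupby(level=0).cumsum().unstack()

# one frame per island label present in row 0
pieces = [df2[dfm == lbl].loc[0:0].dropna(axis=1)
          for lbl in sorted(dfm.loc[0].unique())]
```

This avoids hand-writing `dfm == 0`, `dfm == 1`, ... for each group.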
Roman Pekar
  • Hi @Roman Pekar. Thanks for looking. Can you please show me how this can be used to give the desired output shown at the bottom of the post? – nipy Dec 29 '16 at 23:33
  • It looks very interesting. Does this shift of the index and Island creation via `df['cur'] - df['prev'] > thresh` cover all combinations even if the dataset is larger? Does it matter that one of the values is lost when we shift prev ('30' from 'BBB'). It looks great, just trying to understand how this works with my simple brain :-) – nipy Dec 30 '16 at 00:47
  • Yes, it should cover all combinations. The idea here is to get 1 for every 'big jump' row (where the diff with the previous value is bigger than the threshold) and 0 for other rows, and then just cumulatively sum them up. In the end you'll have a label for each group – Roman Pekar Dec 30 '16 at 00:51
  • Those NaN values for the first rows are OK – you never have 1 there, so the smallest value will always have label = 0 – Roman Pekar Dec 30 '16 at 00:53
  • So if we have 5 groups it is a case of testing for '[dfm == 3]' and '[dfm == 4]' etc? – nipy Dec 30 '16 at 00:56
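A minimal one-row sketch of the labelling idea discussed in these comments (values pre-sorted, threshold assumed to be 5; `row` and `labels` are names introduced here):

```python
import pandas as pd

row = pd.Series([80, 85, 98, 100, 103, 105, 109])  # row 0, already sorted
thresh = 5

# 1 wherever the jump from the previous value exceeds thresh, else 0;
# the cumulative sum of those flags is the island label
labels = (row.diff() > thresh).cumsum()
```

Here the single jump of 13 (85 to 98) splits the row into labels 0 and 1.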

method 1
I copied and pasted this from the previous question, including the minor change.


I vectorized and embedded your closeCols for some mind-numbing fun.
Notice there is no apply:

  • numpy broadcasting to get all combinations of columns subtracted from each other.
  • np.abs
  • <= 5
  • sum(-1): I arranged the broadcasting such that the differences of, say, row 0, column AAA with all of row 0 are laid out across the last dimension. The -1 in sum(-1) says to sum across that last dimension.
  • <= 1: every value is within 5 of itself, so a value with at least one neighbour has a count greater than 1. Thus, we mask everything whose count is less than or equal to one.

df2 = pd.DataFrame(
       {'AAA' : [80,5,6], 
        'BBB' : [85,20,30],
        'CCC' : [100,50,25],
        'DDD' : [98,50,25],
        'EEE' : [103,50,25],
        'FFF' : [105,50,25],
        'GGG' : [109,50,25]});

v = df2.values

# let x be the distance threshold
# let k be the cluster size threshold
x, k = 5, 2  #  a cluster must have more than k members to survive
df3 = df2.mask((np.abs(v[:, :, None] - v[:, None]) <= x).sum(-1) <= k)
# note that this is the same as before, but k = 1 was hard-coded

print(df3)

   AAA   BBB  CCC  DDD  EEE  FFF   GGG
0  NaN   NaN  100   98  103  105   NaN
1  NaN   NaN   50   50   50   50  50.0
2  NaN  30.0   25   25   25   25  25.0
piRSquared
  • Thanks @piRSquared. I marked the answer up as it will be very useful when looking for certain cluster sizes. – nipy Dec 30 '16 at 09:22
  • @adele this question has me spinning down a rabbit hole. I've been formulating a sophisticated and scalable solution. Eventually, I'll be able to identify clusters with this approach. When I do, I'll update this answer. – piRSquared Dec 30 '16 at 09:32
  • Very kind of you to look into this @piRSquared – nipy Dec 30 '16 at 09:45
  • @adele it isn't all altruism. I love interesting problems. This will also be very useful for me. But I'm glad it might help others as well. – piRSquared Dec 30 '16 at 09:47
  • @adele [this question and answer are the beginnings of it](http://stackoverflow.com/q/41394595/2336654) – piRSquared Dec 30 '16 at 10:27
  • The linked post was way beyond my current level of understanding but great to see the additional input. – nipy Dec 30 '16 at 13:11
  • Hi @piRSquared are you still spinning down the Rabbit hole :-) – nipy Jan 02 '17 at 09:13
  • @adele sort of. I'm messing with the idea off and on as I'm managing Holidays for my 2 and 3 year old and getting quick fixes with other questions. My issue at the moment is identifying the clusters. I haven't put as much effort into it as say... hats, but I've done enough to identify that what I thought I had as a good solution needs more work. I still think I have something, I just wouldn't expect me to get to it anytime soon. – piRSquared Jan 02 '17 at 09:18
  • No expectations @piRSquared Sir. Hope you enjoy the rest of your holidays :-) – nipy Jan 02 '17 at 09:34

I felt this was deserving of a separate answer.

I wrote a clustering function that operates on one-dimensional arrays. I know how to vectorize it further to two dimensions, but I haven't gotten to it yet. As it is, I use np.apply_along_axis.

This function is described in this answer to this question. I encourage you to follow the links and see the work that went into getting this seemingly simple function.

What it does is find the clusters within an array, defined by margins to the left and right of every point. It sorts, then clusters, then unsorts.

delta clustering function

def delta_cluster(a, dleft, dright):
    s = a.argsort()
    y = s.argsort()  # inverse permutation, used to unsort at the end
    a = a[s]
    rng = np.arange(len(a))

    # a sorted value opens a cluster if no earlier value lies within dleft of it
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng

    # ... and if the previous value's right margin (dright) does not reach it
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng

    # cumulative count of cluster openings labels each point; unsort with y
    return (starts & ends).cumsum()[y]
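As a quick sanity check, here it is run on the first row's values (the function is repeated in compressed form so the snippet runs on its own; margins of 10 match the usage later in this answer):

```python
import numpy as np

def delta_cluster(a, dleft, dright):
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    starts = a.searchsorted(a - dleft) == rng
    ends = np.append(0, a.searchsorted(a + dright, side='right')[:-1]) == rng
    return (starts & ends).cumsum()[y]

row0 = np.array([80, 85, 100, 98, 103, 105, 109])
labels = delta_cluster(row0, 10, 10)
# (80, 85) get one label; (100, 98, 103, 105, 109) get another
```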

Onto the problem at hand

Use the cluster function on each row of df2 with np.apply_along_axis, and construct a DataFrame named clusters that mirrors the index and columns of df2. Then stack to get a Series, which makes the later manipulation easier.

clusters = pd.DataFrame(
    np.apply_along_axis(delta_cluster, 1, df2.values, 10, 10),
    df2.index, df2.columns).stack()

This describes the next block of code.

  • I need to keep the row information of df2 when I do a groupby.
  • Use transform to get the size of clusters for each row.
  • stack the values of df2 and append the cluster values as part of the index. This enables the separation you are looking for.
  • mask val where size is equal to 1. These are singleton clusters.

lvl0 = clusters.index.get_level_values(0)
size = clusters.groupby([lvl0, clusters]).transform('size')

val = df2.stack().to_frame('value').set_index(clusters, append=True).value

val.mask(size.values == 1).dropna().unstack(1)

      AAA   BBB    CCC   DDD    EEE    FFF    GGG
0 1  80.0  85.0    NaN   NaN    NaN    NaN    NaN
  2   NaN   NaN  100.0  98.0  103.0  105.0  109.0
1 3   NaN   NaN   50.0  50.0   50.0   50.0   50.0
2 2   NaN  30.0   25.0  25.0   25.0   25.0   25.0

This matches your results, except that I split the first row into two rows.

     AAA   BBB    CCC   DDD    EEE    FFF    GGG
0   80.0  85.0    100    98    103    105    109
1    NaN   NaN     50    50     50     50     50
2    NaN  30.0     25    25     25     25     25
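To get back the two separate frames from the question's desired output, the (row index, cluster label) MultiIndex can be sliced per label. A self-contained sketch of the full pipeline (`pieces` is a name introduced for illustration; margins of 10 as above):

```python
import numpy as np
import pandas as pd

def delta_cluster(a, dleft, dright):
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    starts = a.searchsorted(a - dleft) == rng
    ends = np.append(0, a.searchsorted(a + dright, side='right')[:-1]) == rng
    return (starts & ends).cumsum()[y]

df2 = pd.DataFrame(
    {'AAA': [80, 5, 6],
     'BBB': [85, 20, 30],
     'CCC': [100, 50, 25],
     'DDD': [98, 50, 25],
     'EEE': [103, 50, 25],
     'FFF': [105, 50, 25],
     'GGG': [109, 50, 25]})

clusters = pd.DataFrame(
    np.apply_along_axis(delta_cluster, 1, df2.values, 10, 10),
    df2.index, df2.columns).stack()
lvl0 = clusters.index.get_level_values(0)
size = clusters.groupby([lvl0, clusters]).transform('size')
val = df2.stack().to_frame('value').set_index(clusters, append=True).value
out = val.mask(size.values == 1).dropna().unstack(1)

# slice row 0 by cluster label: one Series per group, NaN columns dropped
pieces = {lbl: out.loc[(0, lbl)].dropna() for lbl in out.loc[0].index}
```

Each entry of `pieces` is one of the column groups the question asked for.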
piRSquared