Identify increasing features in a data frame

Question

I have a data frame in which some features hold cumulative values. I need to identify those features so that I can revert the cumulative values. This is how my dataset looks (plus about 50 more variables):

What I wish to achieve is:

I've seen this answer, but it first reverts the values and then tries to identify the columns. Can't I do it the other way around: first identify the features and then revert the values?

Finding cumulative features in dataframe?

What I do at the moment is run the following code, which flags each feature that has cumulative values with a 1:

 def accmulate_col(value):
     count = 0
     count_1 = False
     name = []
     for i in range(len(value)-1):
         if value[i+1]-value[i] >= 0:
             count += 1
         if value[i+1]-value[i] > 0:
             count_1 = True
     name.append(1) if count == len(value)-1 and count_1 else name.append(0)
     return name

 df.apply(accmulate_col)

Afterwards, I manually save these feature names in a list called cum_features and revert the values, creating the desired dataset:

df_clean = df.copy()
df_clean[cum_features] = df_clean[cum_features].apply(lambda col: np.diff(col, prepend=0))

Is there a better way to solve my problem?

You should definitely be favouring difference calculating functions over doing the iterations yourself. Having said that, could you provide an example dataframe to work with? — Paritosh Singh, Aug 06 '19 at 12:50

score 0 · Accepted Answer · answered Aug 06 '19 at 13:13

To identify which columns have increasing* values throughout the whole column, you will need to apply conditions on all the values. So in that sense, you have to use the values first to figure out what columns fit the conditions.

With that out of the way, given a dataframe such as:

import pandas as pd
d = {'a': [1,2,3,4],
     'b': [4,3,2,1]
     }
df = pd.DataFrame(d)
#Output:
   a  b
0  1  4
1  2  3
2  3  2
3  4  1

Figuring out which columns contain increasing values is just a question of using diff on all values in the dataframe, and checking which ones are increasing throughout the whole column.

That can be written as:

out = (df.diff().dropna()>0).all()
#Output:
a     True
b    False
dtype: bool

Then, you can use the column names to select only the columns marked True:

new_df = df[df.columns[out]]
#Output:
   a
0  1
1  2
2  3
3  4
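
To go one step further and undo the accumulation on the selected columns, as the question asks, the boolean result can be reused directly. The following is only a sketch: column c and its values are made up for illustration, and subtracting a shifted copy filled with 0 assumes each running total started from zero (the same assumption as np.diff(col, prepend=0) in the question).

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [4, 3, 2, 1],
    'c': [10, 30, 60, 100],   # hypothetical running total of 10, 20, 30, 40
})

# Boolean Series indexed by column name: True where every step is positive
out = (df.diff().dropna() > 0).all()

# Names of the columns that passed the check
cum_features = out[out].index.tolist()

df_clean = df.copy()
# Subtract the previous row from each row; the first row is compared against 0
df_clean[cum_features] = df[cum_features] - df[cum_features].shift(fill_value=0)
```

Columns that fail the check, such as b, are left untouched in df_clean.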

*(The term cumulative doesn't really represent the conditions you used. Did you want it to be cumulative or just increasing? Cumulative implies that the value in a particular row/index is the sum of all previous values up to that index, while increasing just means that the value in the current row/index is greater than the previous one.)
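
Note that the check above is strict: a column with a repeated value fails it. The loop in the question accepts such plateaus, as long as the column never decreases and rises at least once. If that is the intended condition, a possible variant looks like this (the example columns are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 2, 4],   # never decreases, but has a plateau
    'b': [4, 3, 2, 1],   # decreasing
    'c': [5, 5, 5, 5],   # constant
})

# Row-to-row differences, without the first row (always NaN after diff)
steps = df.diff().iloc[1:]

# Strict check: every step must be positive
strict = (steps > 0).all()

# Looser check, mirroring the question's loop:
# no step is negative and at least one step is positive
loose = (steps >= 0).all() & (steps > 0).any()
```

Here strict is False for all three columns, while loose is True only for a.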
