-1

I have a Pandas DataFrame and I want to find all rows where the i'th column values are 10 times greater than other columns. Here is an example of my DataFrame:

Sample data

For example, looking at column i=0, row B (0.344) its is 10x greater than values in the same row but in other columns (0.001, 0, 0.009, 0). So I would like:

my_list_0=[False,True,False,False,False,False,False,False,False,False,False]

The number of columns might change hence I don't want a solution like:

#This is good only for a DataFrame with 4 columns.
my_list_i = data.loc[(data.iloc[:,i]>10*data.iloc[:,(i+1)%num_cols]) &
                     (data.iloc[:,i]>10*data.iloc[:,(i+2)%num_cols]) &
                     (data.iloc[:,i]>10*data.iloc[:,(i+3)%num_cols])]

Any idea? thanks.

Eran
  • 527
  • 8
  • 15

1 Answers1

1

Given the df:

df = pd.DataFrame({'cell1':[0.006209, 0.344955, 0.004521, 0, 0.018931, 0.439725, 0.013195, 0.009045, 0, 0.02614, 0],
              'cell2':[0.048043, 0.001077, 0,0.010393, 0.031546, 0.287264, 0.016732, 0.030291, 0.016236, 0.310639,0], 
              'cell3':[0,0,0.020238, 0, 0.03811, 0.579348, 0.005906, 0,0,0.068352, 0.030165],
              'cell4':[0.016139, 0.009359, 0,0,0.025449, 0.47779, 0, 0.01282, 0.005107, 0.004846, 0],
              'cell5': [0,0,0,0.012075, 0.031668, 0.520258, 0,0,0,2.728218, 0.013418]})
i = 0

You can use

(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).all(1)

To get

0     False
1      True
2     False
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
dtype: bool

for any number of columns. This drops column i, multiplies the remaining df by 10, and checks row-wise for being less than i, then returns True only if all values in the row are True. So it returns a vector of True for each row where this obtains and False for others.

If you want to give an arbitrary threshold, you can sum the Trues and divide by the number of columns - 1, then compare with your threshold:

thresh = 0.5  # or whatever you want
(10 * df.drop(df.columns[i], axis=1)).lt(df.iloc[:,i], axis=0).sum(1) / (df.shape[1] - 1) > thresh

0     False
1      True
2      True
3     False
4     False
5     False
6     False
7     False
8     False
9     False
10    False
dtype: bool
Josh Friedlander
  • 10,870
  • 5
  • 35
  • 75
  • I don't see how is it going to work (check column i with each and every other column). – Eran Feb 14 '19 at 20:50
  • `sum(1)` sums every row. Why don't you check it and see? – Josh Friedlander Feb 15 '19 at 04:41
  • I did check it, hence asking. The condition translates to df.iloc[:,i] > 10*sum_of_row. This doesn't check that df.iloc[:,i] is greater element wise. Do I miss something? – Eran Feb 15 '19 at 14:42
  • Edited, in future please copy-paste code rather than image as it saves people having to type it up again from scratch in order to reproduce. – Josh Friedlander Feb 17 '19 at 06:30
  • By the way - is there a Pandas operator "between" .any() and .all()? (i.e. True when some percentage of columns satisfy the condition). I guess I can do it with df.sample(frac=0.5,axis=1) after df.drop. – Eran Feb 17 '19 at 11:59
  • Better idea is to sum the number of Trues, and then divide by size of matrix, I've added it to my answer – Josh Friedlander Feb 17 '19 at 12:53
  • Yes, I did something similar - just counted the number of True's and selected based on that: data2 = (10* data).lt(data[data.columns[i]], axis=0) data2['sum_all'] = data2.sum(axis=1) temp_list = data2.loc[data2.sum_all>=3].index – Eran Feb 17 '19 at 13:39