Replace values on dataset and apply quartile rule by row on pandas

Question

I have a dataset with lots of variables. So I've extracted the numeric ones:

numeric_columns = transposed_df.select_dtypes(np.number)

Then I want to replace all 0 values for 0.0001

transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.eq(0, axis=0), 0.0001)

And here is the first problem. This line is not replacing the 0 values with 0.0001, but is replacing all non zero values with 0.0001.

Also after this (replacing the 0 values by 0.0001) I want to replace all values there are less than the 1th quartile of the row to -1 and leave the others as they were. But I am not managing how.

Can you create some data sample and expected output? – jezrael Apr 16 '20 at 10:28 — jezrael, Apr 16 '20 at 10:28

score 2 · Answer 1 · answered Apr 16 '20 at 10:48

To answer your first question

In [36]: from pprint import pprint

In [37]: pprint( numeric_columns.where.__doc__)
('\n'
 'Replace values where the condition is False.\n'
 '\n'
 'Parameters\n'
 '----------\n'

because of that your all the values except 0 are getting replaced

jezrael · Accepted Answer · 2020-04-16T10:59:30.930

1

Use DataFrame.mask and for second condition compare by DataFrame.quantile:

transposed_df = pd.DataFrame({
        'A':list('abcdef'),
         'B':[0,0.5,4,5,5,4],
         'C':[7,8,9,4,2,3],
         'D':[1,3,0,7,1,0],
         'E':[5,3,6,9,2,4],
         'F':list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)
m1 = numeric_columns.eq(0)
m2 = numeric_columns.lt(numeric_columns.quantile(q=0.25, axis=1), axis=0)
transposed_df[numeric_columns.columns] = numeric_columns.mask(m1, 0.0001).mask(m2, -1)

print (transposed_df)
   A    B  C    D  E  F
0  a -1.0  7  1.0  5  a
1  b -1.0  8  3.0  3  a
2  c  4.0  9 -1.0  6  a
3  d  5.0 -1  7.0  9  b
4  e  5.0  2 -1.0  2  b
5  f  4.0  3 -1.0  4  b

EDIT:

from  scipy.stats import zscore
print (transposed_df[numeric_columns.columns].apply(zscore))
          B         C         D         E
0 -2.236068  0.570352 -0.408248  0.073521
1  0.447214  0.950586  0.408248 -0.808736
2  0.447214  1.330821 -0.816497  0.514650
3  0.447214 -0.570352  2.041241  1.838037
4  0.447214 -1.330821 -0.408248 -1.249865
5  0.447214 -0.950586 -0.816497 -0.367607

EDIT1:

transposed_df = pd.DataFrame({
        'A':list('abcdef'),
         'B':[0,1,1,1,1,1],
         'C':[1,8,9,4,2,3],
         'D':[1,3,0,7,1,0],
         'E':[1,3,6,9,2,4],
         'F':list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

from  scipy.stats import zscore

df1 = pd.DataFrame(numeric_columns.apply(zscore, axis=1).tolist(),index=transposed_df.index)
transposed_df[numeric_columns.columns] = df1
print (transposed_df)
   A         B         C         D         E  F
0  a -1.732051  0.577350  0.577350  0.577350  a
1  b -1.063410  1.643452 -0.290021 -0.290021  a
2  c -0.816497  1.360828 -1.088662  0.544331  a
3  d -1.402136 -0.412393  0.577350  1.237179  b
4  e -1.000000  1.000000 -1.000000  1.000000  b
5  f -0.632456  0.632456 -1.264911  1.264911  b

edited Apr 16 '20 at 10:59

answered Apr 16 '20 at 10:33

jezrael

822,522
95
1,334
1,252

Why to use the mask solve the problem? Also, do you have any idea for the quartil question there is also there ? :) – Catarina Nogueira Apr 16 '20 at 10:34
1

@CatarinaNogueira - Working on it. – jezrael Apr 16 '20 at 10:35
1

@CatarinaNogueira - first quantile is `q=0.5` ? – jezrael Apr 16 '20 at 10:36
No, its 25, my idea is to use quantile() function :) – Catarina Nogueira Apr 16 '20 at 10:38
First, thanks a lot!!!!! But still, a question, if I change your B to 'B':[0,1,1,1,1,1], and apply transposed_df[numeric_columns.columns].apply(lambda x: stats.zscore(x)) before the next operations I've noticed that the output values are not the same on all B[ij] that are equal values. Do you have any clue of why? Since I am applying the same formula on the same number I was expecting same results. Seems like .apply is by column, not row – Catarina Nogueira Apr 16 '20 at 10:46
And if I do transposed_df[numeric_columns.columns].apply(lambda x: stats.zscore(x), axis = 1), as you did in your answer, I got an error – Catarina Nogueira Apr 16 '20 at 10:50
1

Sure! In the answer is applying Zscore by column, what I was aiming was to have the same by row. I wrote it wrong when I said it was 'B':[0,1,1,1,1,1] ( I am sorry), what I wanted to say is that I wanted a row with all same values, not a column. But If I change your data to have one row like: the row B C D E -> [0, 1, 1, 1] I don't get the same value for all the 1's when I apply the function to it – Catarina Nogueira Apr 16 '20 at 10:55
Then if I do transposed_df[numeric_columns.columns].apply(zscore, axis =1) it apply by row, but then I cannot attribute back to transposed_df[numeric_columns.columns] because I got the error: Must have equal len keys and value when setting with an iterable – Catarina Nogueira Apr 16 '20 at 10:58
1

@CatarinaNogueira - Is necessary DataFrame constructor, answer was edited. – jezrael Apr 16 '20 at 10:59
1

Ahhh ok, I didn't know this was a thing! Thank you infinitely for your help!!!! – Catarina Nogueira Apr 16 '20 at 11:03
Wait, but now if I apply this I cannot apply the masks after. I have to use the dataFrame constructor on them too, right? – Catarina Nogueira Apr 16 '20 at 11:06
1

@CatarinaNogueira - Can you show code? No sure if understand. – jezrael Apr 16 '20 at 11:09
1

I was going to send and then it worked, I guess it was some variable that was with an old value or something. Thanks again a lot for your attention!!! :) :) – Catarina Nogueira Apr 16 '20 at 11:18

Replace values on dataset and apply quartile rule by row on pandas

2 Answers2