I have a dataframe in the format below:

Original Dataframe

    |  x  |  value1  |  value2  |  value3  |  value4
 ---|-----|----------|----------|----------|-----------
  0 |  1  |    1     |   NaN    |    3     |   1
  1 |  2  |    4     |   NaN    |    1     |   NaN
  2 |  3  |    2     |    6     |    1     |   2
  3 |  4  |    1     |    1     |    2     |   1

My goal is to compute the mean value for each row, dropping NaN values and also eliminating outliers. The result should be a new dataframe in the following format:

Desired Dataframe

    |  x  |  mean (after dropping the NaN and the outliers)*
 ---|-----|--------
  0 |  1  |   a
  1 |  2  |   b   
  2 |  3  |   c   
  3 |  4  |   d   

*Note that a, b, c, d are the mean values (I did not calculate the result)


The ultimate goal after reaching the result is to print a graph for values over time.


Working with a single list of values, i.e., the equivalent of one column of values and one column of indices, I can do all the operations: drop the NaN values, calculate the Z score, and then return the remaining values, as shown below:

import pandas as pd   
import numpy as np
from scipy import stats

data = {'value': [1, 2, 15, np.nan, 2, 2, 2, 3, 1, 1],
        'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)

df.dropna(inplace=True)
df = df[(np.abs(stats.zscore(df['value'])) < 2)]

A few problems arise when I try to do this in bulk (and I'm sure they arise because I'm used to programming with loops, and there must be a "smart" way to do this operation in Pandas).

I cannot calculate the Z score when there is a NaN value. I get this warning:

/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel_launcher.py:14: RuntimeWarning: invalid value encountered in less

So I know that I need to eliminate the NaN values. But I cannot do that in this dataframe format, since it would mean dropping either a whole row or a whole column, resulting in loss of information.
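The effect of a single NaN can be seen on the first row of the table above. This is only a minimal NumPy sketch (the variable names are illustrative): the plain mean turns into NaN, a comparison against NaN is always False, and the NaN-aware reductions `np.nanmean` and `np.nanstd` simply skip the missing entry.

```python
import numpy as np

# First row of the original dataframe: value1..value4 for x == 1
row = np.array([1.0, np.nan, 3.0, 1.0])

# A plain reduction propagates the NaN
plain_mean = row.mean()

# NaN-aware reductions ignore the missing entry (population std, ddof=0)
nan_mean = np.nanmean(row)
nan_std = np.nanstd(row)

# The z-score of the NaN entry is NaN, and NaN < 2 evaluates to False,
# so the missing entry is excluded by the mask together with any outlier
with np.errstate(invalid='ignore'):
    abs_z = np.abs((row - nan_mean) / nan_std)
    keep = abs_z < 2

robust_mean = row[keep].mean()
print(plain_mean, robust_mean)
```

Because the mask is False for the NaN entry, no separate `dropna` step is needed for a single row in this formulation.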

The other approach that I tried without success was to transform this table to the long form, meaning that I'd have:

    |  x  |  valueName  |  actualValue
 ---|-----|-------------|--------------
  0 |  1  |  value1     |      1       
  1 |  1  |  value2     |     NaN    
 ...  ...      ...            ...
  2 |  2  |  value1     |      4  
  3 |  2  |  value2     |     NaN

This let me drop the NaN values, but applying the Z score became more problematic. I'm sure this is a common problem, but I cannot figure out how to solve it.

Dan
  • what about changing `NaN` to `0` ? – Sociopath Mar 27 '19 at 08:10
  • I think your question has been answered [here](https://stackoverflow.com/questions/23451244/how-to-zscore-normalize-pandas-column-with-nans) – Yohai Magan Mar 27 '19 at 08:19
  • Possible duplicate of [how to zscore normalize pandas column with nans?](https://stackoverflow.com/questions/23451244/how-to-zscore-normalize-pandas-column-with-nans) – Yohai Magan Mar 27 '19 at 08:20
  • Hi AhshayNevrekar, if I do that, the mean will be influenced. In this case, " ", 0.0, or NaN stands for no entry. – Dan Mar 27 '19 at 11:04
  • Hi yochay magan, the link you posted refers to a one-dimensional series, which I could tackle by just using the dropna method. – Dan Mar 27 '19 at 11:52

1 Answer

You can use:

import numpy as np
from scipy import stats

# reshape to a MultiIndex Series to remove the NaNs
s = df.set_index('x').stack()
print (s)
x        
1  value1    1.0
   value3    3.0
   value4    1.0
2  value1    4.0
   value3    1.0
3  value1    2.0
   value2    6.0
   value3    1.0
   value4    2.0
4  value1    1.0
   value2    1.0
   value3    2.0
   value4    1.0
dtype: float64
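The same NaN-free long form can also be built with `melt` followed by an explicit `dropna`, which does not rely on how `stack` treats missing values in a given pandas version. This is a sketch under assumptions: the dataframe is rebuilt from the table in the question, and the column names `valueName` and `actualValue` are borrowed from the long-form table shown there.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'value1': [1, 4, 2, 1],
    'value2': [np.nan, np.nan, 6, 1],
    'value3': [3, 1, 1, 2],
    'value4': [1, np.nan, 2, 1],
})

# Wide -> long: one row per (x, valueName) pair, then drop the missing entries
long_df = (
    df.melt(id_vars='x', var_name='valueName', value_name='actualValue')
      .dropna(subset=['actualValue'])
      .sort_values(['x', 'valueName'])
      .reset_index(drop=True)
)

# Number of remaining values per x
counts = long_df.groupby('x')['actualValue'].size()
print(long_df)
```

The result holds the same 13 non-missing values as the stacked Series above, only as a flat dataframe instead of a MultiIndex Series.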

# compute the absolute z-score within each group of the first level (x)
s1 = s.groupby(level=0).transform(lambda x: np.abs(stats.zscore(x)))
print (s1)
x        
1  value1    0.707107
   value3    1.414214
   value4    0.707107
2  value1    1.000000
   value3    1.000000
3  value1    0.390567
   value2    1.692456
   value3    0.911322
   value4    0.390567
4  value1    0.577350
   value2    0.577350
   value3    1.732051
   value4    0.577350
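For reference, `stats.zscore` with its default `ddof=0` is the deviation from the mean divided by the population standard deviation, so the per-group step can also be written in pandas alone. The following is a sketch rather than a replacement for the code above; the Series is rebuilt from the question's table, and the names `abs_z`, `name` and `value` are illustrative.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'value1': [1, 4, 2, 1],
    'value2': [np.nan, np.nan, 6, 1],
    'value3': [3, 1, 1, 2],
    'value4': [1, np.nan, 2, 1],
})

# NaN-free Series indexed by (x, name)
s = (
    df.melt(id_vars='x', var_name='name', value_name='value')
      .dropna(subset=['value'])
      .set_index(['x', 'name'])['value']
      .sort_index()
)

# |z| = |value - group mean| / group population std (ddof=0)
abs_z = s.groupby(level='x').transform(
    lambda g: (g - g.mean()).abs() / g.std(ddof=0)
)
print(abs_z)
```

On this sample every absolute z-score is below 2, so the threshold in the next step does not remove any value.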

# filter by the condition, take the mean per x (first index level), convert to a DataFrame
s2 = s[s1 < 2].groupby(level=0).mean().reset_index(name='mean')
print (s2)
   x      mean
0  1  1.666667
1  2  2.500000
2  3  2.750000
3  4  1.250000
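If reshaping is not wanted, a similar result can be sketched directly on the wide table, because the row-wise `mean` and `std` in pandas skip NaN by default. This is an alternative under assumptions, not the approach above: it uses the population standard deviation (`ddof=0`) to match the default of `stats.zscore`, the names `vals`, `z` and `result` are illustrative, and a row whose values are all identical has zero deviation and would end up with a NaN mean.

```python
import numpy as np
import pandas as pd

# Sample data rebuilt from the table in the question
df = pd.DataFrame({
    'x': [1, 2, 3, 4],
    'value1': [1, 4, 2, 1],
    'value2': [np.nan, np.nan, 6, 1],
    'value3': [3, 1, 1, 2],
    'value4': [1, np.nan, 2, 1],
})

vals = df.set_index('x')

# Row-wise statistics; NaN entries are skipped by default
row_mean = vals.mean(axis=1)
row_std = vals.std(axis=1, ddof=0)

# Absolute z-score of every cell relative to its own row
z = vals.sub(row_mean, axis=0).div(row_std, axis=0).abs()

# Keep cells with |z| < 2 (NaN cells compare as False and stay NaN),
# then average what is left in each row
result = vals.where(z < 2).mean(axis=1).reset_index(name='mean')
print(result)
```

For the sample data this gives the same four means as the output above.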
jezrael