I have a dataframe in the format below:
Original Dataframe
| x | value1 | value2 | value3 | value4
---|-----|----------|----------|----------|-----------
0 | 1 | 1 | NaN | 3 | 1
1 | 2 | 4 | NaN | 1 | NaN
2 | 3 | 2 | 6 | 1 | 2
3 | 4 | 1 | 1 | 2 | 1
My goal is to compute the mean value for each row, dropping NaN and also eliminating outliers. The result should be a new dataframe in the following format:
Desired Dataframe
| x | mean (after dropping the NaN and the outliers)*
---|-----|--------
0 | 1 | a
1 | 2 | b
2 | 3 | c
3 | 4 | d
*Note that a, b, c, d are the mean values (I did not calculate the result)
The ultimate goal, once I have this result, is to plot the values over time.
Working with a single list of values, i.e., the equivalent of one column of values and one column of indices, I can do all the operations: drop NaN, calculate the Z-score, and then return the filtered values, as shown below:
import numpy as np
import pandas as pd
from scipy import stats

data = {'value': [1, 2, 15, np.nan, 2, 2, 2, 3, 1, 1],
        'x': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}
df = pd.DataFrame(data)
# Drop rows containing NaN so that zscore does not return NaN for every entry
df = df.dropna()
# Keep only the rows whose absolute Z-score is below 2
df = df[np.abs(stats.zscore(df['value'])) < 2]
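For comparison, here is a minimal sketch of the same single-column filter using only pandas, so that no explicit `dropna` call is needed. It assumes the column is named `value` as in the data above and uses `ddof=0` to match the default of `scipy.stats.zscore`; the names `z` and `filtered` are just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "value": [1, 2, 15, np.nan, 2, 2, 2, 3, 1, 1],
    "x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
})

# Series.mean and Series.std skip NaN by default.
# ddof=0 (population standard deviation) matches scipy.stats.zscore's default.
z = (df["value"] - df["value"].mean()) / df["value"].std(ddof=0)

# A NaN Z-score compares as False, so the NaN row is dropped
# together with the outlier (15).
filtered = df[z.abs() < 2]
print(filtered)
```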
Several problems arise when I try to do this in bulk (I suspect because I am used to programming with loops, and there must be a "smart" way to do this operation in pandas).
I cannot calculate the Z-score when there is a NaN value. I get this warning:
/home/jupyterlab/conda/lib/python3.6/site-packages/ipykernel_launcher.py:14: RuntimeWarning: invalid value encountered in less
So I know that I need to eliminate the NaN values. But I cannot do that in this dataframe format, since it would mean eliminating an entire row or column, resulting in loss of information.
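To make the wide-format idea concrete, here is a rough sketch of a row-wise version that never drops a row or column. It relies on pandas reductions skipping NaN by default, computes a per-row Z-score, and masks outliers as NaN before averaging. The threshold of 2 and `ddof=0` are carried over from the single-column code; the variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "value1": [1, 4, 2, 1],
    "value2": [np.nan, np.nan, 6, 1],
    "value3": [3, 1, 1, 2],
    "value4": [1, np.nan, 2, 1],
})

values = df.drop(columns="x")

# Row-wise mean and population standard deviation; NaN is skipped by default.
row_mean = values.mean(axis=1)
row_std = values.std(axis=1, ddof=0)

# Row-wise Z-score: subtract and divide along the index (axis=0 aligns on rows).
z = values.sub(row_mean, axis=0).div(row_std, axis=0)

# Replace outliers with NaN. Comparisons against NaN are False, so cells whose
# Z-score is undefined (missing values, or rows with zero spread) are left as-is.
threshold = 2
kept = values.mask(z.abs() >= threshold)

result = pd.DataFrame({"x": df["x"], "mean": kept.mean(axis=1)})
print(result)
```

One caveat with this sketch: with at most four values per row, the absolute Z-score computed this way cannot exceed sqrt(3) ≈ 1.73, so a threshold of 2 removes nothing from the sample table above. The filter only starts to matter with more value columns or a lower threshold.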
The other approach that I tried without success was to transform this table to the long form, meaning that I'd have:
| x | valueName | actualValue
---|-----|-------------|--------------
0 | 1 | value1 | 1
1 | 1 | value2 | NaN
... ... ... ...
2 | 2 | value1 | 4
3 | 2 | value2 | NaN
This let me drop the NaN values, but applying the Z-score became more problematic. I'm sure this is a common problem, but I cannot figure out how to solve it.
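For the long-form route, one possible sketch is to compute the Z-score within each `x` group using `groupby(...).transform`, which returns results aligned with the original rows. The column names `valueName` and `actualValue` follow the table above; the threshold of 2 and `ddof=0` are assumptions carried over from the single-column code.

```python
import numpy as np
import pandas as pd

wide = pd.DataFrame({
    "x": [1, 2, 3, 4],
    "value1": [1, 4, 2, 1],
    "value2": [np.nan, np.nan, 6, 1],
    "value3": [3, 1, 1, 2],
    "value4": [1, np.nan, 2, 1],
})

# Wide -> long, then drop the NaN observations (no other information is lost).
long = wide.melt(id_vars="x", var_name="valueName", value_name="actualValue")
long = long.dropna(subset=["actualValue"])

# Per-group mean and population standard deviation, broadcast back to each row.
grouped = long.groupby("x")["actualValue"]
group_mean = grouped.transform("mean")
group_std = grouped.transform(lambda s: s.std(ddof=0))
z = (long["actualValue"] - group_mean) / group_std

# Keep every row that is not flagged as an outlier; an undefined Z-score
# (zero spread within a group) is treated as "not an outlier".
threshold = 2
inliers = long[~(z.abs() >= threshold)]

result = (
    inliers.groupby("x", as_index=False)["actualValue"]
    .mean()
    .rename(columns={"actualValue": "mean"})
)
print(result)
```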