
I have a data frame. It is an intermediate CSV file. It contains the following data:

 sv1   val1    sv2    val2    sv3   val3
   2     0.2     4      0.6      8     0.3
   2     0.1     6      0.1      8     0.11
   2     0.12    6      -0.3     8     0.2
   5     0       4      1.6      8     0.7
   2     0.34    6      2.3      8     0.12
   ...   ....   ...     ....    ...   .....

Aim: Sum val1 + val2 + val3 if none of sv1, sv2, sv3 contains 5. If any sv column (say sv1) contains 5, then the sum should be val2 + val3.
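
For example, in the fourth data row sv1 is 5, so the result should be val2 + val3 = 1.6 + 0.7 = 2.3; in the first row no sv column contains 5, so the result is val1 + val2 + val3 = 0.2 + 0.6 + 0.3 = 1.1.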

# Attempt
import pandas as pd

names = ['sv1', 'sv2', 'sv3', 'val1', 'val2', 'val3']
df = pd.read_csv('Myfile.csv', names=names)
discard_id = int(raw_input('enter the number to discard'))

add_result = df.loc[['sv1', 'sv2', 'sv3'] != discard_id]
# ... perform the addition
Poka

2 Answers


First compare all sv values with discard_id and use any to get True for rows with at least one match. Then sum the column subsets and add the new column with numpy.where:

import numpy as np

discard_id = 5

# True for rows where at least one sv column equals discard_id
m = (df[['sv1','sv2','sv3']] == discard_id).any(axis=1)
# row-wise sums for both cases
sum1 = df[['val1','val2','val3']].sum(axis=1)
sum2 = df[['val2','val3']].sum(axis=1)

df['new'] = np.where(m, sum2, sum1)

print (df)
   sv1  val1  sv2  val2  sv3  val3   new
0    2  0.20    4   0.6    8  0.30  1.10
1    2  0.10    6   0.1    8  0.11  0.31
2    2  0.12    6  -0.3    8  0.20  0.02
3    5  0.00    4   1.6    8  0.70  2.30
4    2  0.34    6   2.3    8  0.12  2.76

Detail:

print (m)
0    False
1    False
2    False
3     True
4    False
dtype: bool

print (sum1)
0    1.10
1    0.31
2    0.02
3    2.30
4    2.76
dtype: float64

print (sum2)
0    0.90
1    0.21
2   -0.10
3    2.30
4    2.42
dtype: float64

Timings:

df = pd.concat([df] * 1000, ignore_index=True)

In [312]: %%timeit
     ...: m = (df[['sv1','sv2','sv3']] == discard_id).any(axis=1)
     ...: sum1 = df[['val1','val2','val3']].sum(axis=1)
     ...: sum2 = df[['val2','val3']].sum(axis=1)
     ...: df['new'] = np.where(m, sum2, sum1)
     ...: 
100 loops, best of 3: 2.77 ms per loop

#jp_data_analysis's solution
In [313]: %%timeit
     ...: df['sum'] = df.apply(summer, axis=1, num=5)
     ...: 
1 loop, best of 3: 287 ms per loop
jezrael
  • @jezrael Can't we use df.loc[...] instead of just df['parameter']? – Poka Feb 13 '18 at 03:35
  • Yes, it is possible. `print (df.loc[:, ['parameter', 'parameter1']])` is the same as `print (df[['parameter', 'parameter1']])` if you want to select columns by name. If you want to select by index values you need `print (df.loc[[0, 1]])`, and for a single index value `print (df.loc[0])` (see the sketch after these comments). – jezrael Feb 13 '18 at 06:11
  • @jezrael So in the previous example I can write `m = (df.loc[:,['sv1','sv2','sv3']] == discard_id).any(axis=1)`? – Poka Feb 13 '18 at 06:44
  • @Poka - yes, exactly. It is the same. – jezrael Feb 13 '18 at 06:45
  • @jezrael Thanks. – Poka Feb 13 '18 at 07:08
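
A minimal sketch of the equivalence described in the comments above (it only assumes the column names from this question's sample data; it is not part of either answer):

import pandas as pd

# tiny frame with the question's column names
df = pd.DataFrame({'sv1': [2, 5], 'val1': [0.2, 0.0],
                   'sv2': [4, 4], 'val2': [0.6, 1.6],
                   'sv3': [8, 8], 'val3': [0.3, 0.7]})

# column selection: label-based .loc and plain [] return the same frame
a = df.loc[:, ['sv1', 'sv2', 'sv3']]
b = df[['sv1', 'sv2', 'sv3']]
print(a.equals(b))       # True

# selecting rows by index label needs .loc
print(df.loc[0])         # one index value -> Series
print(df.loc[[0, 1]])    # list of index values -> DataFrame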

This is one way:

def summer(row, num):
    # sum each val column whose paired sv column does not equal num
    return sum(i for i, j in zip([row['val1'], row['val2'], row['val3']],
                                 [row['sv1'], row['sv2'], row['sv3']]) if j != num)

df['sum'] = df.apply(summer, axis=1, num=5)
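
For the sample data above this gives the same values as the `new` column in the previous answer (1.10, 0.31, 0.02, 2.30 and 2.76), since both approaches compute the same conditional sum.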
jpp
  • I added timings to my answer. I think in pandas it is best to avoid loops (`apply` + `list comprehension`s) when vectorized solutions exist. – jezrael Feb 12 '18 at 14:29
  • @jezrael. I agree with you and upvoted your answer. But, for smaller data sets, I would argue that mine is preferable for readability :). – jpp Feb 12 '18 at 14:31
  • Thank you. In the past I used list comprehensions more, but Jeff (one of the main pandas developers) added a comment to one of my solutions suggesting to avoid them, so I stopped. If the data contains lists or sets, then a `list comprehension` is a good choice ;) Just a friendly comment, because you like list comprehensions a lot :) – jezrael Feb 12 '18 at 14:34
  • Technically this is a generator expression, no lists are being created here :). But I accept your general point. – jpp Feb 12 '18 at 14:35
  • Ya, my problem is that I learned pandas first and then Python, hence the bad wording ;) Good luck! – jezrael Feb 12 '18 at 14:36