Python performance problems while classifying column values

Question

This question is strongly related to my question earlier: here
Sorry that I have to ask again!

The code below runs and delivers the correct results, but it is slow again (4 minutes for 80K rows). I am having trouble using the pandas Series class to compare against concrete values. Can someone recommend how I can classify this column instead?

I could not find relevant information in the documentation:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.html

Running Code:

# p_test_SOLL_test_D10

for x in range(len(tableContent[6])):
    # Read and convert the cell once instead of on every comparison
    value = float(tableContent[6].loc[x, 'p_test_LAENGE'])
    if value >= 100.0:
        tableContent[6].loc[x, 'p_test_LAENGE'] = 'yes'
    elif 10 <= value < 30.0:
        tableContent[6].loc[x, 'p_test_LAENGE'] = 'yes2'
    elif 5 <= value < 10.0:
        tableContent[6].loc[x, 'p_test_LAENGE'] = 'yes3'
    else:
        tableContent[6].loc[x, 'p_test_LAENGE'] = 'no'

print (tableContent[6]['p_test_LAENGE'])

Series Try:

if tableContent[6]['p_test_LAENGE'].astype(float) >=100.0:
    tableContent[6]['p_test_LAENGE']='yes'
elif (tableContent[6]['p_test_LAENGE'].astype(float) <30.0 and tableContent[6]['p_test_LAENGE'].astype(float) >= 10):
    tableContent[6]['p_test_LAENGE']='yes1'
elif (tableContent[6]['p_test_LAENGE'].astype(float) <10.0 and tableContent[6]['p_test_LAENGE'].astype(float) >= 5):
    tableContent[6]['p_test_LAENGE']='yes2'
else:
    tableContent[6]['p_test_LAENGE']='no'


print (tableContent[6]['p_test_LAENGE'])

There are way more experienced people with Pandas (probably looking at this question right now). So, I won't even try to give you a canonical approach. However, the advice in your previous question mentions "vectorization" and "take away the loop". If you're using `for` loops, they run in roughly Python time (computationally) and you might as well not bother with `numpy` or `pandas` and just go vanilla python. It requires a shift in thinking, and a couple of tutorials might suit you well here before proceeding. — roganjosh, Jul 12 '17 at 23:27
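To make "take away the loop" concrete, here is a minimal sketch of the same classification done on the whole column at once. It is only an illustration: the sample DataFrame and the new column name p_test_LAENGE_class are made up, and np.select is one possible vectorized approach, not necessarily the one the commenter had in mind. Note that the "Series Try" above fails because `if` on a Series raises a ValueError (its truth value is ambiguous), and element-wise conditions have to be combined with `&` rather than `and`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the real column lives in tableContent[6]
df = pd.DataFrame({'p_test_LAENGE': ['120.0', '15', '7.5', '50', '2']})

# Convert the whole column to float once instead of calling float() per cell
length = df['p_test_LAENGE'].astype(float)

# Each condition is a boolean Series evaluated over the entire column
conditions = [
    length >= 100.0,
    (length >= 10.0) & (length < 30.0),
    (length >= 5.0) & (length < 10.0),
]
choices = ['yes', 'yes2', 'yes3']

# np.select takes the first matching choice per row and 'no' otherwise
df['p_test_LAENGE_class'] = np.select(conditions, choices, default='no')
print(df)
```

The labels follow the loop in the question ('yes', 'yes2', 'yes3', 'no'); the conditions are checked in order, just like the if/elif chain.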

score 1 · Accepted Answer · answered Jul 12 '17 at 23:43


I do not have your df to test, so you will need to adapt the following code. It assumes that every value in the column is greater than 10e-7 and less than 10e7; values outside the bin edges end up as NaN.

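import pandas as pd

bins = [10e-7, 5, 10, 30, 100, 10e7]
labels = ['no', 'yes2', 'yes1', 'no', 'yes']
# right=False gives [5, 10), [10, 30), ...; ordered=False allows the repeated 'no' label (pandas >= 1.1)
df['p_test_LAENGE_class'] = pd.cut(df['p_test_LAENGE'].astype(float), bins=bins, labels=labels, right=False, ordered=False)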
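If you want to try this without your own data, here is a small self-contained sketch. The sample values are made up, and using -inf/inf as the outer edges is my own variation so that no assumption about the minimum and maximum is needed. It also assumes pandas 1.1 or newer for the `ordered` argument.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including the bin edges themselves
df = pd.DataFrame({'p_test_LAENGE': ['4.9', '5', '10', '29.9', '30', '100', '250']})

# Open-ended outer bins, so every finite value falls into some bin
bins = [-np.inf, 5, 10, 30, 100, np.inf]
labels = ['no', 'yes2', 'yes1', 'no', 'yes']

df['p_test_LAENGE_class'] = pd.cut(
    df['p_test_LAENGE'].astype(float),  # the column may hold strings
    bins=bins,
    labels=labels,
    right=False,    # intervals are [5, 10), [10, 30), [30, 100), ...
    ordered=False,  # required because the label 'no' appears twice
)
print(df)
```

With right=False the lower edge of each bin is inclusive, which matches the `>=` / `<` comparisons in the question.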

Hope this will help you

answered Jul 12 '17 at 23:43

Mr_U4913


Thank you! The cut method was what I've been looking for! – Bene Jul 13 '17 at 09:07
