
I want to filter a DataFrame of about 400k rows, with 4 columns, removing roughly half of them with an if statement:

    for a in range(0, howmanytimestorunthrough):
        if 'Primary' not in DataFrameexample[a]:
            # take out row

So far I've been testing the four approaches below:

    newdf.append(emptyline)
    newdf.at[b, 'column1'] = DataFrameexample.at[a, 'column1']
    newdf.at[b, 'column2'] = DataFrameexample.at[a, 'column2']
    newdf.at[b, 'column3'] = DataFrameexample.at[a, 'column3']
    newdf.at[b, 'column4'] = DataFrameexample.at[a, 'column4']
    b = b + 1

or the same with .loc

    newdf.append(emptyline)
    newdf.loc[b, :] = DataFrameexample.loc[a, :]
    b = b + 1

or changing the `if not in` to an `if in` and using:

    DataFrameexample = DataFrameexample.drop([k])

or trying to set emptyline to have values, and then append it:

    notemptyline = pd.Series(DataFrameexample.loc[a, :].values, index=['column1', 'column2', ...])
    newdf.append(notemptyline, ignore_index=True)

So from what I've managed to test so far, they all seem to work fine on a small number of rows (2000), but once I get to a lot more rows they take disproportionately longer. .at seems slightly faster than .loc even though I need to run it 4 times, but it still gets slow (10 times the rows takes more than 10 times as long). .drop I think copies the dataframe each time, so it really doesn't work? And I can't seem to get .append(notemptyline) to work properly; it just replaces index 0 over and over again.

I know there must be an efficient way of doing this, I just can't seem to quite get there. Any help?

Sam

1 Answer


Your speed problem has nothing to do with .loc vs .at vs ... (for a comparison between .loc and .at, have a look at this question) but comes from explicitly looping over every row of your dataframe. Pandas is all about vectorising your operations.

You want to filter your dataframe based on a comparison. You can transform that to a boolean indexer.

    indexer = df != 'Primary'

This will give you an n rows by 4 columns dataframe of boolean values. Now you want to reduce it to a single boolean per row (n x 1), true only if all values in the row (axis 1) are true.

    indexer = indexer.all(axis=1)

Now we can use .loc to get only the rows where indexer is True:

    df = df.loc[indexer]

This will be much faster than iterating over the rows.
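Putting the steps above together, here's a minimal sketch (the toy data and column names are invented for illustration; your real frame has 400k rows):

```python
import pandas as pd

# Invented stand-in for the real 400k-row frame
df = pd.DataFrame({
    'column1': ['Primary', 'Test', 'Primary'],
    'column2': ['x', 'y', 'z'],
})

# Boolean frame: True where the cell is not exactly 'Primary'
indexer = df != 'Primary'

# One boolean per row: True if no cell in the row equals 'Primary'
indexer = indexer.all(axis=1)

# Keep only those rows in a single vectorised step
df = df.loc[indexer]
print(df)  # only the 'Test' row survives
```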

EDIT:

To check whether a df entry contains a substring, you can replace the first line with:

    indexer = df.apply(lambda x: x.str.contains('Primary'))

Note that you normally don't want to use an apply statement (internally it uses a for loop for custom functions) to iterate over a lot of elements. In this case we are only looping over the columns, which is fine if you just have a couple of those.
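A short sketch with invented sample values mirroring your follow-up data, where 'Primary' is only part of the string. Note that whether you then reduce with .all(axis=1) or .any(axis=1) depends on whether every column or just one column must match:

```python
import pandas as pd

# Invented sample data; 'Primary' appears as a substring, not the whole value
df = pd.DataFrame({
    'column1': ['Test Primary', 'Test', 'Something Primary'],
    'column2': ['Primary note', 'other', 'misc'],
})

# Boolean frame: True where the cell contains the substring 'Primary'
contains = df.apply(lambda x: x.str.contains('Primary'))

# Rows where *every* column contains the substring ...
all_rows = df.loc[contains.all(axis=1)]

# ... versus rows where *at least one* column does
any_rows = df.loc[contains.any(axis=1)]
```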

P.Tillmann
  • Thanks, I think I'm slowly getting there. How could I change that first line to test whether 'Primary' is part of a string? I've got values like: Test Primary, Test, Something else, Something Primary. That line seems to only test whether Primary is the whole string? – Sam Dec 19 '17 at 11:08