
I have a dataframe with several hundred rows and columns and want to drop all NaNs. Unfortunately, there are NaNs in every column and every row.

df = df.dropna(how = "any") 

would therefore result in an empty dataframe. I use a while loop to iteratively drop rows and columns with dropna and an increasing threshold.

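i = 0.0
while df.isna().sum().sum() != 0:
    i += 0.01
    # axis=0 drops rows: a row needs at least i * (number of columns) non-NaN values
    df = df.dropna(thresh=(i * df.shape[1]), axis=0)
    # axis=1 drops columns: a column needs at least i * (number of rows) non-NaN values
    df = df.dropna(thresh=(i * df.shape[0]), axis=1)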

This greedy algorithm is certainly suboptimal in more than one way.
Aside from writing my own linear program to minimize the amount of deleted data, is there a built-in function that I do not know about? My goal is to preserve as much data as possible.

Tom S
  • Can you explain why you want to drop rows and columns with NaNs? In particular, why is it OK to keep some rows and columns with fewer than a given number of NaNs? – Thilina Dissanayake Apr 20 '21 at 14:24
  • The long-term goal is a regression task, and the algorithms can't handle NaN values. I have also experimented with .fillna(), but because of the number of NaNs in my dataset I fear a potential loss of quality. Regarding your second question: I do not quite understand what you mean. When the while loop exits, there are no NaNs left. – Tom S Apr 20 '21 at 14:33

1 Answer


Given the motivation you describe in the comments, I recommend trying the interpolate() method of Pandas.

df = df.interpolate()

You can experiment with different kinds of interpolation through the method argument.

As an example, you can use

df = df.interpolate(method='quadratic')

if you have time-series data that grows at an increasing rate. (Note that you need scipy installed to use methods such as 'quadratic'; the default 'linear' works without it.)
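To make the effect concrete, here is a minimal sketch on a made-up frame (df_demo and its values are assumptions for illustration, not your data). It uses the default linear method, so it runs without scipy:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two columns with gaps between observed values
df_demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, np.nan, 5.0],
    "b": [10.0, 20.0, np.nan, 40.0, 50.0],
})

# The default method='linear' fills each gap from its neighbours
# in the same column, treating the rows as equally spaced
filled = df_demo.interpolate()
print(filled)
```

Each gap is filled column by column from the values above and below it, so this only makes sense when the order of the rows carries meaning.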

Please refer to the Pandas documentation here.

Furthermore, you can try to experiment with other data imputation methods. Some data imputation methods are explained in this article.

In particular, hot deck imputation might work in your case.
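As far as I know, Pandas has no built-in hot deck function, so the following is only a rough sketch of the idea under my own assumptions: all columns are numeric, similarity is the mean squared difference over the columns that both rows have observed, and the donor is the closest row that has the missing column filled. The name hot_deck_impute and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd

def hot_deck_impute(df):
    """Fill each NaN with the value from the most similar row (the donor)."""
    values = df.to_numpy(dtype=float)
    result = values.copy()
    observed = ~np.isnan(values)
    for r in range(values.shape[0]):
        missing_cols = np.where(~observed[r])[0]
        if missing_cols.size == 0:
            continue
        # Columns that are observed both in row r and in each candidate row
        shared = observed & observed[r]
        n_shared = shared.sum(axis=1)
        diff = np.where(shared, values - values[r], 0.0)
        # Mean squared difference over the shared columns
        with np.errstate(divide="ignore", invalid="ignore"):
            dist = (diff ** 2).sum(axis=1) / n_shared
        dist[n_shared == 0] = np.inf  # nothing to compare on
        dist[r] = np.inf              # a row cannot be its own donor
        for c in missing_cols:
            # Only rows that have column c observed can donate
            candidates = np.where(observed[:, c], dist, np.inf)
            donor = int(np.argmin(candidates))
            if np.isfinite(candidates[donor]):
                result[r, c] = values[donor, c]
    return pd.DataFrame(result, index=df.index, columns=df.columns)

# Hypothetical toy data with a NaN in every column
df_demo = pd.DataFrame({
    "x": [1.0, 1.1, 5.0, np.nan],
    "y": [2.0, np.nan, 6.0, 6.1],
    "z": [3.0, 3.2, np.nan, 7.0],
})
imputed = hot_deck_impute(df_demo)
print(imputed)
```

Donor values are always taken from the original data, never from previously imputed cells. A cell stays NaN if no row with that column observed shares any observed column with it. You would probably want to scale the columns first so that no single column dominates the distance.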

  • Thank you very much for your efforts. This would probably help most people in a situation similar to mine. Unfortunately, my data points are uncorrelated, so interpolation is not the right way to go. – Tom S Apr 20 '21 at 15:04
  • Sorry I could not help. But take a look at hot deck imputation, which searches for a similar instance (the donor) and replaces the missing value with the donor's value. https://stackoverflow.com/questions/59541759/hot-deck-imputation-in-python – Thilina Dissanayake Apr 20 '21 at 15:34