1

I have a very large pandas DataFrame (>100 million rows, and >1000s of columns). Each row has a unique label as index, for most of the rows, only one column contains value. I want to make a new DataFrame by deleting those rows with only one of the columns has value, and keeping those rows that with more than two columns have values.

snps
  • 13
  • 3
  • Can you give example data? – Veedrac Dec 21 '14 at 21:23
  • Can you provide some samples? – Alex K Dec 21 '14 at 21:29
  • Thanks.dshwshr. Obviously, I am not a programmer and trying to learn. I have a much smaller dataframe to work on, which is about 20 million rows and 200-300 columns. But after removing those unwanted rows, the remaining data should be workable? Maybe the question is how to construct such a database the first place? @ Veedrac, the example data can be downloaded from here: https://www.dropbox.com/s/qzuzn0ynn91epcn/Masterdict.csv?dl=0 – snps Dec 21 '14 at 21:30

1 Answers1

1

You can drop them using dropna:

In [3]:
#sample df
df = pd.DataFrame({'a':[0,NaN, 2,3,4], 'b':[0,NaN, 2,3,NaN], 'c':arange(5)})
df

Out[3]:
    a   b  c
0   0   0  0
1 NaN NaN  1
2   2   2  2
3   3   3  3
4   4 NaN  4
In [5]:
# drop just the rows which have 2 or more NaN values
df.dropna(thresh=2, axis=0)
Out[5]:
   a   b  c
0  0   0  0
2  2   2  2
3  3   3  3
4  4 NaN  4

You pass the params thresh=2 to specify that you require at least 2 non-NA values, and axis=0 will specify that the criteria should be applied row-wise.

EdChum
  • 376,765
  • 198
  • 813
  • 562
  • Nice! But it did not work on the example data I tried for some reason. I used df.read_csv to get the file in, maybe those cells witout values are not really NaN? – snps Dec 21 '14 at 22:42
  • Are they numeric at all?, in which case they maybe string or you have malformed data, all it takes is for a single value in the column to be a string and this will make the column be a mixed dtype, you could try converting first: `df.convert_objects(convert_numeric=True)` see the [docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html#pandas.DataFrame.convert_objects). If this desn't work you're going to have to hunt down the erroneous data – EdChum Dec 21 '14 at 22:46
  • Sorry, forgot to mention the first column (which I used as label and really do not need the index) is str. Can I ignore that while apply the dropna or use that column as axis? take a look at the example file at the link? https://www.dropbox.com/s/qzuzn0ynn91epcn/Masterdict.csv?dl=0 – snps Dec 21 '14 at 22:51
  • If you read it in as a csv you specify this col as the index: `df = pd.read_csv('file.csv', index_col=0)` or just set the index after `df.set_index(keys='col1')` then you can drop the rows as per my example – EdChum Dec 21 '14 at 22:54