I have a very large pandas DataFrame (>100 million rows, and >1000s of columns). Each row has a unique label as index, for most of the rows, only one column contains value. I want to make a new DataFrame by deleting those rows with only one of the columns has value, and keeping those rows that with more than two columns have values.
Asked
Active
Viewed 660 times
1
-
Can you give example data? – Veedrac Dec 21 '14 at 21:23
-
Can you provide some samples? – Alex K Dec 21 '14 at 21:29
-
Thanks.dshwshr. Obviously, I am not a programmer and trying to learn. I have a much smaller dataframe to work on, which is about 20 million rows and 200-300 columns. But after removing those unwanted rows, the remaining data should be workable? Maybe the question is how to construct such a database the first place? @ Veedrac, the example data can be downloaded from here: https://www.dropbox.com/s/qzuzn0ynn91epcn/Masterdict.csv?dl=0 – snps Dec 21 '14 at 21:30
1 Answers
1
You can drop them using dropna
:
In [3]:
#sample df
df = pd.DataFrame({'a':[0,NaN, 2,3,4], 'b':[0,NaN, 2,3,NaN], 'c':arange(5)})
df
Out[3]:
a b c
0 0 0 0
1 NaN NaN 1
2 2 2 2
3 3 3 3
4 4 NaN 4
In [5]:
# drop just the rows which have 2 or more NaN values
df.dropna(thresh=2, axis=0)
Out[5]:
a b c
0 0 0 0
2 2 2 2
3 3 3 3
4 4 NaN 4
You pass the params thresh=2
to specify that you require at least 2 non-NA values, and axis=0
will specify that the criteria should be applied row-wise.

EdChum
- 376,765
- 198
- 813
- 562
-
Nice! But it did not work on the example data I tried for some reason. I used df.read_csv to get the file in, maybe those cells witout values are not really NaN? – snps Dec 21 '14 at 22:42
-
Are they numeric at all?, in which case they maybe string or you have malformed data, all it takes is for a single value in the column to be a string and this will make the column be a mixed dtype, you could try converting first: `df.convert_objects(convert_numeric=True)` see the [docs](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.convert_objects.html#pandas.DataFrame.convert_objects). If this desn't work you're going to have to hunt down the erroneous data – EdChum Dec 21 '14 at 22:46
-
Sorry, forgot to mention the first column (which I used as label and really do not need the index) is str. Can I ignore that while apply the dropna or use that column as axis? take a look at the example file at the link? https://www.dropbox.com/s/qzuzn0ynn91epcn/Masterdict.csv?dl=0 – snps Dec 21 '14 at 22:51
-
If you read it in as a csv you specify this col as the index: `df = pd.read_csv('file.csv', index_col=0)` or just set the index after `df.set_index(keys='col1')` then you can drop the rows as per my example – EdChum Dec 21 '14 at 22:54