61

I am trying to find the count of distinct values in each column using Pandas. This is what I did.

import pandas as pd
import numpy as np

# Generate data.
NROW = 10000
NCOL = 100
df = pd.DataFrame(np.random.randint(1, 100000, (NROW, NCOL)),
                  columns=['col' + x for x in np.arange(NCOL).astype(str)])

I need to count the number of distinct elements for each column, like this:

col0    9538
col1    9505
col2    9524

What would be the most efficient way to do this, given that it will be applied to files larger than 1.5 GB?


Based upon the answers, df.apply(lambda x: len(x.unique())) is the fastest (notebook).

%timeit df.apply(lambda x: len(x.unique()))
10 loops, best of 3: 49.5 ms per loop

%timeit df.nunique()
10 loops, best of 3: 59.7 ms per loop

%timeit df.apply(pd.Series.nunique)
10 loops, best of 3: 60.3 ms per loop

%timeit df.T.apply(lambda x: x.nunique(), axis=1)
10 loops, best of 3: 60.5 ms per loop
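For files larger than memory, one option (not covered in the answers; the CSV path and chunk size below are hypothetical) is to accumulate per-column sets of values over chunks:

import pandas as pd

# Accumulate the distinct values of each column across chunks, so the
# whole file never has to fit in memory at once.
uniques = {}
for chunk in pd.read_csv('data.csv', chunksize=100000):
    for col in chunk.columns:
        uniques.setdefault(col, set()).update(chunk[col].unique())

counts = pd.Series({col: len(vals) for col, vals in uniques.items()})
print(counts)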

ajknzhol

8 Answers

94

As of pandas 0.20 we can use nunique directly on DataFrames:

df.nunique()
a    4
b    5
c    1
dtype: int64
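Note that nunique skips NaN by default; pass dropna=False to count it as a value. A minimal sketch on a column containing a NaN:

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, np.nan]})

print(df.nunique())              # a    3  (NaN excluded by default)
print(df.nunique(dropna=False))  # a    4  (NaN counted as a value)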

Other legacy options:

You could transpose the df and then use apply to call nunique row-wise:

In [205]:
df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

Out[205]:
   a  b  c
0  0  1  1
1  1  2  1
2  1  3  1
3  2  4  1
4  3  5  1

In [206]:
df.T.apply(lambda x: x.nunique(), axis=1)

Out[206]:
a    4
b    5
c    1
dtype: int64

EDIT

As pointed out by @ajcr the transpose is unnecessary:

In [208]:
df.apply(pd.Series.nunique)

Out[208]:
a    4
b    5
c    1
dtype: int64
EdChum
  • Thanks! Just to clarify: each column of df gets passed into apply one at a time, where pd.Series.nunique computes its unique count? So it basically runs .nunique() on each column? – haneulkim Jan 05 '21 at 02:43
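A minimal sketch confirming this: with the default axis=0, apply passes each column to the function as a pd.Series, so the function runs once per column.

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5]})

# Each column arrives as a Series; returning its type name shows that.
print(df.apply(lambda s: type(s).__name__))
# a    Series
# b    Series
# dtype: object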
7

A pandas Series has a .value_counts() method that gives the count of each distinct value; its length is therefore the number of distinct values you want. Check out the documentation for the function.
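Applied per column, the length of value_counts is the distinct count; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})

# value_counts returns one entry per distinct value, so its length is
# the number of distinct values in the column.
print(df.apply(lambda s: len(s.value_counts())))
# a    4
# b    5
# c    1
# dtype: int64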

CaMaDuPe85
6

Already some great answers here :) but this one seems to be missing:

df.apply(lambda x: x.nunique())

As of pandas 0.20.0, DataFrame.nunique() is also available.

Sander van den Oord
1

Recently, I had the same issue of counting the unique values of each column in a DataFrame, and I found another approach that ran faster than the apply function:

# Choose how you want to store the output; it could be a pd.DataFrame
# or a dict. A dict is used here to demonstrate:
col_uni_val = {}
for i in df.columns:
    col_uni_val[i] = len(df[i].unique())

# Use pprint to display the dict nicely:
import pprint
pprint.pprint(col_uni_val)

For me this ran almost twice as fast as df.apply(lambda x: len(x.unique())).
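A rough harness to check this claim on the frame from the question (a sketch; exact numbers will vary by machine and pandas version):

import timeit

import numpy as np
import pandas as pd

# Same frame as in the question.
df = pd.DataFrame(np.random.randint(1, 100000, (10000, 100)),
                  columns=['col' + str(i) for i in range(100)])

print(timeit.timeit(lambda: {c: len(df[c].unique()) for c in df.columns},
                    number=10))
print(timeit.timeit(lambda: df.apply(lambda x: len(x.unique())),
                    number=10))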

Wendao Liu
1

I found

df.agg(['nunique']).T

to be much faster.
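Note that this returns a one-column DataFrame rather than a Series; a quick sketch on the frame from the accepted answer:

import pandas as pd

df = pd.DataFrame({'a': [0, 1, 1, 2, 3], 'b': [1, 2, 3, 4, 5], 'c': [1, 1, 1, 1, 1]})
print(df.agg(['nunique']).T)
#    nunique
# a        4
# b        5
# c        1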

yami
0
df.apply(lambda x: len(x.unique()))
zehai
0

To pick out only the columns with more than 20 unique values (checking the object-typed columns):

col_with_morethan_20_unique_values_cat = []
for col in data.columns:
    # Only check object (string) columns.
    if data[col].dtype == 'O':
        if len(data[col].unique()) > 20:
            col_with_morethan_20_unique_values_cat.append(data[col].name)

print(col_with_morethan_20_unique_values_cat)
print('total number of columns with more than 20 unique values is',
      len(col_with_morethan_20_unique_values_cat))



# The output will be:
['CONTRACT NO', 'X2', 'X3', ...]
total number of columns with more than 20 unique values is 25
Ayyasamy
0

Adding the example code for the answer given by @CaMaDuPe85:

df = pd.DataFrame({'a':[0,1,1,2,3],'b':[1,2,3,4,5],'c':[1,1,1,1,1]})
df

# df
    a   b   c
0   0   1   1
1   1   2   1
2   1   3   1
3   2   4   1
4   3   5   1


for cs in df.columns:
    # value_counts has one entry per distinct value, so counting its
    # entries gives the number of distinct values in the column.
    print(cs, df[cs].value_counts().count())

# Output

a 4
b 5
c 1
Preetham