-1

df,

Name
Sri
Sri,Ram
Sri,Ram,kumar
Ram

I am trying to calculate the value counts for each value. I am not getting my output when using

 df["Name"].values_count()

my desired output is,

 Sri     3
 Ram     3
 Kumar   1
Pyd
  • 6,017
  • 18
  • 52
  • 109

1 Answers1

4

split the column, stack to long format, then count:

df.Name.str.split(',', expand=True).stack().value_counts()

#Sri      3
#Ram      3
#kumar    1
#dtype: int64

Or maybe:

df.Name.str.get_dummies(',').sum()

#Ram      3
#Sri      3
#kumar    1
#dtype: int64

Or concatenate before value_counts:

pd.value_counts(pd.np.concatenate(df.Name.str.split(',')))

#Sri      3
#Ram      3
#kumar    1
#dtype: int64

Timing:

%timeit df.Name.str.split(',', expand=True).stack().value_counts()
#1000 loops, best of 3: 1.02 ms per loop

%timeit df.Name.str.get_dummies(',').sum()
#1000 loops, best of 3: 1.18 ms per loop

%timeit pd.value_counts(pd.np.concatenate(df.Name.str.split(',')))
#1000 loops, best of 3: 573 µs per loop

# option from @Bharathshetty 
from collections import Counter
%timeit pd.Series(Counter((df['Name'].str.strip() + ',').sum().rstrip(',').split(',')))
# 1000 loops, best of 3: 498 µs per loop

# option inspired by @Bharathshetty 
%timeit pd.value_counts(df.Name.str.cat(sep=',').split(','))
# 1000 loops, best of 3: 483 µs per loop
Psidom
  • 209,562
  • 33
  • 339
  • 356
  • what if I have lower case "sri" in index 2nd row, I want to include tht too – Pyd Oct 29 '17 at 02:51
  • 1
    If you want to ignore case, you can convert it to lower/upper case before proceeding. `df.Name.str.lower()....` – Psidom Oct 29 '17 at 02:52
  • I am not sure what you mean *apply ignorecase*. I don't think you can do `ignore case` while value counting. What's wrong with `str.lower()`? – Psidom Oct 29 '17 at 02:58
  • that will also work fine, is it possible without converting it to lower? like using re.IGNORECASE – Pyd Oct 29 '17 at 02:59
  • @pyd no lower is lower and upper is upper you cant ignore case like that while comparing. – Bharath M Shetty Oct 29 '17 at 03:00
  • @Psidom can you also add timings for `pd.Series(Counter((df['Name'].str.strip() + ',').sum().rstrip(',').split(',')))`. Felt it is faster than all of the options. – Bharath M Shetty Oct 29 '17 at 03:01
  • That's regular expression option. I am pretty sure you can't do things like that in `value_counts`. – Psidom Oct 29 '17 at 03:01
  • @Bharathshetty Sure, will do. Looks like an interesting option. – Psidom Oct 29 '17 at 03:03
  • And for a large dataframe `counter` gives 1.78 ms and `np.concatenate `gives 12ms. I dont suggest numpy for string operations. – Bharath M Shetty Oct 29 '17 at 03:05
  • 1
    Deserves an upvote for the inspired option :). I was thinking only join and forgot cat. Really nice one. – Bharath M Shetty Oct 29 '17 at 03:06
  • @Bharathshetty I agree. `numpy` seems indeed not optimized for string operations. – Psidom Oct 29 '17 at 03:07
  • @Bharathshetty, in Jezrael answer we are applying re.IGNORECASE that's why i asked, https://stackoverflow.com/questions/46930681/mapping-matching-word-count-on-a-column-using-pandas-in-python – Pyd Oct 29 '17 at 03:11
  • @Bharathshetty Join and split is the trick here. Thanks for bringing it up. – Psidom Oct 29 '17 at 03:11
  • re.IGNORECASE applies when you are using regex related methods, such as str.contains, str.findall, str.extract, etc. `value_counts` is not one of them. – Psidom Oct 29 '17 at 03:15
  • @pyd Regex is used to match the pattern not compare. Here we compare two strings to increase the count. Don't fall into misconception that we can compare two different cases as one since both have different ascii values. .tolower or toupper is the fastest approach. – Bharath M Shetty Oct 29 '17 at 03:16
  • Thanks for the clear explanation @Bharath shetty, Psidom . From where I can learn more about pandas any online resources? – Pyd Oct 29 '17 at 03:19
  • 3
    Keep tuned to stack overflow try to answer questions, best way to learn pandas. – Bharath M Shetty Oct 29 '17 at 03:19