Find repeated words in a column and sort it according to number of occurence using pandas

Question

        A                             B
 1) Italy                   Transport for London.....
 2) Italy                   Roseanne Barr Actor leavin..... 
 3) America                 Americas Transport for London........ 
 4) America                 Transport for London.....
 5) America                 Roseanne Barr Actor leavin..... 
 6) France                  Americas Transport for London........

This is a sample of my excel sheet I want it sorted according to the number of occurence in column A. Like America is present 3 times in column then rows with America Should be shifted to The Top and Rows with Italy Following America and then rows with france should be at end because it has lowest occurence in the column A.

score 1 · Accepted Answer · answered May 31 '18 at 07:09

Use transform with argsort in descending order for positions and select by iloc:

df = df.iloc[(-df.groupby('A')['A'].transform('size')).argsort()]
print (df)
          A                              B
3)  America  Americas Transport for London
4)  America           Transport for London
5)  America     Roseanne Barr Actor leavin
1)    Italy           Transport for London
2)    Italy     Roseanne Barr Actor leavin
6)   France  Americas Transport for London

Or create new column and sort:

df['new'] = df.groupby('A')['A'].transform('size')

df = df.sort_values('new', ascending=False)
print (df)
          A                              B  new
3)  America  Americas Transport for London    3
4)  America           Transport for London    3
5)  America     Roseanne Barr Actor leavin    3
1)    Italy           Transport for London    2
2)    Italy     Roseanne Barr Actor leavin    2
6)   France  Americas Transport for London    1

jpp · Answer 2 · 2018-05-31T16:27:21.187

0

Using collections.Counter to create a dictionary of counts:

from collections import Counter

df = pd.DataFrame([['Italy', 'Transport for London'],
                   ['Italy', 'Roseanne Barr Actor leavin'],
                   ['America', 'Americas Transport for London'],
                   ['America', 'Transport for London'],
                   ['America', 'Roseanne Barr Actor leavin'],
                   ['France', 'Americas Transport for London']],
                  columns=['A', 'B'])

# calculate counts
c = Counter(df['A'])

# apply reordering
df = df.iloc[df['A'].map(c).argsort()[::-1]]

# save to excel
df.to_excel('file.xlsx', index=False)

Result:

print(df)

         A                              B
4  America     Roseanne Barr Actor leavin
3  America           Transport for London
2  America  Americas Transport for London
1    Italy     Roseanne Barr Actor leavin
0    Italy           Transport for London
5   France  Americas Transport for London

edited May 31 '18 at 16:27

answered May 31 '18 at 08:31

jpp

159,742
34
281
339

This code gives me a messed up output when data is big. For example America Is on top because it is three times used. but when we go to the middle of the document there are some mixed results like france and italy are repeated again and again. To be exact First few entries are sorted fine and following entries are messed up – Tayyab Nasir May 31 '18 at 16:20
@TayyabNasir, To be *exact*, you would have to show a [mcve]. However, since you have another solution all well & good. For what it's worth `collections.Counter` is part of the standard library, it's highly unlikely it has a bug. – jpp May 31 '18 at 16:21
just noticed there is same problem in other solution as well. I think problem is in the part where I am saving the data. please can you edit your code with saving it to excel part also? – Tayyab Nasir May 31 '18 at 16:25

Find repeated words in a column and sort it according to number of occurence using pandas

2 Answers2

Linked