1

I have a list of company names in a pandas DataFrame. I want to group the names that are similar, review each group, and create a standard name for it. Most of the solutions I have seen map each value to a known standard value, but I just want to group the entries in the list that are similar. In many cases the similar names do not start with the same word.

Example:

    ANADARKO E & P CO LP
    E & P COMPANY ANADARKO  LIMITED PRTNRSHIP
    E & P ONSHORE LLC ANADARKO 
    PET ANADARKO 
    ANADARKO PET CORP
    ANADARKO PETROLEUM CORPORATION
    PROD ANADARKO 
    ANADARKO PROD CO
    ANADARKO PRODUCTION COMPANY

If I had a standard list, fuzzywuzzy would be a great fit. How do we group values when there is no standard list?

Vaibav

3 Answers

0

This should solve your problem!

# Create a df

import pandas as pd

data = {'names': ['ANADARKO E & P CO LP',
    'E & P COMPANY ANADARKO  LIMITED PRTNRSHIP',
    'E & P ONSHORE LLC ANADARKO ',
    'PET ANADARKO ',
    'ANADARKO PET CORP',
    'ANADARKO PETROLEUM CORPORATION',
    'PROD ANADARKO ',
    'ANADARKO PROD CO',
    'ANADARKO PRODUCTION COMPANY', 'test', 'test2']}

df = pd.DataFrame(data)
print(df)

                                        names
0                        ANADARKO E & P CO LP
1   E & P COMPANY ANADARKO  LIMITED PRTNRSHIP
2                 E & P ONSHORE LLC ANADARKO 
3                               PET ANADARKO 
4                           ANADARKO PET CORP
5              ANADARKO PETROLEUM CORPORATION
6                              PROD ANADARKO 
7                            ANADARKO PROD CO
8                 ANADARKO PRODUCTION COMPANY
9                                        test
10                                      test2

# Find the string 'ANADARKO' in that df

look = df[df['names'].str.contains('ANADARKO')]
print(look)

                                       names
0                       ANADARKO E & P CO LP
1  E & P COMPANY ANADARKO  LIMITED PRTNRSHIP
2                E & P ONSHORE LLC ANADARKO 
3                              PET ANADARKO 
4                          ANADARKO PET CORP
5             ANADARKO PETROLEUM CORPORATION
6                             PROD ANADARKO 
7                           ANADARKO PROD CO
8                ANADARKO PRODUCTION COMPANY
  • I have a list of more than 50,000 names and I will not have any keyword like ANADARKO to pass. I want to group them without passing any keywords. – Vaibav Jul 22 '20 at 04:07
0

Check out this article: https://towardsdatascience.com/group-thousands-of-similar-spreadsheet-text-cells-in-seconds-2493b3ce6d8d

You will probably want to run cleanco first to standardize the names:

from textpack import tp
from cleanco import cleanco


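# Assumes an older cleanco release that still ships the cleanco class;
# newer releases may expose basename() instead.
df['Name_Trimmed'] = df['names'].apply(lambda x: cleanco(x).clean_name() if isinstance(x, str) else x)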
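
If cleanco is not available, the same idea can be approximated by hand. This is only a sketch: `normalize_name` and `LEGAL_TOKENS` are hypothetical names, and the token list is an assumption built from the sample names in the question, not a complete list of legal forms.

```python
import re

# Hypothetical, deliberately short list of legal-form tokens to drop.
# Extend it after reviewing your own data.
LEGAL_TOKENS = {"CO", "COMPANY", "CORP", "CORPORATION", "LLC", "LP",
                "LIMITED", "PRTNRSHIP", "INC", "LTD"}

def normalize_name(name):
    """Uppercase, strip punctuation, drop legal-form tokens, sort the rest."""
    if not isinstance(name, str):
        return name  # leave NaN / non-string cells untouched
    tokens = re.findall(r"[A-Z0-9]+", name.upper())
    kept = [t for t in tokens if t not in LEGAL_TOKENS]
    # Sorting makes the result independent of word order.
    return " ".join(sorted(kept))

print(normalize_name("ANADARKO PET CORP"))
print(normalize_name("PET ANADARKO "))
```

Both calls print the same normalized string, so names that differ only in word order or legal suffix already collapse before any fuzzy matching. It can be applied with `df['names'].apply(normalize_name)`.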

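Then use n-grams and TF-IDF through the article's code (textpack):

import pandas as pd

# Group similar values in the 'Name_Trimmed' column of the CSV file
new_df = tp.read_csv('./_________.csv', ['Name_Trimmed'], match_threshold=0.85, ngram_remove=r'[,-./]')
new_df.run()
new_df.export_csv('./ngram_grps.csv')

# Load the grouped output and count the groups
df2 = pd.read_csv('ngram_grps.csv')
print("Ngram group Count =", len(df2['Group'].unique()))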
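
For reference, the n-gram plus TF-IDF idea can also be sketched directly with scikit-learn and SciPy. This is not textpack's actual implementation: the `threshold` of 0.5, the character trigrams, and linking names through connected components are all assumptions for illustration.

```python
import pandas as pd
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components
from sklearn.feature_extraction.text import TfidfVectorizer

names = ['ANADARKO E & P CO LP',
         'E & P COMPANY ANADARKO  LIMITED PRTNRSHIP',
         'E & P ONSHORE LLC ANADARKO ',
         'PET ANADARKO ',
         'ANADARKO PET CORP',
         'ANADARKO PETROLEUM CORPORATION',
         'PROD ANADARKO ',
         'ANADARKO PROD CO',
         'ANADARKO PRODUCTION COMPANY', 'test', 'test2']

# Character trigrams inside word boundaries ignore word order.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(names)  # rows are L2-normalized by default

# For L2-normalized rows, the dot product is the cosine similarity.
similarity = (X @ X.T).tocoo()

threshold = 0.5  # assumed cutoff; tune it by reviewing the groups
mask = similarity.data >= threshold
graph = coo_matrix(
    (similarity.data[mask], (similarity.row[mask], similarity.col[mask])),
    shape=similarity.shape,
)

# Names linked directly or through a chain of similar names share a label.
n_groups, labels = connected_components(graph, directed=False)

groups = pd.DataFrame({"names": names, "group": labels}).sort_values("group")
print(groups)
print("Group count =", n_groups)
```

Because groups are formed by chaining, a low threshold can merge unrelated names through intermediate matches, so the output is meant to be reviewed rather than trusted blindly.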
bdubs88
-1

How about doing it this way?

document = [
    "This is the most beautiful place in the world.",
    "This man has more skills to show in cricket than any other game.",
    "Hi there! how was your ladakh trip last month?",
    "There was a player who had scored 200+ runs in single cricket innings in his career.",
    "I have got the opportunity to travel to Paris next year for my internship.",
    "May be he is better than you in batting but you are much better than him in bowling.",
    "That was really a great day for me when I was there at Lavasa for the whole night.",
    "That’s exactly I wanted to become, a highest ratting batsmen ever with top scores.",
    "Does it really matter wether you go to Thailand or Goa, its just you have spend your holidays.",
    "Why don’t you go to Switzerland next year for your 25th Wedding anniversary?",
    "Travel is fatal to prejudice, bigotry, and narrow mindedness., and many of our people need it sorely on these accounts.",
    "Stop worrying about the potholes in the road and enjoy the journey.",
    "No cricket team in the world depends on one or two players. The team always plays to win.",
    "Cricket is a team game. If you want fame for yourself, go play an individual game.",
    "Because in the end, you won’t remember the time you spent working in the office or mowing your lawn. Climb that goddamn mountain.",
    "Isn’t cricket supposed to be a team sport? I feel people should decide first whether cricket is a team game or an individual sport.",
]


from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import numpy as np
import pandas as pd

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(document)

true_k = 2
model = KMeans(n_clusters=true_k, init='k-means++', max_iter=100, n_init=1)
model.fit(X)

order_centroids = model.cluster_centers_.argsort()[:, ::-1]
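# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
terms = vectorizer.get_feature_names_out()

# Print the ten highest-weighted terms of each cluster centroid
for i in range(true_k):
    print("Cluster %d:" % i)
    for ind in order_centroids[i, :10]:
        print('%s' % terms[ind])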

Result...

Cluster 0:
cricket
team
game
world
better
year
really
travel
place
beautiful
Cluster 1:
worrying
road
enjoy
journey
stop
potholes
year
highest
goa
goddamn

Finally, you can use the fitted model to make predictions:

print("\n")
print("Prediction")
X = vectorizer.transform(["Nothing is easy in cricket. Maybe when you watch it on TV, it looks easy. But it is not. You have to use your brain and time the ball."])
predicted = model.predict(X)
print(predicted)

Result...

Prediction
[1]
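
KMeans needs the number of clusters up front, which is usually unknown for a list of company names. A possible variation, sketched here under assumptions, is agglomerative clustering with a distance cutoff instead of a fixed `k`. The character trigrams and the `distance_threshold` of 1.0 are assumptions, not tuned values; for L2-normalized TF-IDF rows, a Euclidean distance of 1.0 corresponds to a cosine similarity of 0.5.

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer

names = ['ANADARKO E & P CO LP',
         'E & P COMPANY ANADARKO  LIMITED PRTNRSHIP',
         'E & P ONSHORE LLC ANADARKO ',
         'PET ANADARKO ',
         'ANADARKO PET CORP',
         'ANADARKO PETROLEUM CORPORATION',
         'PROD ANADARKO ',
         'ANADARKO PROD CO',
         'ANADARKO PRODUCTION COMPANY', 'test', 'test2']

# Character trigrams cope with abbreviations and word order better than
# whole words do for short strings like company names.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(names).toarray()  # this estimator needs dense input

# n_clusters=None lets the distance cutoff decide how many groups emerge.
model = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.0,  # assumed cutoff; tune by reviewing the groups
    linkage="average",
)
labels = model.fit_predict(X)

groups = pd.DataFrame({"names": names, "group": labels}).sort_values("group")
print(groups)
```

The dense matrix and the pairwise merging make this expensive for very large lists, so for something like 50,000 names it may be necessary to block the data first (for example, by a shared token) and cluster within each block.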
ASH