Is there a way to create labels from a column of strings in a dataset using Python? Each string contains between 1 and about 10 descriptive words, and most of the time the category will be more than one word (example below).
I do not want to pre-populate the list of categories, but I would like the algorithm to create the categories based on common themes.
This is what I would like:
Input:

| RepairName |
| --- |
| windscreen chip repair |
| windscreen chip repair x2 |
| windscreen chip repairs |
| x4 tyre replacement |
| head light globe replacement |
| headlight bulb replace |
| headlight globe replacement |
| headlight replacement |
| tyre repalcement |
| tyre replacement |
| tyre replacement lhr |
Output:

| RepairName | Category |
| --- | --- |
| windscreen chip repair | windscreen chip repair |
| windscreen chip repair x2 | windscreen chip repair |
| windscreen chip repairs | windscreen chip repair |
| x4 tyre replacement | tyre replacement |
| head light globe replacement | headlight replacement |
| headlight bulb replace | headlight replacement |
| headlight globe replacement | headlight replacement |
| headlight | headlight replacement |
| headlight replacement | headlight replacement |
| tyre repalcement | tyre replacement |
| tyre replacement lhr | tyre replacement |
| tyre replacement | tyre replacement |
This is what I have tried:
- Loads of sentiment analysis examples, but my data is not positive or negative.
- `Counter` from `collections`, but this only counts how often each word occurs in the dataset column rather than finding common themes (see the first example below).
- `nltk.FreqDist`, but that counted 1 per observation (see the second example below).
- `wordnet` from `nltk.corpus`, which seems to categorise single-word strings only.
First example, `Counter` from `collections`:

    from collections import Counter

    # Join every row into one string, split on whitespace, and count each word
    categories = Counter(" ".join(df["RepairName"]).split()).most_common(1000)

Results: [screenshot: Counter from collections results]
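
To make this reproducible without my real data, here is a minimal sketch of the same `Counter` idea run on the sample rows from the input above. The list name `repair_names` is just something I made up for illustration; it stands in for `df["RepairName"]`.

```python
from collections import Counter

# Sample rows typed in by hand from the input above
# (illustrative stand-in for df["RepairName"], not the real 50K dataset)
repair_names = [
    "windscreen chip repair",
    "windscreen chip repair x2",
    "windscreen chip repairs",
    "x4 tyre replacement",
    "head light globe replacement",
    "headlight bulb replace",
    "headlight globe replacement",
    "headlight replacement",
    "tyre repalcement",
    "tyre replacement",
    "tyre replacement lhr",
]

# Same approach as my attempt: pool every word from every row, then count them
word_counts = Counter(" ".join(repair_names).split())
top_words = word_counts.most_common(5)

for word, count in top_words:
    print(f"{word}: {count}")
```

The most frequent entries are single words such as `replacement` and `tyre`, not multi-word themes like "tyre replacement", and the misspelt `repalcement` is counted as its own word. That is the gap I am trying to close.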
`nltk.FreqDist` example:

    import nltk

    categories2 = df.groupby(['RepairName']).apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))

nltk.FreqDist results:
`wordnet` example:

    from nltk.corpus import wordnet

    df['C'] = df['RepairName'].apply(wordnet.synsets)

I understand this is probably quite a complex beast of a subject, but any advice would be greatly appreciated! I would do it manually, but aside from wanting to learn something new, the dataset has around 50K observations.