Is there a way to create labels from a column of strings in a dataset using Python? Each string contains between 1 and about 10 descriptive words, and most of the time the category will be more than one word (example below).

I do not want to pre-populate the list of categories, but I would like the algorithm to create the categories based on common themes.

This is what I would like:

Input:

**RepairName**
windscreen chip repair
windscreen chip repair x2
windscreen chip repairs
x4 tyre replacement
head light globe replacement
headlight bulb replace
headlight globe replacement
headlight replacement
tyre repalcement
tyre replacement
tyre replacement lhr

Output:

**RepairName**                    **Category**            
windscreen chip repair             windscreen chip repair
windscreen chip repair x2          windscreen chip repair
windscreen chip repairs            windscreen chip repair
x4 tyre replacement                tyre replacement
head light globe replacement       headlight replacement
headlight bulb replace             headlight replacement
headlight globe replacement        headlight replacement
headlight                          headlight replacement
headlight replacement              headlight replacement
tyre repalcement                   tyre replacement
tyre replacement lhr               tyre replacement
tyre replacement                   tyre replacement

This is what I have tried:

  1. Loads of sentiment analysis examples, but my data is not positive or negative.
  2. Counter from collections, but this only counts how often each word occurs in the dataset column rather than finding common themes (see first example below).
  3. nltk.FreqDist, but that counted 1 per observation (see second example below).
  4. wordnet from nltk.corpus, which seems to categorise single-word strings only.

First example - counter from collections

    from collections import Counter
    categories = Counter(" ".join(df["RepairName"]).split()).most_common(1000)
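To show what I mean by this only counting words, here is the same call on the sample rows from the input above, without pandas. The list name `repair_names` is just a stand-in for `df["RepairName"]`.

```python
from collections import Counter

# Sample rows copied from the input above (stand-in for df["RepairName"]).
repair_names = [
    "windscreen chip repair",
    "windscreen chip repair x2",
    "windscreen chip repairs",
    "x4 tyre replacement",
    "head light globe replacement",
    "headlight bulb replace",
    "headlight globe replacement",
    "headlight replacement",
    "tyre repalcement",
    "tyre replacement",
    "tyre replacement lhr",
]

# Joining and splitting on whitespace turns the column into single words,
# so the counter can never see a multi-word theme such as "tyre replacement".
word_counts = Counter(" ".join(repair_names).split())

for word, count in word_counts.most_common(5):
    print(word, count)
```

Every key is a single word, and the misspelling "repalcement" is counted separately from "replacement", so this does not get me to the categories I want.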

Results: [image: Counter from collections results]

nltk.FreqDist example:

    import nltk
    categories2 = df.groupby(['RepairName']).apply(lambda x: nltk.FreqDist(nltk.tokenize.word_tokenize(' '.join(x))))

nltk.FreqDist results: [image: nltk.FreqDist results]

wordnet example:

    from nltk.corpus import wordnet
    df['C'] = df['RepairName'].apply(wordnet.synsets)

I understand this is probably quite a complex beast of a subject, but any advice would be greatly appreciated! I would do it manually, but aside from wanting to learn something new, the dataset is around 50K observations.
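To make the goal more concrete, here is a rough sketch of the kind of grouping I have in mind, using only difflib from the standard library. The function name `group_similar`, the greedy "first seen string becomes the label" rule, and the 0.6 similarity threshold are all my own assumptions for illustration. I have not checked that it reproduces the output table above or that it would cope with 50K rows.

```python
from difflib import SequenceMatcher


def group_similar(names, threshold=0.6):
    """Greedily map each name to the most similar earlier name (hypothetical helper)."""
    representatives = []  # first member of each group, reused as the group's label
    labels = {}
    for name in names:
        cleaned = name.lower().strip()
        best_rep, best_score = None, 0.0
        for rep in representatives:
            # ratio() is a character-level similarity between 0.0 and 1.0
            score = SequenceMatcher(None, cleaned, rep).ratio()
            if score > best_score:
                best_rep, best_score = rep, score
        if best_rep is not None and best_score >= threshold:
            labels[name] = best_rep
        else:
            # nothing similar enough yet, so this string starts a new group
            representatives.append(cleaned)
            labels[name] = cleaned
    return labels


sample = ["tyre replacement", "tyre repalcement", "windscreen chip repair"]
for name, label in group_similar(sample).items():
    print(name, "->", label)
```

This handles typos such as "repalcement", but it only compares characters, so I would not expect it to know that "globe" and "bulb" mean the same thing, and the result depends on the order of the rows.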

sab
  • **Food for thought:** Can unsupervised models really create a classification model with labels? – alvas Jun 15 '20 at 08:37
  • **More food**: Why is it called **un**supervised? What is a synset? What is the ultimate task you want to achieve (the pre-defined labels in your example definitely aren't the final task)? – alvas Jun 15 '20 at 08:39
  • Thanks @alvas. I thought it may be an unsupervised ML algorithm, but every time I Googled, I found supervised algorithms. Are you implying that this may not be possible? Or rather, that if it is, it would be quite difficult? I have started to manually label the data and this may be the best solution. The final task is to create new categories and only offer these to users as options in future rather than allowing unstructured data. – sab Jun 15 '20 at 22:56
  • https://stackoverflow.com/questions/60359628/generating-dictionaries-to-categorize-tweets-into-pre-defined-categories-using-n =) – alvas Jun 16 '20 at 00:13

0 Answers