I have a large data frame with 371 unique categorical entries. Some of the entries are similar, and in some cases I want to merge categories that may have been separated. For example, here are three categories that I know of:

3d

3d_platformer

3d_vision

I want to combine these under a general category of just 3d. I feel like this should be possible on a small scale, but I want to scale it up to all the categories as well. The problem is that I don't know the names of all my categories. So, in short, the full question is:

How can I search for similar category names and then replace all the similar names with one group name, without searching for each one individually?

1 Answer

Can regular expressions help?

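# regex=True is needed in pandas 2.0+, where str.replace matches literally by default
df['col'] = df['col'].str.replace(r'^3d.*', '3d', regex=True)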
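The pattern above handles one known prefix. A minimal sketch of applying the same idea to every category at once, without listing prefixes by hand, might look like the following. It assumes that related categories share the text before the first underscore and that the bare prefix (such as 3d) already exists as a category of its own. The sample data and the "group" column are hypothetical.

```python
import pandas as pd

# Hypothetical sample data; the column name "col" follows the answer above.
df = pd.DataFrame({"col": ["3d", "3d_platformer", "3d_vision",
                           "rpg", "rpg_maker", "puzzle", "open_world"]})

# Candidate group name: everything before the first underscore.
prefix = df["col"].str.split("_", n=1).str[0]

# Merge only when the bare prefix is itself an existing category,
# so names that merely share a first word are left untouched.
known = set(df["col"])
df["group"] = prefix.where(prefix.isin(known), df["col"])

print(df)
```

Here "open_world" keeps its own name because "open" is not a category in the data. Checking df.groupby("group")["col"].unique() before overwriting the original column is a cheap way to review which names would be merged.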

If you're looking for something closer to semantic similarity, NLP libraries such as Gensim provide methods for computing string similarity:

https://betterprogramming.pub/introduction-to-gensim-calculating-text-similarity-9e8b55de342d

You can try using your category names as the corpus.
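As a lighter stand-in for illustration, the sketch below uses difflib from the Python standard library rather than Gensim. It measures character-level similarity, so it catches near-duplicate spellings, not synonyms or shared prefixes. The function name group_similar, the 0.8 threshold, and the sample names are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def group_similar(names, threshold=0.8):
    """Greedily assign each name to the first representative it resembles."""
    groups = {}  # representative name -> list of member names
    # Shorter names are visited first, so they become the representatives.
    for name in sorted(names, key=len):
        for rep in groups:
            if SequenceMatcher(None, rep, name).ratio() >= threshold:
                groups[rep].append(name)
                break
        else:
            # No existing representative is close enough: start a new group.
            groups[name] = [name]
    return groups


# Hypothetical category names, including one misspelling.
names = ["platformer", "platformers", "platfomer", "puzzle"]
groups = group_similar(names)

# Map every original name to its group's representative.
mapping = {member: rep for rep, members in groups.items() for member in members}
print(mapping)
```

The resulting mapping could then be applied with df['col'].map(mapping). The grouping is greedy and depends on the threshold, so it is worth reviewing the groups by hand and choosing the final group names before replacing anything.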

E-Newman
  • This could work on the small scale of the ones that I know are similar, but still runs into the problems on the unknown categories. – Mackenzie Unger Oct 22 '21 at 21:23
  • What do you mean by "unknown"? Synonyms, like "3d" and "three-dimensional object"? – E-Newman Oct 22 '21 at 21:26
  • There are over 300 categories in my data set, so it is very impractical to go through each set of similar names and check every one. At best I'll narrow this down to maybe 200 categories based on similar names. – Mackenzie Unger Oct 22 '21 at 23:51
  • @MackenzieUnger, I've added info about NLP tools to the answer. Hopefully they are what you're looking for. – E-Newman Oct 23 '21 at 07:32