
So I have a dataset with one description column (the free-text description of an IT trouble ticket) and one target column (the group the ticket belongs to, e.g. Group 0 or Group 1 - the group type, e.g. access issues, is not provided).

The thing is: I have 45 different target classes - Group 0, Group 1, ..., Group 45. There is a pretty long tail, with some of these groups having less than 0.1% of the total tickets. Instead of just directly clubbing all the rare groups together into a single catch-all group, I wanted to see if there was any way to club these smaller groups with other groups which are 'similar' to them based on the IT trouble ticket descriptions. For example, if a larger group has tickets describing access issues and a smaller group has tickets pertaining to login issues (based on the text descriptions), I would prefer to club these two groups together.
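One way to make that "club similar groups" idea concrete is to build one vector per group and merge each rare group into its most similar non-rare group. A minimal sketch using TF-IDF centroids (the DataFrame `df` and the column names "description" and "group" are assumptions, as is the 0.1% rarity threshold):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# df is the assumed ticket DataFrame with "description" and "group" columns.
vectorizer = TfidfVectorizer(stop_words="english", max_features=20000)
X = vectorizer.fit_transform(df["description"])

# One centroid (mean TF-IDF vector) per group.
groups = sorted(df["group"].unique())
labels = df["group"].to_numpy()
centroids = np.vstack([
    np.asarray(X[np.flatnonzero(labels == g)].mean(axis=0)).ravel()
    for g in groups
])

# Pairwise cosine similarity between group centroids.
sim = cosine_similarity(centroids)

# Fold each rare group (< 0.1% of tickets, the threshold mentioned above)
# into its most similar non-rare group.
counts = df["group"].value_counts()
rare = {g for g in groups if counts[g] / len(df) < 0.001}
merge_map = {}
for i, g in enumerate(groups):
    if g in rare:
        order = np.argsort(sim[i])[::-1]          # most similar first
        merge_map[g] = next(groups[j] for j in order
                            if groups[j] != g and groups[j] not in rare)

df["merged_group"] = df["group"].replace(merge_map)
```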

I thought of creating a separate Word2Vec or GloVe embedding for each group, but I am unable to figure out how to find similarities between these per-group vectors. Further, creating 45 different Word2Vec models is pretty computationally painful. So I am a little stuck on this. Any ideas on how to approach this? Any help would be great.
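On the 45-separate-models concern: training a single Word2Vec model on the whole corpus and then averaging word vectors per group gives one comparable vector per group, so the similarity step reduces to a cosine-similarity matrix. A minimal sketch with gensim, again assuming the same `df`, "description" and "group" names:

```python
import numpy as np
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess
from sklearn.metrics.pairwise import cosine_similarity

# Tokenise every ticket description.
tokens = df["description"].map(simple_preprocess)

# One shared model trained on every ticket, regardless of group.
w2v = Word2Vec(sentences=tokens.tolist(), vector_size=100, window=5,
               min_count=2, epochs=10, workers=4)

def group_vector(token_lists):
    """Average the word vectors of all in-vocabulary tokens in one group."""
    vecs = [w2v.wv[t] for toks in token_lists for t in toks if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

groups = sorted(df["group"].unique())
group_vecs = np.vstack([group_vector(tokens[df["group"] == g]) for g in groups])

# sim[i, j] close to 1 means groups i and j use similar vocabulary,
# which is the signal for deciding which small group to club with which big one.
sim = cosine_similarity(group_vecs)
```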

Thanks!

Swami
  • What's the ultimate goal? (Why is the similarity between some of the groups important?) Why aren't the ticket groups descriptively labeled? Why are you considering having many separate models, rather than just one trained on the entire dataset? (Word2vec benefits from large, varied training sets – N groups of documents all combined into one corpus, as long as they're in the same language & general domain, will likely yield better word-vectors than N separate data-starved models.) – gojomo May 13 '20 at 16:26
  • On why the ticket groups haven't been descriptively labeled: honestly, it is just the data set that has been provided. This is a project I am doing as part of an academic program. I had already tried training on the entire data set with a bidirectional RNN and LSTM using uni-, bi- and trigram Word2Vec embeddings, but the accuracy was low (49-58%), with low precision/recall across the board. I also tried GloVe but with no major improvement. Hence, I think I could explore clubbing these target groups. But I was worried that the variance in the content of a clubbed group would be high because the texts are unrelated. Your thoughts? – Swami May 13 '20 at 18:30
  • But what's the goal of the training/exercise/project? Are you trying to be able to classify new unknown texts into existing groups? If so, there are text-preprocessing tricks to try, & alternative text-classification algorithms, and combinations of techniques. But if you were planning to create separate word2vec models for each subgroup, that's exactly the wrong direction if classification of future unknowns using shared language is the goal. And if your actual final project goal is something else, that matters for deciding what to try. – gojomo May 13 '20 at 21:41

0 Answers