My dataset looks like the sample below.

[Image: sample rows of the dataset]

The Subject column holds the email subject, and the Problem description and Problem details columns hold the email body.

Based on keywords in both the subject and the email body, I need to classify which queue each email should belong to.

The Previous queue column contains 25+ different categories.

My DataFrame has shape (60697, 4).

Please advise on the approach I should follow for this classification. Which ML models should I use to train and test on this data?

I know a little about natural language tokenization concepts.

The classification is similar to Gmail inbox classification (Primary, Social and Promotions). However, here I have to categorize into 25+ queues.

2 Answers

I'd try the following:

  1. Vectorize your subjects and email bodies using CountVectorizer or TfidfVectorizer to get your X matrix. You may want to try different ngram_range values to improve prediction performance.
  2. Factorize your classes so that each class maps to one integer; this will be your y vector.
  3. Split X into X_train and X_test, and y into y_train and y_test.
  4. Train a LogisticRegression model using X_train and y_train.
  5. Test it on X_test and y_test ...
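
The steps above could be sketched roughly as follows with pandas and scikit-learn. This is only a minimal illustration, not a tuned solution: the toy rows, the queue names (Billing, Access, Hardware) and the exact column names are assumptions standing in for the real 60697-row DataFrame. It also reorders steps 1 and 3 slightly, splitting first so that the vectorizer learns its vocabulary from the training rows only.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for the real data; rows and queue names are made up.
rows = [
    ("Invoice wrong", "Charged twice", "Please refund the duplicate charge", "Billing"),
    ("Refund request", "Payment issue", "Card was billed for a cancelled plan", "Billing"),
    ("Billing question", "Invoice missing", "I did not receive this month's invoice", "Billing"),
    ("Payment failed", "Card declined", "Payment fails at checkout every time", "Billing"),
    ("Cannot log in", "Password reset", "Reset link never arrives by email", "Access"),
    ("Account locked", "Too many attempts", "Account locked after failed logins", "Access"),
    ("Login error", "Two-factor code", "The login code is rejected as invalid", "Access"),
    ("Password expired", "Cannot change password", "Password change page shows an error", "Access"),
    ("Laptop broken", "Screen flickers", "Laptop screen flickers after startup", "Hardware"),
    ("Printer offline", "Cannot print", "Office printer shows offline status", "Hardware"),
    ("Keyboard issue", "Keys stuck", "Several keyboard keys do not respond", "Hardware"),
    ("Monitor dead", "No display", "External monitor shows no signal", "Hardware"),
]
df = pd.DataFrame(
    rows,
    columns=["Subject", "Problem description", "Problem details", "Previous queue"],
)

# Join subject and body columns into one text string per email.
text = (
    df["Subject"].fillna("")
    + " "
    + df["Problem description"].fillna("")
    + " "
    + df["Problem details"].fillna("")
)

# Step 2: one integer per queue; queue_names maps the integers back to names.
y, queue_names = pd.factorize(df["Previous queue"])

# Step 3: stratified split so every queue appears in both parts.
text_train, text_test, y_train, y_test = train_test_split(
    text, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: TF-IDF features over unigrams and bigrams, fitted on training text only.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(text_train)
X_test = vectorizer.transform(text_test)

# Step 4: train on the training split.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Step 5: evaluate on the held-out split and map predictions back to queue names.
accuracy = model.score(X_test, y_test)
predicted_queues = queue_names[model.predict(X_test)]
print(f"held-out accuracy: {accuracy:.2f}")
print(list(predicted_queues))
```

On the real data, with 25+ queues of probably uneven size, a per-class report such as sklearn.metrics.classification_report is likely to be more informative than a single accuracy figure.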
MaxU - stand with Ukraine
  • Thank you, Max, for the input! Let me try the approach you described; I hope it works the way I want. Thanks a lot! :) – Manikant Kella Feb 22 '18 at 12:57
  • Here's a recent blog article about multi-class classification similar to what @MaxU proposed: https://towardsdatascience.com/multi-class-text-classification-with-scikit-learn-12f1e60e0a9f – Adnan S Feb 23 '18 at 04:34

You could give fastText a try. Here is a link to a tutorial.

fastText uses word embeddings in the context of supervised classification. Its key advantage is that it is very fast, as its name suggests. It can handle 1000+ categories/labels easily.
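
fastText's supervised mode reads a plain text file with one example per line, where each label is prefixed with __label__. A minimal sketch of preparing such lines is shown here; the helper name to_fasttext_line and the sample rows are hypothetical, and the lowercasing and whitespace cleanup are just one reasonable preprocessing choice.

```python
import re


def to_fasttext_line(queue: str, subject: str, body: str) -> str:
    """Format one email as a fastText supervised training line."""
    # Labels must not contain whitespace, so replace it with underscores.
    label = "__label__" + re.sub(r"\s+", "_", queue.strip())
    # Collapse newlines and repeated spaces so each example stays on one line.
    text = re.sub(r"\s+", " ", f"{subject} {body}").strip().lower()
    return f"{label} {text}"


# Hypothetical sample rows: (queue, subject, body).
rows = [
    ("Billing Support", "Invoice is wrong", "I was charged twice\nthis month."),
    ("Access", "Cannot log in", "The reset link never arrives."),
]

lines = [to_fasttext_line(*row) for row in rows]
training_text = "\n".join(lines) + "\n"
print(training_text)
```

Once written to a file, this text can be passed to fasttext.train_supervised(input=path) from the fastText Python bindings, and the returned model's predict method then gives the most likely label for a new email.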

vumaasha