To improve the recommender system for Buyer Material Groups, our company wants to train a model on customers' historical spend data. The model should be trained on historical "Short text" descriptions to predict the appropriate BMG. The dataset has more than 500,000 rows, and the text descriptions are multilingual (up to 40 characters each).

1. Question: Can I use supervised learning given that the descriptions are in multiple languages? If yes, are classic approaches like multinomial Naive Bayes or SVM suitable?

2. Question: If the first model does not perform well and I want to improve it by using unsupervised multilingual embeddings to build a classifier, how can I train that classifier on the numerical labels later?

If you have other ideas or approaches, please feel free to share them :). (It is a simple text classification problem.)

yolo25

1 Answer


Can I use supervised learning if I consider the fact that the descriptions are in multiple languages?

Yes, this is not a problem, except that it makes your data sparser. If you really have only 40 characters (not 40 words?) per item, you may not have enough data. The other main challenge for supervised learning will be whether you have labels for the data.

If yes, are classic approaches like multinomial Naive Bayes or SVM suitable?

They will work as well as they always have, though these days building a vector representation is probably a better choice.
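As a concrete starting point, here is a minimal sketch of such a classic baseline using scikit-learn: character n-gram TF-IDF features feeding a multinomial Naive Bayes classifier. Character n-grams are a common trick for short multilingual text, since they avoid language-specific tokenization. The example rows and BMG codes below are illustrative, not from the real dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-ins for the real "Short text" descriptions and BMG codes.
texts = [
    "Electrical screwdriver calibration",
    "Engineering Support 9903211",
    "First Aid course Katarzyna",
    "Screwdriver torque calibration",
]
labels = ["14060103", "10020100", "10020305", "14060103"]

# char_wb n-grams work across languages without a tokenizer.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    MultinomialNB(),
)
model.fit(texts, labels)
print(model.predict(["Calibration of screwdriver"])[0])
```

Swapping `MultinomialNB()` for `LinearSVC()` gives the SVM variant with no other changes; on 500,000 rows both train quickly and make a reasonable benchmark before moving to embeddings.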

If I want to improve the first model in case it is not performing well, and use unsupervised multilingual embeddings to build a classifier, how can I train this classifier on the numerical labels later?

Assuming the numerical labels are labels on the original data, you can add them as tokens like LABEL001, and the model can learn representations of them if you want to build an unsupervised recommender.
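A sketch of that label-as-token idea: prepend a synthetic token per numeric BMG code to each description, so an unsupervised embedding model (word2vec, fastText) learns a vector for the label alongside the word vectors; at prediction time, the label vector nearest to a new description's vector is the recommendation. The `LABEL_` prefix, helper name, and data are made up for illustration.

```python
def to_training_sentence(description: str, bmg_code: str) -> list[str]:
    """Tokenize naively and inject the BMG code as an extra token."""
    return [f"LABEL_{bmg_code}"] + description.lower().split()

rows = [
    ("Electrical screwdriver calibration", "14060103"),
    ("Engineering Support 9903211", "10020100"),
]
sentences = [to_training_sentence(text, code) for text, code in rows]
# sentences can now be fed to an unsupervised trainer,
# e.g. gensim's Word2Vec(sentences, ...).
print(sentences[0])
```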


Honestly, these days I wouldn't start with Naive Bayes or other classical models; I'd go straight to word vectors as a first test for clustering. Using fastText or word2vec is pretty straightforward. The main problem is that if you really have only 40 characters per item, that just might not be enough data to cluster usefully.
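If you go the fastText route, its supervised mode expects one example per line with labels prefixed by `__label__` (the documented default prefix). This sketch just writes the BMG codes in that format; the file name and rows are made up.

```python
rows = [
    ("14060103", "Electrical screwdriver calibration"),
    ("10020100", "Engineering Support 9903211"),
]
# One "__label__<code> <text>" line per example.
with open("bmg_train.txt", "w", encoding="utf-8") as f:
    for code, text in rows:
        f.write(f"__label__{code} {text}\n")
# Training would then be:
#   import fasttext
#   model = fasttext.train_supervised("bmg_train.txt")
```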

polm23
  • Each description (row) has up to 40 characters, not words. The data is customer historical spend data with more than 500,000 rows. There are about 230 Buyer Material Groups distributed over the 500,000 rows, which means each Buyer Material Group occurs more than once. The free-text descriptions are something like "Electrical screwdriver calibration", "Engineering Support 9903211", "First Aid course Katarzyna (27/11/17)", in more than 5 languages including Chinese. The BMG column is numeric: "14060103", "10020100", etc. Can I still use fastText or word2vec for this classification problem? Thanks – yolo25 May 07 '20 at 10:54
  • Ah OK, so the Buyer Material Groups are your labels or classes. Yes, you can use fasttext or word2vec, though working with short documents like that may be more challenging. I would try the basic FastText text classification and see how it goes. A related problem for very short text is line-item classification on shopping receipts, which may be a useful search term. https://github.com/facebookresearch/fastText/tree/master/python#text-classification-model – polm23 May 07 '20 at 14:15
  • See here for an example of classifying short text from receipts. https://medium.com/blogaboutgoodscompany/receipt-labels-classification-word2vec-and-cnn-approach-9233f599c2aa – polm23 May 08 '20 at 03:54