
I am using an e-commerce dataset to predict product categories. The features are the product description and the supplier code; the target is the product category.

import pandas as pd
from sklearn import preprocessing, ensemble, model_selection
from sklearn.feature_extraction.text import CountVectorizer

# combine description and supplier code into a single text feature
df['joined_features'] = df['description'].astype(str) + ' ' + df['supplier'].astype(str)

# split the dataset into training and validation sets
train_x, valid_x, train_y, valid_y = model_selection.train_test_split(df['joined_features'], df['category'])

# encode the target variable (fit on training labels, only transform validation labels)
encoder = preprocessing.LabelEncoder()
train_y = encoder.fit_transform(train_y)
valid_y = encoder.transform(valid_y)

# count vectorizer object
count_vect = CountVectorizer(analyzer='word')
count_vect.fit(df['joined_features'])

# transform training and validation data
xtrain_count = count_vect.transform(train_x)
xvalid_count = count_vect.transform(valid_x)

classifier = ensemble.RandomForestClassifier()
classifier.fit(xtrain_count, train_y)
predictions = classifier.predict(xvalid_count)

I get ~90% accuracy with this prediction. I now want to predict more categories, which are hierarchical: the category I predicted so far is the top-level one, and I want to predict a couple of levels below it.

As an example, I predicted clothing. Now I want to predict: Clothing -> Shoes

I tried concatenating both labels (df['category1'] + df['category2']) and predicting the combined label as a single class, but I only get around 2% accuracy, which is really low.

What is the proper way to make a classifier in a hierarchical fashion?

Edit: I compiled some fake data for a better understanding:

[sample data screenshot]

From the first row: category value 1 corresponds to Samsung, 3 to electronics, and 7 to TVs.

  • How many unique values are there in `category1` & `category2`? Are there any values in `category2` with two parents (i.e. DAG vs tree)? How balanced are the class sample sizes? – Shihab Shahriar Khan Oct 03 '20 at 00:58
  • @ShihabShahriarKhan The sample sizes are imbalanced: there are many products in some categories, but not many in others. There's only one parent per subcategory, and around 200 unique values for each category. – Snow Oct 03 '20 at 15:27
  • Is there anyway you can post some example data, so that I may come up with an approach for you? – artemis Oct 06 '20 at 16:08
  • @wundermahn Unfortunately I can't post the dataset because it is private. But I can say that the data is not small, around 400k distinct products. The product descriptions are in German, like so: Hdmi Kabel 18Gbit/s 3m. The supplier is given as an integer value. The categories to be predicted are also given as integer values. For example Category1=56, Category2=89, and Category3=60. – Snow Oct 06 '20 at 16:21
  • I can't understand your question. If there is more than one subcategory, then every subcategory is a category, but not the opposite. In this case, your classes should be the subcategories, not the categories themselves. – Yahya Oct 08 '20 at 12:34
  • @Yahya Category2 is a subcategory of Category1. Category3 is a subcategory of Category2. Is this more understandable for you? – Snow Oct 08 '20 at 12:35
  • Okay, so what is the depth of this? And your result is just for the root node in this hierarchy? And in your training, what was Category1 classified against as the y variable? – Yahya Oct 08 '20 at 12:37
  • yes, the result I have is just for category1. I don't have a hierarchical prediction. I did try a flat prediction by merging all categories, but that resulted in very low accuracy, hence the question. – Snow Oct 08 '20 at 12:41
  • Sorry, I still did not get it. A sample of your dataset (or a similar fake dataset) would make it clearer so you can get the help you need. I would suggest creating a few rows in MS Excel (or similar), taking a snapshot, and posting it here. – Yahya Oct 08 '20 at 12:44
  • @Snow We are struggling to understand the data and how the classes relate to each other. Without that, we cannot produce a solution for you. – artemis Oct 08 '20 at 12:50
  • @Yahya okay I'll do that – Snow Oct 08 '20 at 13:06
  • @wundermahn I'll produce a fake dataset with some rows. I hope that will be enough – Snow Oct 08 '20 at 13:07
  • @Snow wonderful, we will be happy to help :) – artemis Oct 08 '20 at 13:14
  • @Yahya I hope it's clearer now – Snow Oct 08 '20 at 13:51
  • @wundermahn does my requested approach fit this classification? I added some sample data – Snow Oct 08 '20 at 13:54
  • @Snow so you previously predicted `category_1`, with 90% accuracy. You now want to do what? – artemis Oct 08 '20 at 14:58
  • @wundermahn now I want to predict also `category_2` and `category_3`. So taking as example the first row, I want to predict `137`, not just `1`. Here's where hierarchical classifying would come in handy – Snow Oct 08 '20 at 15:05
  • @Snow in your example data, 3 always maps to 1, and 7 always maps to 3 – artemis Oct 08 '20 at 15:40
  • @wundermahn yes, this is also the case with the original dataset. `category_1=1` maps to many `category_2` values, but each `category_2` value has only 1 `category_1` – Snow Oct 08 '20 at 15:44
  • Are they always in this order, that is, CategoryN is always a subcategory of CategoryN-1 and maps only to it? For example, 7 --> 3 --> 1? Also, it is very crucial to know how many categories you have (i.e. columns after the attributes). Is there a linear dependency between the categories starting from N --> N-1 --> N-2 --> ... --> 1? Can you run a Spearman correlation between CategoryN and CategoryN-1? – Yahya Oct 08 '20 at 17:41
  • @Yahya sorry for the late reply. They are always in that order, and I have only three categories. – Snow Oct 10 '20 at 15:59
  • Since your dataset is not public, no one can guarantee a 100% working solution. However, according to the info you provided, you should train your classifier on Category3 only, then create a lookup table to map back to Category2 and Category1. For any new category in future, you update both your model and your lookup table. Finally, if your classifier does well on Category1 only, that is not enough; the true performance is on Category3 in particular. Even a hierarchical classifier, in theory, won't give better performance. – Yahya Oct 10 '20 at 16:39
  • By the way, your late reply made you lose the bounty and delayed the help you needed. In any case, per my previous comment, the performance of your current classifier is an illusion if it is mandatory to also know the corresponding Category2 and Category3 in your prediction. – Yahya Oct 10 '20 at 16:42
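The lookup-table idea from the comments can be sketched like this. The DataFrame below is made-up data mirroring the fake sample; since each child category has exactly one parent (a tree, not a DAG), predicting the deepest level determines the other two:

```python
# Predicting category3 alone is enough: each category3 value maps to
# exactly one (category2, category1) pair, so the parents come for free.
import pandas as pd

# hypothetical rows mimicking the fake sample data
df = pd.DataFrame({
    "category1": [1, 1, 2],
    "category2": [3, 4, 5],
    "category3": [7, 8, 9],
})

# build the child -> (parent, grandparent) lookup table once
lookup = (df.drop_duplicates("category3")
            .set_index("category3")[["category2", "category1"]])

# after the classifier predicts a category3 value, map it back up
predicted_cat3 = 7
cat2, cat1 = lookup.loc[predicted_cat3]
```

The lookup table only needs rebuilding when new categories appear, at which point the model is retrained anyway.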

1 Answer


One idea might be to build a model over all of your level-2 categories, but feed the predicted class probabilities for category1 into that model as additional input features.
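A minimal sketch of this idea on tiny made-up data (all texts and label values below are hypothetical, not from the question's dataset):

```python
# Sketch: append the category1 class probabilities to the bag-of-words
# features before training the category2 model.
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

texts = ["hdmi kabel 3m", "usb kabel 1m", "leder schuhe 42", "sport schuhe 44"]
cat1 = [0, 0, 1, 1]      # top-level categories
cat2 = [10, 10, 20, 21]  # subcategories

vect = CountVectorizer(analyzer='word')
X = vect.fit_transform(texts)

# level-1 model; its class probabilities become extra input columns
clf1 = RandomForestClassifier(random_state=0).fit(X, cat1)
X_stacked = hstack([X, csr_matrix(clf1.predict_proba(X))])

# level-2 model sees both the text features and the level-1 "beliefs"
clf2 = RandomForestClassifier(random_state=0).fit(X_stacked, cat2)
preds = clf2.predict(X_stacked)
```

In practice you would generate the level-1 probabilities with cross-validation (e.g. `cross_val_predict`) rather than on the training data itself, to avoid leaking the level-1 labels into the level-2 model.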

Another idea is to train a category2 model only on the rows where category1 == Clothing. Ideally you would have one such multiclass model per category1 value, called conditionally depending on the category1 prediction. Obviously this multiplies the amount of work by the number of category1 values you have.
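This conditional setup is often called a "local classifier per parent node". A sketch on made-up data (all texts and label values are hypothetical):

```python
# Sketch: one category2 model per category1 value, each trained only on
# that parent's rows; prediction routes through the level-1 model first.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

texts = ["hdmi kabel 3m", "usb kabel 1m", "leder schuhe 42", "sport schuhe 44"]
cat1 = np.array([0, 0, 1, 1])      # e.g. 0 = electronics, 1 = clothing
cat2 = np.array([10, 11, 20, 21])  # subcategories of cat1

vect = CountVectorizer(analyzer='word')
X = vect.fit_transform(texts)

# level-1 model decides which subcategory model handles each sample
clf1 = RandomForestClassifier(random_state=0).fit(X, cat1)

# one subcategory model per parent, trained on that parent's rows only
sub_models = {
    parent: RandomForestClassifier(random_state=0).fit(X[cat1 == parent],
                                                       cat2[cat1 == parent])
    for parent in np.unique(cat1)
}

# predict category1 first, then dispatch to the matching submodel
pred1 = clf1.predict(X)
pred2 = np.array([sub_models[p].predict(X[i])[0]
                  for i, p in enumerate(pred1)])
```

Note that mistakes at level 1 propagate down, so this works best when the level-1 model is already strong (as the ~90% accuracy suggests here).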

  • Your first paragraph is what I thought of doing too. But maybe there is already an established way to go about this, like decision graphs or so. – Snow Sep 25 '20 at 07:48