
In a pickle...

I have a dataset with >100,000 observations; its columns include CustomerID, VendorID, ProductID, and CatNMap. Here is what it looks like:

[screenshot of the dataset omitted]

As you can see, the values in the first 3 columns (CustomerID, VendorID, ProductID) are unique numerical mappings and would make no sense plotted on an x,y plane (which rules out many classification methods); the last column holds category strings assigned by customers. Now, here is the part that I do not understand and am not sure how to approach...

Goal: predict future CatNMap values for customers. As I see it, the features I have here are not useful; is that true? If they are useful, what method can I use, given that the CatNMap column has >7,000 unique values? Also, how would any method categorize future items if, say, two or more different categories are assigned to the same product by different customers? Do I need to implement a NN for this?

All answers are appreciated!

DGomonov

2 Answers


As I understand it, your goal is to predict CatNMap (your output) based on the first 3 columns (your input features).

As you said before, (CustomerID, VendorID, ProductID) are 3 categorical variables, meaning that their values do not represent a quantity but a category: two consecutive ID values may have nothing to do with each other in meaning. As I see it, the same holds for your output CatNMap.

Having said that, there are several ways to treat categorical variables. In my experience, for your problem I would try One Hot Encoding on all your data (CustomerID, VendorID, ProductID, CatNMap). Even more, if you find it feasible, it may be worth trying embeddings for ProductID and CatNMap instead of One Hot Encoding.
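To make the idea concrete, here is a minimal sketch of one-hot encoding the three ID columns with scikit-learn. The toy ID values are made up for illustration; your real data would come from the dataframe columns.

```python
# Sketch: one-hot encode (CustomerID, VendorID, ProductID) with scikit-learn.
# The toy rows below are made-up stand-ins for the real dataset.
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X = np.array([
    [101, 7, 5001],
    [102, 7, 5002],
    [101, 8, 5001],
])

# handle_unknown='ignore' keeps transform() from failing on IDs that
# never appeared during fitting (such rows get all-zero indicator columns).
enc = OneHotEncoder(handle_unknown='ignore')
X_ohe = enc.fit_transform(X)  # sparse matrix, one column per distinct ID

print(X_ohe.shape)  # 3 rows; 2 customers + 2 vendors + 2 products = 6 columns
```

The sparse output matters at your scale: with >100,000 rows and thousands of distinct IDs, a dense one-hot matrix would be mostly zeros.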

As for which algorithm to use, it's definitely worth a try to train Random Forest and Multi Layer Perceptron models, and compare them after some tuning.

I found this guide useful, as it has some examples, but there are many other resources out there dealing with this topic. You should also take a look at this.

alan.elkin
  • Hey, thanks for the suggestions! Would you say that after One Hot Encoding is done to the dataset, using Random Forest is applicable? Is Random Forest a good choice for datasets with over 500K observations? – DGomonov Feb 14 '20 at 16:17
  • @DGomonov I've used Random Forest with sklearn's OHE without problems, and I think you shouldn't have any either. The more good examples you can show the algorithm, the better it should behave in the real world. Also, you may want to take a look [here](https://datascience.stackexchange.com/questions/26283/how-can-i-fit-categorical-data-types-for-random-forest-classification) – alan.elkin Feb 14 '20 at 18:08

The features seem to be unpredictive of the output. Even if they are predictive, 70,000 classes would need a massive dataset for training. I think the problem will not be solved with conventional methods; let's think of some other ideas.

Mohammed Khalid