
I am classifying a client's clients. However, the data is fluid and the clusters can change every day.

Re-running K-means daily to update the user clusters is difficult because K-means is inconsistent in how it labels the clusters.

What if we cluster once, then train a classifier (say a neural network or XGBoost) on the cluster labels, and going forward simply predict the clusters? Does this make sense, and is it a good way to do things?

acacia
  • If I understood your problem, you want to train a model (classifier) that will work in the real world, and you want to update it every day? – Ankish Bansal Jan 17 '19 at 14:43
  • The users are the same, but their activity changes hourly and daily; I want to reclassify them. – acacia Jan 17 '19 at 21:26

1 Answer


Yes, it does make sense; at that point it's just a regular classification task. You should have enough data assigned to clusters, though, before moving on to a neural network.

On the other hand, why don't you predict clusters for new points instead of updating them? (You can see the separate fit and predict methods in sklearn's docs, though it depends on the technology you are using.) Remember that the neural network will only be as good as its input (the K-means clusters), and its predictions will probably be similar to K-means's.
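To make the fit/predict distinction concrete, here is a minimal sketch with sklearn's KMeans; the data is a toy stand-in for your client metrics, and the feature values are invented for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy historical data standing in for client activity features (hypothetical)
X_history = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.9, 8.3]])

# Fit once; fixing random_state (and n_init) makes the fit reproducible
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X_history)

# New daily activity: predict assigns each point to the nearest existing
# centroid, so labels stay consistent with the original fit -- no re-clustering
X_today = np.array([[1.1, 2.1], [8.2, 8.0]])
labels = kmeans.predict(X_today)
```

Because predict only measures distance to the already-learned centroids, the label meaning cannot flip between days the way it can when you re-run the clustering from scratch.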

Furthermore, NNs are more complicated and harder to train; maybe they shouldn't be your first choice.

You could check the idea of fuzzy clustering as well; as the data is fluid, it might be a better fit for your case. Autoencoders, as a method of obtaining latent variables, might also be of use.
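As a rough sketch of the soft-membership idea (not fuzzy c-means itself, but a close stand-in available in sklearn), a Gaussian mixture model's predict_proba gives each point a probability of belonging to every cluster; the data below is invented for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data: two tight groups plus one ambiguous point in between (hypothetical)
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.9, 8.3], [4.5, 5.0]])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft membership: one row per point, one column per cluster, rows sum to 1.
# Points with split probabilities are candidates for moving between clusters.
probs = gm.predict_proba(X)
```

A point whose membership is split (e.g. 0.6/0.4) is exactly the kind of client likely to drift to another cluster as its activity changes, which is the signal a hard K-means label hides.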

Szymon Maszke
  • Have sufficient data: 388k of 22m rows. Going to run logistic, XGBoost, LGBM, then neural. Then test, tune, and possibly blend. – acacia Jan 17 '19 at 11:58
  • The clients transact daily, and their metrics and attributes change and move between clusters! – acacia Jan 17 '19 at 12:02
  • Why go all in just to learn K-means clustering? Even more so when you have no metrics to verify those models' performance? – Szymon Maszke Jan 17 '19 at 12:02
  • So check fuzzy clustering first; it will give probabilities of belonging to each cluster, and this prediction might indicate possible movements between clusters. – Szymon Maszke Jan 17 '19 at 12:04
  • K-means gives me an idea of the best separations of the entire DB. I use it to set the clusters with labels of my choice. In future I don't want to recluster with K-means, just predict the labels, because K-means is inconsistent with labels and, of course, the random initialization of centroids. It was unlabeled data; I'm labelling it and then learning it. – acacia Jan 17 '19 at 12:05
  • 1. K-means does not give an idea of the best separation. It's a rather naive method; furthermore, you should define 'best' when it comes to clustering. 2. If you set the labels with K-means and then use models to predict those labels, they will, at best, learn to reproduce K-means's predictions. Why not use K-means's predictions instead? Check the link I provided in my answer. 3. Random initialization of centroids: yes, but to a lesser degree; check K-means++ initialization. 4. I repeat: you don't have to 'recluster' with K-means; once it's learned, you can use it to perform predictions. – Szymon Maszke Jan 17 '19 at 12:11
  • Let me try saving K-means with pickle; I've seen it here: https://scikit-learn.org/stable/modules/model_persistence.html – acacia Jan 17 '19 at 12:36
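The persistence idea from the last comment can be sketched like this: fit K-means once, pickle the fitted model, and have the daily job load it and call predict, so the label-to-centroid mapping never changes between runs. (The filename and toy data are hypothetical; sklearn's model-persistence docs also suggest joblib as an alternative to plain pickle.)

```python
import pickle
import numpy as np
from sklearn.cluster import KMeans

# One-off fit on historical data (toy values standing in for client metrics)
X = np.array([[1.0, 2.0], [1.2, 1.9], [8.0, 8.1], [7.9, 8.3]])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

# Persist the fitted model; hypothetical filename
with open("kmeans.pkl", "wb") as f:
    pickle.dump(kmeans, f)

# Tomorrow's job: load and predict with the same centroids and labels
with open("kmeans.pkl", "rb") as f:
    restored = pickle.load(f)

same_labels = np.array_equal(kmeans.predict(X), restored.predict(X))
```

Since the restored model carries the original centroids, its labels agree with the original fit, which removes the label-inconsistency problem raised in the question.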