0

the goal:

Hey guys, I'm trying to create a classification model in Python to predict when a bike-share station will have too much relative inflow or outflow per hour.

what we're working with:

The first 5 rows of my dataframe (over 200,000 rows in all) look like this, and I've assigned values 0, 1, 2 in the 'flux' column - 0 if no significant action, 1 if too much inflow, 2 if too much outflow.

[image: first 5 rows of the dataframe]

And I'm thinking of using the station_name (over 300 stations), hour of day, and day of week as the predictor variables to classify 'flux'.

the model choice:

What should I go with? Naive Bayes? KNN? Random Forest? anything else that would be a good fit? GDMs? SVMs?

fyi: the baseline prediction of always 0 is pretty high at 92.8%. Unfortunately, the accuracy of logistic regression and of a decision tree is right on par with that and barely improves on it, and KNN just takes forever.

Recommendations from those more experienced with machine learning in dealing with a classification question like this?

SpicyClubSauce
  • I think you should use the `sqrt(abs(level_of_the_tides + distance_from_timesquare- number_of_days_to_fullmoon)/Math.PI)` – Joran Beasley Jul 24 '15 at 23:34
  • @JoranBeasley already ran it. good accuracy, but ROC leaves a bit to be desired. – SpicyClubSauce Jul 24 '15 at 23:36
  • It seems that your data is unbalanced, so you cannot evaluate the model simply by accuracy. – yangjie Jul 25 '15 at 02:02
  • There are tips for this in answers at http://stackoverflow.com/questions/2595176/when-to-choose-which-machine-learning-classifier, http://blog.echen.me/2011/04/27/choosing-a-machine-learning-classifier/ and more deeply at http://nlp.stanford.edu/IR-book/html/htmledition/choosing-what-kind-of-classifier-to-use-1.html. It is difficult to predict in advance what model will work best and possibly a combination of models is better than any single one as was the case for the winner of the Netflix challenge. –  Jul 25 '15 at 02:25
  • @yangjie Could you elaborate on what else you could use to evaluate? Also, I'm trying to lower thresholds to make my model more balanced (baseline predicting accuracy 75%, no luck doing much better w basic models so far). – SpicyClubSauce Jul 25 '15 at 03:55

2 Answers

5

The Azure machine learning team has an article on how to choose algorithms which could help even if you aren't using AzureML. From that article:

How large is your training data? If your training set is small and you're going to train a supervised classifier, machine learning theory says you should stick to a classifier with high bias/low variance, such as Naive Bayes. These have an advantage over low bias/high variance classifiers such as kNN, since the latter tend to overfit, and there are theoretical and empirical results indicating that Naive Bayes does well in such circumstances. Low bias/high variance classifiers are more appropriate if you have a larger training set, because they have a smaller asymptotic error; in these cases a high bias classifier isn't powerful enough to provide an accurate model. Note, though, that better data and good features usually give you a greater advantage than a better algorithm. Also, if you have a very large dataset, classification performance may not be affected as much by the algorithm you use, so in that case it's better to choose your algorithm based on such things as scalability, speed, or ease of use.

Do you need to train incrementally or in batch mode? If you have a lot of data, or your data is updated frequently, you probably want to use Bayesian algorithms, which update well. Neural nets and SVMs are typically trained on the training data in batch mode, although incremental variants exist (for example, training with stochastic gradient descent).
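To make the "update well" point concrete, here is a minimal pure-Python sketch of a categorical Naive Bayes that learns by incrementing counters, so new batches can be folded in without revisiting old data. The class name `IncrementalNaiveBayes`, its `partial_fit`/`predict` methods, and the sample rows are all made up for illustration (the rows only mimic the station_name / hour / day-of-week features from the question); this is not the article's implementation.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Categorical Naive Bayes whose counts can be updated one batch at a time."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # Laplace smoothing strength
        self.class_counts = defaultdict(int)    # label -> number of rows seen
        self.value_counts = defaultdict(int)    # (label, feature index, value) -> count
        self.values_seen = defaultdict(set)     # feature index -> distinct values seen

    def partial_fit(self, rows, labels):
        # Updating only increments counters; earlier batches are never re-read.
        for row, label in zip(rows, labels):
            self.class_counts[label] += 1
            for i, value in enumerate(row):
                self.value_counts[(label, i, value)] += 1
                self.values_seen[i].add(value)

    def predict(self, row):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_label in self.class_counts.items():
            score = math.log(n_label / total)   # log prior
            for i, value in enumerate(row):
                # Count an unseen value as one extra category for smoothing.
                n_values = len(self.values_seen[i] | {value})
                count = self.value_counts.get((label, i, value), 0)
                score += math.log((count + self.alpha) / (n_label + self.alpha * n_values))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical rows: (station_name, hour, day_of_week) with a 'flux' label
model = IncrementalNaiveBayes()
model.partial_fit([("A", 8, "Mon"), ("A", 8, "Tue"), ("B", 14, "Sat")], [2, 2, 0])
model.partial_fit([("B", 15, "Sun"), ("A", 18, "Mon"), ("B", 14, "Sun")], [0, 1, 0])
print(model.predict(("A", 8, "Wed")))
```

The second `partial_fit` call is the whole point: the model after both calls is the same as if it had been trained on all six rows at once.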

Is your data exclusively categorical, exclusively numeric, or a mixture of both kinds? Bayesian works best with categorical/binomial data. Decision trees used as classifiers predict class labels, not numerical values (regression trees are the variant for that).

Do you or your audience need to understand how the classifier works? Bayesian or decision trees are more easily explained. It's much harder to see or explain how neural networks and SVMs classify data.

How fast does your classification need to be generated? Decision trees can be slow when the tree is complex. SVMs, on the other hand, classify more quickly since they only need to determine which side of the "line" your data is on.
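As a rough illustration of the "which side of the line" idea, a linear SVM's prediction reduces to the sign of a dot product plus a bias. The weights and bias below are hypothetical, as if already learned, and the function name `linear_svm_predict` is made up for this sketch. Note this covers the linear case only: a kernel SVM evaluates the kernel against every support vector, so its prediction cost grows with the number of support vectors.

```python
def linear_svm_predict(weights, bias, x):
    """Classify x by which side of the hyperplane w.x + b = 0 it falls on."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score >= 0 else -1


# Hypothetical parameters, as if already learned from training data
weights = [0.5, -1.0]
bias = 0.25

print(linear_svm_predict(weights, bias, [2.0, 0.5]))   # score = 1.0 - 0.5 + 0.25 = 0.75, prints 1
print(linear_svm_predict(weights, bias, [0.0, 1.0]))   # score = 0.0 - 1.0 + 0.25 = -0.75, prints -1
```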

How much complexity does the problem present or require? Neural nets and SVMs can handle complex non-linear classification.

Now, regarding your comment about "fyi: the baseline prediction of always 0 is pretty high at 92.8%": this looks like a case for anomaly detection algorithms. They apply when the classification is highly unbalanced, with one class being an "anomaly" that occurs very rarely, as in credit card fraud detection (true fraud is hopefully a very small percentage of your total dataset). In Azure Machine Learning, we use one-class support vector machine (SVM) and PCA-based anomaly detection algorithms. Hope that helps!

Jennifer Marsman - MSFT
0

With data this unbalanced, evaluate the model with something other than overall accuracy: precision, recall, F1, or a confusion matrix:

http://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html

http://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics

Try different models and choose the best one according to your chosen metrics on a test set.
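To see why accuracy alone misleads here, this is a small sketch that computes per-class precision, recall, and F1 by hand for an "always predict 0" baseline. The labels are made up (93 rows of class 0, 4 of class 1, 3 of class 2) to roughly mirror the imbalance in the question, the helper name `per_class_metrics` is hypothetical, and treating an undefined ratio as 0.0 is a convention assumed for this sketch. scikit-learn's `precision_recall_fscore_support`, linked above, reports the same quantities.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    metrics = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # Assumed convention: an undefined ratio (0/0) is reported as 0.0.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics


# Made-up, heavily unbalanced labels
y_true = [0] * 93 + [1] * 4 + [2] * 3
y_baseline = [0] * 100   # the "always predict 0" baseline

accuracy = sum(t == p for t, p in zip(y_true, y_baseline)) / len(y_true)
metrics = per_class_metrics(y_true, y_baseline, labels=[0, 1, 2])
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

print("accuracy:", accuracy)
for label, (precision, recall, f1) in metrics.items():
    print(label, round(precision, 3), round(recall, 3), round(f1, 3))
print("macro F1:", round(macro_f1, 3))
```

The baseline scores 93% accuracy, yet recall for classes 1 and 2 is zero, which the macro-averaged F1 makes obvious. Comparing candidate models on per-class recall or macro F1 is one way to reward a model that actually finds the rare classes.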

Ibraim Ganiev