Which supervised classifiers in scikit-learn are recommended for large datasets?

Question

There are many supervised classifier algorithms available in scikit-learn, but I couldn't find any information about their scalability to large datasets. I know that, for instance, support vector machines don't behave well with huge datasets, but what about others? Which supervised/semi-supervised classifier algorithms are most suitable for large datasets?

For example: everything based on Stochastic gradient descent: ```SGDClassifier``` (includes linear SVM) and probably most of ```linear_model``` if the right methods are chosen (docs). Also ```LinearSVC```. But *huge* is subjective. — sascha, Oct 23 '17 at 11:27
Cf. also the [scikit-learn algorithm cheat sheet](http://scikit-learn.org/stable/tutorial/machine_learning_map/index.html). — σηγ, Oct 23 '17 at 17:22

score 0 · Answer 1 · answered Oct 23 '17 at 08:29

By huge datasets, do you mean something like the default "iris" dataset?

It depends on what you want to do with those algorithms (training and fitting, for example). I'm going to write down the ones I use for BIG datasets; they work fine.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn import datasets, svm
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import SGDRegressor

But of course you need to know what you want to do with them. You can find everything you want to know about these and many more here: http://scikit-learn.org/stable/

score 0 · Accepted Answer · answered Oct 24 '17 at 07:12

If you are specifically looking for classifiers in sklearn, have a look at this link: Scaling Strategies for large datasets.

Generally, these classifiers learn incrementally, processing your dataset in mini-batches rather than all at once. Here are some links for reference:

Incremental Learning links

You can have a look at these classifiers in sklearn for more info.
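As a rough sketch of what mini-batch incremental learning can look like, here is ```SGDClassifier``` trained with ```partial_fit```. The synthetic data, the batch size, and the ```iter_minibatches``` helper are assumptions made up for illustration; they are not part of scikit-learn or of the linked pages.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic stand-in for a dataset too large to load at once (assumption).
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)
X_train, y_train = X[:8_000], y[:8_000]
X_test, y_test = X[8_000:], y[8_000:]


def iter_minibatches(X, y, batch_size):
    """Yield consecutive (X_batch, y_batch) slices. Hypothetical helper."""
    for start in range(0, X.shape[0], batch_size):
        yield X[start:start + batch_size], y[start:start + batch_size]


clf = SGDClassifier(random_state=0)

# partial_fit must be told every class label on the first call,
# because a single mini-batch may not contain all of them.
classes = np.unique(y_train)

for X_batch, y_batch in iter_minibatches(X_train, y_train, batch_size=1_000):
    clf.partial_fit(X_batch, y_batch, classes=classes)

accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

In a real out-of-core setting the batches would more likely come from disk, for example by reading a CSV file in chunks, so that only one batch is in memory at a time.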

If your data arrives as a stream, you can have a look at Apache Spark Streaming and at MLlib in Apache Spark for more info.

You can also have a look at FeatureHasher for large-scale feature hashing in sklearn.
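A minimal sketch of feature hashing with ```FeatureHasher``` follows. The records and the choice of ```n_features``` are invented for illustration. The hasher keeps no vocabulary, so it can be applied batch by batch to data that does not fit in memory.

```python
from sklearn.feature_extraction import FeatureHasher

# Made-up records; in practice these would be streamed from disk or a queue.
records = [
    {"country": "DE", "browser": "firefox", "clicks": 3},
    {"country": "US", "browser": "chrome", "clicks": 1},
    {"country": "US", "browser": "safari", "clicks": 7},
]

# n_features fixes the output width up front, so no vocabulary is stored.
hasher = FeatureHasher(n_features=2**10, input_type="dict")

# The hasher is stateless: transform can be called without fitting first.
X_hashed = hasher.transform(records)

print(X_hashed.shape)  # (3, 1024), returned as a SciPy sparse matrix
```

The trade-off is that distinct features can collide in the same column, so a larger ```n_features``` reduces collisions at the cost of a wider (but still sparse) matrix.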

answered Oct 24 '17 at 07:12

Gambit1614


@https://stackoverflow.com/users/8160718/mohammed-kashif thank you, this is a great source of information! What about semi-supervised classifiers (LabelPropagation and LabelSpreading), do they behave well with large datasets? – zlatko Oct 24 '17 at 09:16
@zlatko You're welcome! I am not sure about semi-supervised classifiers, though; I will need to look that up. I will update you once I find something relevant. – Gambit1614 Oct 24 '17 at 09:17
How do these incremental classifiers behave if the standard fit method is used instead of partial_fit? Is there any performance difference compared to other algorithms if partial_fit is not used? – zlatko Oct 24 '17 at 12:10
