Questions tagged [scikit-learn]

scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is accessible to everybody and reusable in various contexts. It is built on NumPy, SciPy, and matplotlib. The project is open source and commercially usable (BSD license).

Related Libraries

  • sklearn-pandas - bridge library between scikit-learn and pandas
  • scikit-image - scikit-learn-compatible API for image processing and computer vision for machine learning tasks
  • sklearn laboratory - scikit-learn wrapper that enables running larger scikit-learn experiments and feature sets
  • sklearn deap - scikit-learn wrapper that enables hyperparameter tuning using evolutionary algorithms instead of grid search in scikit-learn
  • hyperopt-sklearn - hyperparameter optimization for scikit-learn
  • scikit-plot - visualization library for quickly generating common plots in machine learning studies
  • sklearn-porter - library for turning trained scikit-learn models into compiled C, Java, or JavaScript code
  • sklearn_theano - scikit-learn-compatible objects (estimators, transformers, and datasets) using Theano internally
  • sparkit-learn - scikit-learn API that uses PySpark's distributed computing model
  • joblib - scikit-learn parallelization library
28024 questions
187 votes, 9 answers

What is the difference between 'transform' and 'fit_transform' in sklearn?

In the scikit-learn toolbox, sklearn.decomposition.RandomizedPCA has two methods, transform and fit_transform. The descriptions of the two methods are as follows. But what is the difference between them?
tqjustc
  • 3,624
  • 6
  • 27
  • 42
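A minimal sketch of the distinction the question asks about. It uses PCA rather than RandomizedPCA (an assumption made here, since RandomizedPCA is not available in current scikit-learn releases), and the random data is purely illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative data (an assumption, not taken from the question).
rng = np.random.RandomState(0)
X_train = rng.normal(size=(100, 5))
X_new = rng.normal(size=(10, 5))

# Two steps: fit() learns the components, transform() projects data onto them.
pca_two_step = PCA(n_components=2, svd_solver="full").fit(X_train)
projected_two_step = pca_two_step.transform(X_train)

# One step: fit_transform() learns the components and projects the same data.
pca_one_step = PCA(n_components=2, svd_solver="full")
projected_one_step = pca_one_step.fit_transform(X_train)

# Unseen data is only transformed, reusing the components learned above.
new_projected = pca_one_step.transform(X_new)
```

In this sketch both routes give the same projection of the training data, and only transform is applied to data the model has not been fitted on.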
176 votes, 2 answers

How does the class_weight parameter in scikit-learn work?

I am having a lot of trouble understanding how the class_weight parameter in scikit-learn's Logistic Regression operates. The Situation I want to use logistic regression to do binary classification on a very unbalanced data set. The classes are…
kilgoretrout
  • 3,547
  • 5
  • 31
  • 46
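As a rough illustration of what class_weight="balanced" computes, the sketch below compares scikit-learn's helper with the documented heuristic n_samples / (n_classes * np.bincount(y)). The 90/10 label split is an assumed toy example, not the asker's data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Assumed toy labels: 90 negatives and 10 positives.
y = np.array([0] * 90 + [1] * 10)
classes = np.unique(y)

# Weights scikit-learn derives for class_weight="balanced".
balanced = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same heuristic written out by hand.
manual = len(y) / (len(classes) * np.bincount(y))
```

The rarer class receives the larger weight, so its errors count more in the loss. The same weights could also be passed explicitly as a dict, for example class_weight={0: 0.56, 1: 5.0}.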
170 votes, 10 answers

RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility

I get this error when trying to load a saved SVM model. I have tried uninstalling sklearn, NumPy and SciPy, and reinstalling the latest versions of all of them together (using pip). I am still getting this error. Why? In [1]: import sklearn; print…
Blue482
  • 2,926
  • 5
  • 29
  • 40
168 votes, 28 answers

How to convert a Scikit-learn dataset to a Pandas dataset

How do I convert data from a Scikit-learn Bunch object to a Pandas DataFrame? from sklearn.datasets import load_iris import pandas as pd data = load_iris() print(type(data)) data1 = pd. # Is there a Pandas method to accomplish this?
SANBI samples
  • 2,058
  • 2
  • 14
  • 20
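One possible way to do the conversion, sketched under the assumption that the Bunch exposes data, feature_names, and target (as the one returned by load_iris does):

```python
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()

# Build the frame from the feature matrix and reuse the Bunch's column names.
df = pd.DataFrame(data.data, columns=data.feature_names)

# Attach the labels as an extra column.
df["target"] = data.target
```

This keeps the feature names as column labels, which a bare NumPy array would lose.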
161 votes, 9 answers

Can anyone explain StandardScaler to me?

I am unable to understand the StandardScaler page in the sklearn documentation. Can anyone explain this to me in simple terms?
nitinvijay23
  • 1,781
  • 3
  • 13
  • 11
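In simple terms, StandardScaler subtracts each column's mean and divides by its standard deviation. A small sketch with made-up numbers (the array is an assumption for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two features on very different scales (illustrative values).
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# The same computation by hand: (x - column mean) / column standard deviation.
manual = (X - X.mean(axis=0)) / X.std(axis=0)
```

After scaling, every column has mean 0 and standard deviation 1, so no feature dominates just because of its units.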
161 votes, 6 answers

Parameter "stratify" from method "train_test_split" (scikit-learn)

I am trying to use train_test_split from the scikit-learn package, but I am having trouble with the stratify parameter. Here is the code: from sklearn import cross_validation, datasets X = iris.data[:,:2] y =…
Daneel Olivaw
  • 2,077
  • 4
  • 15
  • 23
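A hedged sketch of stratified splitting on the iris data. It imports from sklearn.model_selection, because the sklearn.cross_validation module used in the excerpt was removed in later releases; the test_size and random_state values are assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data[:, :2]
y = iris.target

# stratify=y keeps the class proportions of y in both the train and test parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Per-class counts in the test split.
test_counts = np.bincount(y_test)
```

With 50 samples per class and a 20% test split, each class contributes 10 samples to the test set.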
156 votes, 3 answers

How can I plot a confusion matrix?

I am using scikit-learn to classify 22,000 text documents into 100 classes. I use scikit-learn's confusion matrix method for computing the confusion matrix. model1 = LogisticRegression() model1 = model1.fit(matrix, labels) pred =…
minks
  • 2,859
  • 4
  • 21
  • 29
149 votes, 4 answers

What exactly is sklearn.pipeline.Pipeline?

I can't figure out exactly how sklearn.pipeline.Pipeline works. There are a few explanations in the docs. For example, what do they mean by: Pipeline of transforms with a final estimator. To make my question clearer, what are steps? How do they…
farhawa
  • 10,120
  • 16
  • 49
  • 91
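To make "steps" concrete, here is a small sketch: a list of (name, object) pairs where every step but the last is a transformer and the last is an estimator. The step names "scale" and "clf" and the choice of estimators are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Each step is a (name, object) pair; the final step is the estimator.
pipe = Pipeline(
    steps=[
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# fit() calls fit_transform on "scale", then fit on "clf" with the scaled data.
pipe.fit(X, y)

# predict() calls transform on "scale", then predict on "clf".
predictions = pipe.predict(X)
```

Individual steps remain reachable by name, for example pipe.named_steps["scale"].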
148 votes, 2 answers

Logistic regression python solvers' definitions

I am using the logistic regression function from sklearn, and was wondering what each of the solvers is actually doing behind the scenes to solve the optimization problem. Can someone briefly describe what "newton-cg", "sag", "lbfgs" and "liblinear"…
Clement
  • 1,630
  • 3
  • 12
  • 10
147 votes, 11 answers

How to use sklearn fit_transform with pandas and return dataframe instead of numpy array?

I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want. features = df[["col1", "col2", "col3",…
Louic
  • 2,403
  • 3
  • 19
  • 34
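One common workaround, sketched with an assumed two-column frame, is to wrap the returned array in a new DataFrame that reuses the original columns and index:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Illustrative frame; the values and the index labels are assumptions.
df = pd.DataFrame(
    {"col1": [1.0, 2.0, 3.0], "col2": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)

scaler = StandardScaler()

# fit_transform returns a NumPy array, so restore the labels explicitly.
scaled = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)
```

Recent scikit-learn releases (1.2 and later) also offer scaler.set_output(transform="pandas"), which may make the wrapping unnecessary.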
143 votes, 5 answers

What are the pros and cons between get_dummies (Pandas) and OneHotEncoder (Scikit-learn)?

I'm learning different methods to convert categorical variables to numeric for machine-learning classifiers. I came across the pd.get_dummies method and sklearn.preprocessing.OneHotEncoder() and I wanted to see how they differed in terms of…
O.rka
  • 29,847
  • 68
  • 194
  • 309
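One practical difference can be sketched as follows: OneHotEncoder remembers the categories seen during fit, while pd.get_dummies derives columns from whatever frame it is given. The color values are invented for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up training and test data; "purple" never appears in training.
train = pd.DataFrame({"color": ["red", "green", "blue"]})
test = pd.DataFrame({"color": ["green", "purple"]})

# get_dummies builds columns per frame, so train and test layouts can differ.
dummies_train = pd.get_dummies(train["color"])
dummies_test = pd.get_dummies(test["color"])

# OneHotEncoder keeps the fitted categories (blue, green, red) and, with
# handle_unknown="ignore", encodes unseen values as an all-zero row.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)
encoded_test = encoder.transform(test).toarray()
```

Here the encoder output always has three columns, whereas the get_dummies result for the test frame has columns for green and purple only.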
143 votes, 7 answers

How are feature_importances in RandomForestClassifier determined?

I have a classification task with a time series as the data input, where each attribute (n=23) represents a specific point in time. Besides the absolute classification result, I would like to find out which attributes/dates contribute to the result…
user2244670
  • 1,431
  • 2
  • 10
  • 3
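A short sketch of reading the importances from a fitted forest. It uses the iris data as a stand-in (an assumption; the asker's time-series data is not available). The values in feature_importances_ are impurity-based and normalized to sum to 1.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(iris.data, iris.target)

# One impurity-based score per feature, averaged over all trees in the forest.
importances = forest.feature_importances_

# Pair each score with its feature name and sort from most to least important.
ranked = sorted(
    zip(iris.feature_names, importances), key=lambda pair: pair[1], reverse=True
)
```

Printing ranked shows which attributes the forest relied on most when choosing its splits.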
141 votes, 4 answers

How to compute precision, recall, accuracy and f1-score for the multiclass case with scikit-learn?

I'm working on a sentiment analysis problem. The data looks like this:

label  instances
5      1190
4      838
3      239
1      204
2      127

So my data is unbalanced, since 1190 instances are labeled with 5. For the classification I'm…
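A minimal sketch of multiclass metrics with different averaging modes. The label vectors below are short invented examples, not the asker's data.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Invented true and predicted labels on the same 1-5 scale.
y_true = [5, 5, 5, 4, 4, 3, 1, 2]
y_pred = [5, 5, 4, 4, 4, 3, 1, 5]

accuracy = accuracy_score(y_true, y_pred)

# "macro" averages per-class scores equally; "weighted" weights them by support.
macro = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
weighted = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
```

On unbalanced data the macro average treats rare classes as equally important, while the weighted average is dominated by the frequent ones, so the choice depends on which errors matter.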
140 votes, 4 answers

Sklearn, gridsearch: how to print out progress during the execution?

I am using GridSearch from sklearn to optimize parameters of the classifier. There is a lot of data, so the whole process of optimization takes a while: more than a day. I would like to watch the performance of the already-tried combinations of…
doubts
  • 1,763
  • 2
  • 12
  • 19
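One option is the verbose parameter of GridSearchCV, which prints messages as fits complete. The sketch below uses a small assumed grid on the iris data so it finishes quickly.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Small illustrative grid: 3 values of C times 2 kernels gives 6 candidates.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# Higher verbose values print more detail about each fit as it runs.
search = GridSearchCV(SVC(), param_grid, cv=3, verbose=2)
search.fit(X, y)

best = search.best_params_
```

The exact messages depend on the scikit-learn version. After the search, search.cv_results_ holds the scores of every tried combination.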
138 votes, 4 answers

What are the different use cases of joblib versus pickle?

Background: I'm just getting started with scikit-learn, and read at the bottom of the page about joblib, versus pickle. it may be more interesting to use joblib’s replacement of pickle (joblib.dump & joblib.load), which is more efficient on big…
msunbot
  • 1,871
  • 4
  • 16
  • 16
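For reference, a small sketch of persisting a fitted model with joblib.dump and joblib.load. The file name model.joblib and the SVC model are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC().fit(X, y)

# Write the fitted estimator to disk and read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "model.joblib")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions.
same = bool((restored.predict(X) == model.predict(X)).all())
```

As with pickle, files should only be loaded from trusted sources, since loading can execute arbitrary code.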