Questions tagged [scikit-learn]

Scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining, with a focus on machine learning. It is accessible to everybody and reusable in various contexts. It is built on NumPy and SciPy. The project is open source and commercially usable (BSD license).

scikit-learn is a machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is built on NumPy, SciPy, and matplotlib. The project is open source and commercially usable (BSD license).

Resources

Related Libraries

  • sklearn-pandas - bridge library between scikit-learn and
  • scikit-image - scikit-learn-compatible API for image processing and computer vision for machine learning tasks
  • sklearn laboratory - scikit-learn wrapper that enables running larger scikit-learn experiments and feature sets
  • sklearn deap - scikit-learn wrapper that enables hyper parameter tuning using evolutionary algorithms instead of gridsearch in scikit-learn
  • hyperopt-sklearn - Hyper-parameter optimization for sklearn
  • scikit-plot - visualization library for quickly generating common plots in machine learning studies
  • sklearn-porter - library for turning trained scikit-learn models into compiled , , or code
  • sklearn_theano - scikit-learn-compatible objects (estimators, transformers, and datasets) using internally
  • sparkit-learn - scikit-learn API that uses 's distributed computing model
  • joblib - scikit-learn parallelization library
28024 questions
134
votes
10 answers

how to check which version of nltk, scikit learn installed?

In shell script I am checking whether this packages are installed or not, if not installed then install it. So withing shell script: import nltk echo nltk.__version__ but it stops shell script at import line in linux terminal tried to see in this…
nlper
  • 2,297
  • 7
  • 27
  • 37
133
votes
6 answers

Run an OLS regression with Pandas Data Frame

I have a pandas data frame and I would like to able to predict the values of column A from the values in columns B and C. Here is a toy example: import pandas as pd df = pd.DataFrame({"A": [10,20,30,40,50], "B": [20, 30, 10, 40,…
Michael
  • 13,244
  • 23
  • 67
  • 115
132
votes
3 answers

Why does one hot encoding improve machine learning performance?

I have noticed that when One Hot encoding is used on a particular data set (a matrix) and used as training data for learning algorithms, it gives significantly better results with respect to prediction accuracy, compared to using the original matrix…
maheshakya
  • 2,198
  • 7
  • 28
  • 43
132
votes
13 answers

ImportError in importing from sklearn: cannot import name check_build

I am getting the following error while trying to import from sklearn: >>> from sklearn import svm Traceback (most recent call last): File "", line 1, in from sklearn import svm File…
ayush singhal
  • 1,879
  • 2
  • 18
  • 33
130
votes
8 answers

UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples

I'm getting this weird error: classification.py:1113: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 in labels with no predicted samples. 'precision', 'predicted', average, warn_for)` but then it also prints the f-score the…
Sticky
  • 3,671
  • 5
  • 34
  • 58
130
votes
9 answers

Stratified Train/Test-split in scikit-learn

I need to split my data into a training set (75%) and test set (25%). I currently do that with the code below: X, Xt, userInfo, userInfo_train = sklearn.cross_validation.train_test_split(X, userInfo) However, I'd like to stratify my training…
pir
  • 5,513
  • 12
  • 63
  • 101
122
votes
3 answers

LogisticRegression: Unknown label type: 'continuous' using sklearn in python

I have the following code to test some of most popular ML algorithms of sklearn python library: import numpy as np from sklearn import metrics, svm from sklearn.linear_model import LinearRegression from…
mllamazares
  • 7,876
  • 17
  • 61
  • 89
122
votes
6 answers

Understanding min_df and max_df in scikit CountVectorizer

I have five text files that I input to a CountVectorizer. When specifying min_df and max_df to the CountVectorizer instance what does the min/max document frequency exactly mean? Is it the frequency of a word in its particular text file or is it the…
moeabdol
  • 4,779
  • 6
  • 44
  • 43
122
votes
10 answers

sklearn plot confusion matrix with labels

I want to plot a confusion matrix to visualize the classifer's performance, but it shows only the numbers of the labels, not the labels themselves: from sklearn.metrics import confusion_matrix import pylab as pl y_test=['business', 'business',…
hmghaly
  • 1,411
  • 3
  • 29
  • 47
119
votes
20 answers

Scikit-learn: How to obtain True Positive, True Negative, False Positive and False Negative

My problem: I have a dataset which is a large JSON file. I read it and store it in the trainList variable. Next, I pre-process it - in order to be able to work with it. Once I have done that I start the classification: I use the kfold cross…
116
votes
3 answers

Will scikit-learn utilize GPU?

Reading implementation of scikit-learn in TensorFlow: http://learningtensorflow.com/lesson6/ and scikit-learn: http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html I'm struggling to decide which implementation to…
blue-sky
  • 51,962
  • 152
  • 427
  • 752
115
votes
4 answers

ConvergenceWarning: lbfgs failed to converge (status=1): STOP: TOTAL NO. of ITERATIONS REACHED LIMIT

I have a dataset consisting of both numeric and categorical data and I want to predict adverse outcomes for patients based on their medical characteristics. I defined a prediction pipeline for my dataset like so: X =…
sums22
  • 1,793
  • 3
  • 13
  • 25
113
votes
6 answers

scikit-learn .predict() default threshold

I'm working on a classification problem with unbalanced classes (5% 1's). I want to predict the class, not the probability. In a binary classification problem, is scikit's classifier.predict() using 0.5 by default? If it doesn't, what's the default…
ADJ
  • 4,892
  • 10
  • 50
  • 83
112
votes
10 answers

sklearn: Found arrays with inconsistent numbers of samples when calling LinearRegression.fit()

Just trying to do a simple linear regression but I'm baffled by this error for: regr = LinearRegression() regr.fit(df2.iloc[1:1000, 5].values, df2.iloc[1:1000, 2].values) which produces: ValueError: Found arrays with inconsistent numbers of…
sunny
  • 3,853
  • 5
  • 32
  • 62
111
votes
8 answers

Accuracy Score ValueError: Can't Handle mix of binary and continuous target

I'm using linear_model.LinearRegression from scikit-learn as a predictive model. It works and it's perfect. I have a problem to evaluate the predicted results using the accuracy_score metric. This is my true Data : array([1, 1, 0, 0, 0, 0, 1, 1, 0,…