Questions tagged [scikit-learn]

scikit-learn is an open-source machine-learning library for Python that provides simple and efficient tools for data analysis and data mining. It is built on NumPy, SciPy, and matplotlib, and is meant to be accessible to everybody and reusable in various contexts. The project is commercially usable (BSD license).

Resources

Related Libraries

  • sklearn-pandas - bridge library between scikit-learn and pandas
  • scikit-image - scikit-learn-compatible API for image processing and computer vision for machine learning tasks
  • sklearn laboratory - scikit-learn wrapper that enables running larger scikit-learn experiments and feature sets
  • sklearn deap - scikit-learn wrapper that enables hyperparameter tuning using evolutionary algorithms instead of grid search in scikit-learn
  • hyperopt-sklearn - Hyper-parameter optimization for sklearn
  • scikit-plot - visualization library for quickly generating common plots in machine learning studies
  • sklearn-porter - library for turning trained scikit-learn models into compiled C, Java, or JavaScript code
  • sklearn_theano - scikit-learn-compatible objects (estimators, transformers, and datasets) using Theano internally
  • sparkit-learn - scikit-learn API that uses Apache Spark's distributed computing model
  • joblib - scikit-learn parallelization library
28024 questions
85
votes
5 answers

Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer

I want to get feature names after I fit the pipeline. categorical_features = ['brand', 'category_name', 'sub_category'] categorical_transformer = Pipeline(steps=[ ('imputer', SimpleImputer(strategy='constant', fill_value='missing')), …
ResidentSleeper
  • 2,385
  • 2
  • 10
  • 20
85
votes
2 answers

What is the difference between pipeline and make_pipeline in scikit-learn?

I got this from the sklearn webpage: Pipeline: Pipeline of transforms with a final estimator Make_pipeline: Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor. But I still do not understand when I…
Aizzaac
  • 3,146
  • 8
  • 29
  • 61
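A short sketch of the practical difference: `Pipeline` takes explicit `(name, estimator)` pairs, while `make_pipeline` derives each step name from the lowercased class name. The step names `scale` and `clf` are arbitrary choices for illustration.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import StandardScaler

# Explicit constructor: you choose the step names.
explicit = Pipeline(steps=[("scale", StandardScaler()), ("clf", LogisticRegression())])

# Shorthand: step names are generated from the estimator class names.
auto = make_pipeline(StandardScaler(), LogisticRegression())

explicit_names = [name for name, _ in explicit.steps]
auto_names = [name for name, _ in auto.steps]
```

The names matter mostly when addressing parameters, for example `clf__C` versus `logisticregression__C` in a parameter grid.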
85
votes
4 answers

classifiers in scikit-learn that handle nan/null

I was wondering if there are classifiers that handle nan/null values in scikit-learn. I thought random forest regressor handles this but I got an error when I call predict. X_train = np.array([[1, np.nan, 3],[np.nan, 5, 6]]) y_train = np.array([1,…
anthonybell
  • 5,790
  • 7
  • 42
  • 60
84
votes
3 answers

Different result with roc_auc_score() and auc()

I have trouble understanding the difference (if there is one) between roc_auc_score() and auc() in scikit-learn. I'm trying to predict a binary output with imbalanced classes (around 1.5% for Y=1). Classifier model_logit =…
gowithefloww
  • 2,211
  • 2
  • 20
  • 31
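A small sketch showing that the two functions agree when both are fed continuous scores, and that a mismatch can appear when hard 0/1 predictions are passed instead. This is one common cause, not necessarily the one in the question; the labels, scores, and the 0.5 threshold are invented for illustration.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7])  # e.g. predict_proba[:, 1]

# auc() integrates an explicit curve; roc_auc_score() builds the curve itself.
fpr, tpr, _ = roc_curve(y_true, scores)
area_from_curve = auc(fpr, tpr)
area_direct = roc_auc_score(y_true, scores)

# Thresholded labels collapse the curve to a single operating point.
hard_labels = (scores >= 0.5).astype(int)
area_from_labels = roc_auc_score(y_true, hard_labels)
```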
83
votes
12 answers

Impute categorical missing values in scikit-learn

I've got pandas data with some columns of text type. There are some NaN values along with these text columns. What I'm trying to do is to impute those NaN's by sklearn.preprocessing.Imputer (replacing NaN by the most frequent value). The problem is…
night_bat
  • 3,212
  • 5
  • 16
  • 19
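In current releases `sklearn.preprocessing.Imputer` no longer exists; a sketch with its replacement, `sklearn.impute.SimpleImputer`, which accepts string columns with `strategy="most_frequent"`. The DataFrame is a made-up example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical text columns with missing entries.
df = pd.DataFrame({
    "color": ["red", "blue", np.nan, "red"],
    "size": ["S", np.nan, "M", "M"],
})

imputer = SimpleImputer(strategy="most_frequent")
# fit_transform returns a NumPy array, so rebuild the DataFrame afterwards.
filled = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
```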
82
votes
7 answers

SKLearn warning "valid feature names" in version 1.0

I'm getting the following warning after upgrading to version 1.0 of scikit-learn: UserWarning: X does not have valid feature names, but IsolationForest was fitted with feature names. I cannot find in the docs what a "valid feature name" is. How…
Jaume Figueras
  • 968
  • 1
  • 6
  • 6
82
votes
9 answers

The easiest way for getting feature names after running SelectKBest in Scikit Learn

I'm trying to conduct a supervised machine-learning experiment using the SelectKBest feature of scikit-learn, but I'm not sure how to create a new dataframe after finding the best features: Let's assume I would like to conduct the experiment…
Aviade
  • 2,057
  • 4
  • 27
  • 49
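One way to map the selection back to column names is the boolean mask from `get_support()`. A sketch on the iris dataset, which is an assumption here since the question's data is not shown; `as_frame=True` needs scikit-learn 0.23 or later.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# get_support() returns a boolean mask aligned with the input columns.
selected_columns = X.columns[selector.get_support()]
X_selected = X[selected_columns]  # new DataFrame with only the kept features
```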
82
votes
5 answers

Use scikit-learn to classify into multiple categories

I'm trying to use one of scikit-learn's supervised learning methods to classify pieces of text into one or more categories. The predict function of all the algorithms I tried just returns one match. For example I have a piece of text: "Theaters in…
CodeMonkeyB
  • 2,970
  • 4
  • 22
  • 29
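Returning several categories per document is a multi-label problem. A minimal sketch using `MultiLabelBinarizer` with `OneVsRestClassifier`; the texts, labels, and query sentence are all invented for illustration, and the choice of `LinearSVC` is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

texts = [
    "Broadway shows in New York are great",
    "Big Ben is in London",
    "Flights between New York and London",
    "The Thames flows through London",
    "Times Square is in New York",
    "London and New York are both big cities",
]
labels = [
    ["new york"], ["london"], ["new york", "london"],
    ["london"], ["new york"], ["london", "new york"],
]

# One binary indicator column per category.
mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels)

# One binary classifier is trained per column.
model = make_pipeline(TfidfVectorizer(), OneVsRestClassifier(LinearSVC()))
model.fit(texts, Y)

# inverse_transform turns indicator rows back into tuples of label names.
predicted = mlb.inverse_transform(model.predict(["A trip from London to New York"]))
```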
81
votes
6 answers

Scikit Learn SVC decision_function and predict

I'm trying to understand the relationship between decision_function and predict, which are instance methods of SVC (http://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). So far I've gathered that decision function returns pairwise…
Peter Tseng
  • 1,294
  • 1
  • 12
  • 15
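For the two-class case the relationship can be checked directly: a positive decision value corresponds to `classes_[1]` and a negative one to `classes_[0]`. The sketch below uses synthetic data and a linear kernel as assumptions; the multiclass case, which the question is about, involves pairwise votes and is not covered here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=4, random_state=0)
clf = SVC(kernel="linear").fit(X, y)

# Signed distance-like score, one value per sample in the binary case.
scores = clf.decision_function(X)

# Rebuild the predictions from the sign of the score.
from_scores = clf.classes_[(scores > 0).astype(int)]
matches_predict = np.array_equal(from_scores, clf.predict(X))
```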
80
votes
6 answers

Can sklearn random forest directly handle categorical features?

Say I have a categorical feature, color, which takes the values ['red', 'blue', 'green', 'orange'], and I want to use it to predict something in a random forest. If I one-hot encode it (i.e. I change it to four dummy variables), how do I tell…
tkunk
  • 1,378
  • 1
  • 13
  • 19
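`RandomForestClassifier` has no parameter for marking a column as categorical, so the usual workaround is to encode inside a pipeline. A sketch with a one-hot encoded `color` column; the extra `weight` column and the targets are hypothetical.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

X = pd.DataFrame({
    "color": ["red", "blue", "green", "orange", "red", "blue"],
    "weight": [1.0, 2.5, 0.7, 3.1, 1.2, 2.2],  # hypothetical numeric feature
})
y = [0, 1, 0, 1, 0, 1]

# One-hot encode "color"; pass every other column through unchanged.
encoder = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["color"]),
    remainder="passthrough",
)
model = make_pipeline(
    encoder, RandomForestClassifier(n_estimators=50, random_state=0)
).fit(X, y)

# Four dummy columns plus the numeric one reach the forest.
n_encoded = model[:-1].transform(X).shape[1]
```

The forest then treats each dummy column as an independent numeric feature; it does not know the four columns belong together.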
80
votes
11 answers

Principal Component Analysis (PCA) in Python

I have a (26424 x 144) array and I want to perform PCA on it using Python. However, there is no particular place on the web that explains how to achieve this task (there are some sites which just do PCA according to their own - there is no…
khan
  • 7,005
  • 15
  • 48
  • 70
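A minimal sketch with `sklearn.decomposition.PCA`. The random array stands in for the question's (26424 x 144) data, with fewer rows to keep the example quick, and 10 components is an arbitrary choice.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 144))  # placeholder for the real array

pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)  # rows projected onto the first 10 components

# Fraction of the total variance captured by each kept component.
variance_ratio = pca.explained_variance_ratio_
```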
78
votes
3 answers

difference between StratifiedKFold and StratifiedShuffleSplit in sklearn

As the title says, I am wondering what the difference is between StratifiedKFold with the parameter shuffle=True StratifiedKFold(n_splits=10, shuffle=True, random_state=0) and StratifiedShuffleSplit StratifiedShuffleSplit(n_splits=10,…
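A sketch that makes one difference visible: the test folds of `StratifiedKFold` partition the data, so every sample is tested exactly once, while `StratifiedShuffleSplit` draws each test set independently, so test sets may overlap. The `test_size=0.1` argument is an assumption, since the question's call is cut off.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, StratifiedShuffleSplit

X = np.zeros((100, 1))
y = np.array([0] * 50 + [1] * 50)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=0)  # test_size assumed

# Collect every test index produced by each splitter.
kfold_test = np.concatenate([test for _, test in skf.split(X, y)])
shuffle_test = np.concatenate([test for _, test in sss.split(X, y)])

kfold_unique = len(np.unique(kfold_test))      # all 100 samples appear once
shuffle_unique = len(np.unique(shuffle_test))  # repeats make this smaller
```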
78
votes
5 answers

sklearn Logistic Regression "ValueError: Found array with dim 3. Estimator expected <= 2."

I am attempting to solve problem 6 in this notebook. The task is to train a simple model on this data using 50, 100, 1000 and 5000 training samples by using the LogisticRegression model from sklearn.linear_model. lr =…
edwin
  • 1,152
  • 1
  • 13
  • 27
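The error in the title appears when image-like data of shape `(n_samples, height, width)` is passed to an estimator that expects a 2-D matrix. A sketch that triggers it and then flattens each sample; the 28 x 28 shape and the random data are assumptions, not taken from the notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
images = rng.normal(size=(60, 28, 28))  # hypothetical stack of 28x28 images
labels = np.arange(60) % 3

# Fitting on the 3-D array raises the ValueError from the title.
error_message = None
try:
    LogisticRegression().fit(images, labels)
except ValueError as exc:
    error_message = str(exc)

# Flatten every image to one row: (60, 28, 28) -> (60, 784).
flat = images.reshape(len(images), -1)
lr = LogisticRegression(max_iter=200).fit(flat, labels)
predictions = lr.predict(flat[:5])
```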
78
votes
6 answers

Mixing categorical and continuous data in Naive Bayes classifier using scikit-learn

I'm using scikit-learn in Python to develop a classification algorithm to predict the gender of certain customers. Amongst others, I want to use the Naive Bayes classifier but my problem is that I have a mix of categorical data (ex: "Registered…
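One possible approach, sketched under the naive independence assumption: fit `GaussianNB` on the continuous columns and `CategoricalNB` on the integer-encoded categorical columns, then add their log posteriors and subtract the log prior once so it is not counted twice. The synthetic data and variable names are made up, and this is not claimed to be the method from the answers.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

rng = np.random.default_rng(0)
n = 200
y = rng.integers(0, 2, size=n)
# Two continuous features whose mean depends on the class.
X_cont = rng.normal(loc=y[:, None] * 2.0, scale=1.0, size=(n, 2))
# One categorical feature, already encoded as integers 0..2.
X_cat = (rng.integers(0, 3, size=(n, 1)) + y[:, None]) % 3

gnb = GaussianNB().fit(X_cont, y)
cnb = CategoricalNB().fit(X_cat, y)

# log P(y | all) = log P(y | cont) + log P(y | cat) - log P(y) + const
log_prior = np.log(gnb.class_prior_)
joint = gnb.predict_log_proba(X_cont) + cnb.predict_log_proba(X_cat) - log_prior
pred = gnb.classes_[np.argmax(joint, axis=1)]
accuracy = float((pred == y).mean())
```

In practice the categorical columns would first go through something like `OrdinalEncoder`, and categories unseen during fitting need separate handling.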
75
votes
3 answers

Feature/Variable importance after a PCA analysis

I have performed a PCA on my original dataset, and from the compressed dataset transformed by the PCA I have also selected the number of PCs I want to keep (they explain almost 94% of the variance). Now I am struggling with the…
fbm
  • 753
  • 1
  • 6
  • 5
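A common starting point is the `components_` attribute, whose rows hold the weight of each original feature in each principal component. A sketch on the iris dataset (an assumption, the question's data is not shown) that picks the feature with the largest absolute weight per component.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

iris = load_iris()
pca = PCA(n_components=2).fit(iris.data)

# Rows are components, columns are original features.
loadings = np.abs(pca.components_)

# Feature with the largest absolute weight in each component.
top_feature_per_pc = [iris.feature_names[i] for i in loadings.argmax(axis=1)]
```

Absolute weights depend on feature scale, so standardizing the inputs first is often advisable before comparing them.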