Python module providing a bridge between Scikit-Learn’s Machine Learning methods and pandas-style DataFrames
Questions tagged [sklearn-pandas]
1336 questions
13
votes
3 answers
Why shouldn't the sklearn LabelEncoder be used to encode input data?
The docs for sklearn.LabelEncoder start with
This transformer should be used to encode target values, i.e. y, and not the input X.
Why is this?
I post just one example of this recommendation being ignored in practice, although there seems to be…

hlud6646
- 399
- 2
- 10
13
votes
3 answers
difference between LinearRegression and svm.SVR(kernel="linear")
First, there are questions on this forum very similar to this one, but trust me, none of them matches, so please do not mark this as a duplicate.
I have encountered two methods of linear regression using scikit's sklearn and I am failing to understand the difference between the…

Dev_Man
- 847
- 1
- 10
- 28
13
votes
4 answers
feature_names must be unique - Xgboost
I am running the xgboost model for a very sparse matrix.
I am getting this error. ValueError: feature_names must be unique
How can I deal with this?
This is my code.
yprob = bst.predict(xgb.DMatrix(test_df))[:,1]

user2728024
- 1,496
- 8
- 23
- 39
13
votes
2 answers
How to do Onehotencoding in Sklearn Pipeline
I am trying to one-hot encode the categorical variables of my Pandas dataframe, which includes both categorical and continuous variables. I realise this can be done easily with the pandas .get_dummies() function, but I need to use a pipeline so I can…

Desiré De Waele
- 152
- 1
- 1
- 10
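One way to approach this kind of problem is sketched below. It is not taken from the thread: the column names `color` and `size`, the toy data, and the choice of `LogisticRegression` as the final step are all assumptions made for illustration. It routes the categorical column through `OneHotEncoder` and the continuous column through `StandardScaler` using `ColumnTransformer`, all inside a single `Pipeline`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one categorical and one continuous column.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "green"],
    "size": [1.0, 2.5, 0.7, 3.1, 2.2, 2.9],
})
y = [0, 1, 0, 1, 1, 1]

# Encode the categorical column and scale the continuous one.
preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"), ["color"]),
    ("continuous", StandardScaler(), ["size"]),
])

# The estimator at the end is an arbitrary choice for this sketch.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression()),
])
pipeline.fit(df, y)

# Inspect what the fitted preprocessing step produces.
encoded = pipeline.named_steps["preprocess"].transform(df)
predictions = pipeline.predict(df)
```

Because the encoder lives inside the pipeline, the same categories learned during `fit` are reused at prediction time, which is the usual reason to prefer this over calling `.get_dummies()` separately on train and test data.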
13
votes
1 answer
Adding pandas columns to a sparse matrix
I have additional derived values for X variables that I want to use in my model.
XAll = pd_data[['title','wordcount','sumscores','length']]
y = pd_data['sentiment']
X_train, X_test, y_train, y_test = train_test_split(XAll, y, random_state=1)
As I…

Bonson
- 1,418
- 4
- 18
- 38
13
votes
1 answer
How to change particular column value when defined mask is true?
I have a dataframe in which I have these column names
'team1',
'team2',
'city',
'date'.
What I want to do is assign the value 'dubai' to 'city' when a certain condition is met (which I am defining using a mask).
This is what I am doing exactly:
…

Pankaj Mishra
- 550
- 6
- 18
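A minimal sketch of masked assignment with `DataFrame.loc` follows. The column names come from the question, but the rows and the mask condition (a missing city) are hypothetical, since the excerpt does not show the asker's actual condition.

```python
import pandas as pd

# Hypothetical rows using the column names from the question.
df = pd.DataFrame({
    "team1": ["A", "B", "C"],
    "team2": ["B", "C", "A"],
    "city": ["x", None, "y"],
    "date": ["2014-04-16", "2014-04-17", "2014-04-18"],
})

# Assumed condition for illustration only: the city is missing.
mask = df["city"].isna()

# Assign through .loc with the boolean mask and the column label in one step.
df.loc[mask, "city"] = "dubai"
```

Selecting rows and the column in a single `.loc` call writes to the original frame, whereas chained indexing such as `df[mask]["city"] = ...` may write to a temporary copy.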
12
votes
4 answers
Standardize some columns in Python Pandas dataframe?
The Python code below only returns an array, but I want the scaled data to replace the original data.
from sklearn.preprocessing import StandardScaler
df = StandardScaler().fit_transform(df[['cost', 'sales']])
df
output
array([[ 1.99987622,…

BigData
- 397
- 2
- 3
- 13
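One possible way to keep the DataFrame is to assign the scaled array back to the selected columns. The sketch below uses the `cost` and `sales` column names from the question; the values and the extra `region` column are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data; only 'cost' and 'sales' appear in the question.
df = pd.DataFrame({
    "cost": [1.0, 2.0, 3.0, 4.0],
    "sales": [10.0, 20.0, 30.0, 40.0],
    "region": ["a", "b", "c", "d"],
})

cols = ["cost", "sales"]

# fit_transform returns a NumPy array; assigning it to df[cols]
# overwrites just those columns and leaves the others untouched.
df[cols] = StandardScaler().fit_transform(df[cols])
```

Here `df` stays a DataFrame, the two scaled columns have zero mean and unit (population) standard deviation, and `region` is unchanged.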
12
votes
3 answers
Sklearn error : predict(x,y) takes 2 positional arguments but 3 were given
I am building a multivariate regression analysis with sklearn, and I have looked thoroughly at the documentation. When I run the predict() function I get the error: predict() takes 2 positional arguments but 3 were given
X is a data frame, y…

GD_N
- 153
- 1
- 2
- 13
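The error message in the title is what Python raises when `predict` receives an extra argument. As a hedged illustration (the arrays are invented, and the asker's full code is not shown in the excerpt), the sketch below fits a `LinearRegression` with both `X` and `y`, then calls `predict` with `X` only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical training data.
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0]])
y = np.array([3.0, 2.5, 4.5, 7.0])

# fit() needs both the features and the target.
model = LinearRegression().fit(X, y)

# predict() takes the features only.
predictions = model.predict(X)

# Passing y as well reproduces the kind of TypeError from the title.
try:
    model.predict(X, y)
    extra_argument_rejected = False
except TypeError:
    extra_argument_rejected = True
```

The "2 positional arguments" in the message are `self` and `X`; the third is the extra `y`.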
12
votes
1 answer
What's the difference between sklearn Pipeline and DataFrameMapper?
Sklearn Pipeline: http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
DataFrameMapper: https://github.com/paulgb/sklearn-pandas
What's the difference between them?
It seems to me that sklearn pipeline has more features,…

nkhuyu
- 840
- 3
- 9
- 23
12
votes
3 answers
How to label the feature importance with forests of trees?
I use sklearn to plot the feature importance for forests of trees. The dataframe is named 'heart'. Here is the code to extract the list of the sorted features:
importances = extc.feature_importances_
indices =…

ElenaPhys
- 443
- 2
- 5
- 16
11
votes
2 answers
Appending arrays to dataframe (python)
I ran a time series model on a small sales data set and forecast sales for the next 12 periods with the following code:
mod1=ARIMA(df1, order=(2,1,1)).fit(disp=0,transparams=True)
y_future=mod1.forecast(steps=12)[0]
where df1 contains the…

IndigoChild
- 842
- 3
- 11
- 29
11
votes
1 answer
Difference between model score() vs r2_score
I am training a LinearRegression() model and trying to gauge its prediction accuracy
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
regr_rf = LinearRegression()
regr_rf.fit(df[features],df['label'])
y_rf…

David
- 4,634
- 7
- 35
- 42
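For scikit-learn regressors, `score(X, y)` returns the coefficient of determination of `predict(X)` against `y`, so it should agree with `r2_score(y_true, y_pred)` when the arguments are passed in that order. The sketch below checks this on synthetic data; the data and variable names are assumptions, not the asker's.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -2.0]) + rng.normal(scale=0.1, size=50)

model = LinearRegression().fit(X, y)

# R^2 computed by the estimator itself.
score_value = model.score(X, y)

# R^2 computed from explicit predictions: true values first, predictions second.
r2_value = r2_score(y, model.predict(X))
```

A common source of mismatch is swapping the two arguments of `r2_score`, because the metric is not symmetric in `y_true` and `y_pred`.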
11
votes
2 answers
How to search for a string value within a specific column of a pandas dataframe and, if present, output the row that contains it?
I wish to search a database that I have in a .pkl file.
I have loaded the .pkl file and stored it in a variable named load_data.
Now, I need to accept a string input using raw input and search for the string in one specific column 'SMILES' of my…

Devarshi Sengupta
- 121
- 1
- 1
- 4
11
votes
8 answers
Error when trying to import sklearn modules : ImportError: DLL load failed: The specified module could not be found
I tried the following imports for a machine learning project:
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression
I got this error message:
Traceback (most recent call last):
File…

Taha Abdelhalim Nakabi
- 123
- 1
- 1
- 6
11
votes
4 answers
GridSearchCV: "TypeError: 'StratifiedKFold' object is not iterable"
I want to perform GridSearchCV on a RandomForestClassifier, but the data is not balanced, so I use StratifiedKFold:
from sklearn.model_selection import StratifiedKFold
from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import…

user183897
- 111
- 1
- 1
- 4
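The imports in the excerpt mix `sklearn.model_selection` with the older `sklearn.grid_search` module, which was removed in later scikit-learn releases. Whether that mix is the cause of the asker's error cannot be confirmed from the truncated excerpt, but a sketch that takes both `GridSearchCV` and `StratifiedKFold` from `sklearn.model_selection` looks like this. The dataset and parameter grid are invented for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced data set (roughly 80% / 20%).
X, y = make_classification(
    n_samples=120, n_features=6, weights=[0.8, 0.2], random_state=0
)

# The splitter object is passed directly; GridSearchCV calls split() itself.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10]},  # small grid, arbitrary values
    cv=cv,
)
search.fit(X, y)
best_params = search.best_params_
```

With both classes coming from the same module, the `cv` object is accepted as-is and each fold keeps approximately the class proportions of the full data set.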