A Python module that provides a bridge between scikit-learn's machine learning methods and pandas-style DataFrames.
Questions tagged [sklearn-pandas]
1336 questions
2 votes, 1 answer
Accessing attributes in sklearn pipeline
I'm having trouble accessing attributes of intermediate steps in my sklearn pipeline. Here's my code:
from sklearn.pipeline import make_pipeline, make_union
from sklearn.compose import make_column_transformer
from sklearn.impute import…

mrgoldtech
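For the question above about reaching into intermediate pipeline steps, a minimal sketch follows. The steps used here (`StandardScaler`, `LinearRegression`) and the toy data are assumptions for illustration, since the original pipeline is truncated. It relies on `named_steps`, where `make_pipeline` names each step after its lowercased class name.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, assumed for illustration only
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X, y)

# make_pipeline derives step names from the lowercased class names
scaler = pipe.named_steps["standardscaler"]
model = pipe.named_steps["linearregression"]

# Fitted attributes are available on the individual steps after pipe.fit
scaler_mean = scaler.mean_
coefficients = model.coef_
```

Fitted attributes (those ending in an underscore) only exist after the pipeline has been fitted, so accessing them earlier raises an `AttributeError`.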
2 votes, 0 answers
Multinomial naive bayes ValueError: shapes not aligned, only when using chi2 test
I'm trying to make a positive/negative review classifier and wanted to use multinomial naive Bayes (or regular naive Bayes). If I don't do feature selection with SelectKBest and chi2, it works fine. But if I do, I get the following error:
Traceback (most recent call…

user12195705
2 votes, 1 answer
Error Making prediction with python onnxruntime
I have created a very basic decision tree using the sklearn library. The tree is trained on 4 features:
feat1 INT
feat2 INT
feat3 FLOAT
feat4 FLOAT
And the label/target feature is a boolean value (0 or 1).
I converted the tree into an ONNX…

user7432713
2 votes, 1 answer
How to choose data columns and target columns in a dataframe for train_test_split?
I'm trying to set up a train_test_split with data I have read from a CSV into a pandas DataFrame. The book I am reading says I should separate into x_train as the data and y_train as the target, but how can I define which column is the target and…

James
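For the question above, one common way to separate feature columns from the target column before splitting is sketched below. The column names (`feature_a`, `feature_b`, `label`) and the data are hypothetical; the sketch assumes the target lives in a single named column.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; in practice this would come from pd.read_csv(...)
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "feature_b": [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    "label": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
})

# Everything except the target column becomes the feature matrix
X = df.drop(columns=["label"])
# The target column on its own becomes y
y = df["label"]

# Rows of X and y are shuffled together, so they stay aligned
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```

Passing `X` and `y` in the same call keeps each feature row paired with its label in both the training and the test split.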
2 votes, 2 answers
Get prediction confidence through Decision Tree Regression in sklearn
Is there a way I can attach some sort of confidence with my predictions from Decision Tree Regression output in python?
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=0, criterion="mae")
dt_fit =…

ayadav
2 votes, 1 answer
"A column-vector y was passed when a 1d array was expected" error message
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis()
clf.fit(np.matrix(X_train), np.matrix(y_train))
but I get the error message specified above.
I checked the shape of y_train but it's…

financial_physician
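The message quoted in the question above usually appears when `y` has shape `(n, 1)` instead of `(n,)`. A minimal sketch of the usual remedy follows. It assumes plain NumPy arrays rather than `np.matrix` (a matrix always stays two-dimensional, so it cannot be flattened to 1-D), and the data is made up for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up, clearly separable toy data
X_train = np.array([
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.3],
    [1.0, 1.1], [1.2, 0.9], [0.9, 1.3],
])
# A column vector of labels, shape (6, 1)
y_column = np.array([[0], [0], [0], [1], [1], [1]])

# ravel() flattens the column vector to the 1-D shape (6,) that fit() expects
y_flat = y_column.ravel()

clf = LinearDiscriminantAnalysis()
clf.fit(X_train, y_flat)
predictions = clf.predict(X_train)
```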
2 votes, 2 answers
Pandas: get the cumulative sum of a column only if the timestamp is greater than that of another column
For each customer, I would like to get the cumulative sum of a column (Dollar Value) only when Timestamp 1 is less than Timestamp 2. I could do a cartesian join of the values based on Customer or iterate through the dataframe, but wanted to see if…

minnymate
2 votes, 2 answers
Optimize K-Nearest Neighbors Algorithm on 50 variables x 100k row dataset
I want to optimize a piece of code that calculates the nearest neighbour for every item in a given dataset with 100k rows. The dataset contains 50 variable columns, which help describe each row item, and most of the cells contain a…

d_-
2 votes, 2 answers
Pandas Dataframe apply custom function to certain rows with NULL columns
I have a Dataframe that looks like:
------------------------------
|Date | Deal | Country |
------------------------------
|2019-01-02 | ABC | US |
------------------------------
|2019-02-01 | ABC | US …

CodeSsscala
2 votes, 4 answers
What does X in imputer = imputer.fit(X[:,1:3]) stand for, and what does imputer.fit(X[:,1:3]) mean?
I'm preprocessing a data set and I get an error caused by the line
imputer = imputer.fit(X[:,1:3]), which I don't understand. I understand that imputer = Imputer(missing_values = "NaN", strategy = "mean") means replacing missing values with the mean value…

Dulangi_Kanchana
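To illustrate the slice in the question above: `X[:, 1:3]` selects all rows and the columns at index 1 and 2 (the stop index 3 is excluded). The sketch below uses `SimpleImputer` from `sklearn.impute`, which replaced the older `Imputer` class in recent scikit-learn releases. The array contents are invented for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented feature matrix; column 0 is complete, columns 1 and 2 have gaps
X = np.array([
    [0.0, 44.0, 72000.0],
    [1.0, 27.0, np.nan],
    [2.0, np.nan, 54000.0],
    [0.0, 38.0, 61000.0],
])

# All rows (:) and columns 1 and 2 (1:3, stop index excluded)
subset = X[:, 1:3]

# fit() learns the per-column mean of the non-missing values
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer = imputer.fit(subset)

# transform() fills each missing value with its column mean
X[:, 1:3] = imputer.transform(subset)
```

Here `X` is simply the name the code gives to the full feature matrix; `fit` only computes the means, and `transform` is the step that writes them into the gaps.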
2 votes, 1 answer
NameError : name 'metrics' is not defined
I get an error when calculating accuracy metrics. I imported the library for calculating accuracy metrics, but it still gives me an error saying that the name 'metrics' is not defined.
from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect =…

Hafsa Naveed
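A `NameError` like the one in the question above typically means the name `metrics` was never bound in the current namespace. A minimal sketch, assuming the intent was to use scikit-learn's `metrics` module, with made-up labels:

```python
# Importing the submodule binds the name "metrics" in this namespace
from sklearn import metrics

# Made-up true and predicted labels
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# Three of the four predictions match the true labels
accuracy = metrics.accuracy_score(y_true, y_pred)
```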
2 votes, 1 answer
Split list into columns in pandas
I have a dataframe like this
df = (pd.DataFrame({'ID': ['ID1', 'ID2', 'ID3'],
                    'Values': [['AB', 'BC'], np.nan, ['AB', 'CD']]}))
df
ID Values
0 ID1 [AB, BC]
1 ID2 NaN
2 ID3 [AB, CD]
I want to split the item inside…

Hardik Gupta
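One possible reading of the question above is that each list element should land in its own column. The sketch below follows that assumption; the desired output is truncated in the excerpt, and the column names `Value_1`, `Value_2` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["ID1", "ID2", "ID3"],
    "Values": [["AB", "BC"], np.nan, ["AB", "CD"]],
})

# Replace missing entries with empty lists so every row is a list
lists = df["Values"].apply(lambda v: v if isinstance(v, list) else [])

# Build one column per list position; shorter rows are padded with missing values
split = pd.DataFrame(lists.tolist(), index=df.index)
split.columns = [f"Value_{i + 1}" for i in split.columns]

# Attach the new columns to the ID column
result = df[["ID"]].join(split)
```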
2 votes, 1 answer
How to fix Value Error with train_test_split in Python Numpy
I am using sklearn with a numpy array.
I have 2 arrays (x, y) and they should be:
test_size=0.2
train_size=0.8
This is my current code:
def predict():
sample_data = pd.read_csv("includes\\csv.csv")
x = np.array(sample_data["day"])
y =…

python_beginner
2 votes, 2 answers
After choosing k components in PCA, how do we find out which components (column names) the algorithm has selected?
I am new to data science and I need some help understanding PCA. I know that each column constitutes one axis, but when PCA is done and the components are reduced to some value k, how do I know which columns were selected?

Ravi Biradar
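As background to the question above: PCA does not pick a subset of the original columns. Each principal component is a linear combination of all of them, and the weights are exposed in `components_`. A minimal sketch with randomly generated data and hypothetical column names `a` to `d`:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

# Random data with hypothetical column names, for illustration only
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(50, 4)), columns=["a", "b", "c", "d"])

pca = PCA(n_components=2)
pca.fit(data)

# Each row holds the weights of one component across all original columns
loadings = pd.DataFrame(
    pca.components_, columns=data.columns, index=["PC1", "PC2"]
)

# Share of the total variance captured by each kept component
variance_share = pca.explained_variance_ratio_
```

Columns with large absolute weights in a row of `loadings` contribute most to that component, which is often the closest answer to "which columns matter".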
2 votes, 1 answer
How to load this kind of data in pandas
Background: I have logs which are generated during the testing of the devices after manufacture. Each device has a serial number and a corresponding csv log file with all the data. Something like this.
DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019…

NotAgain