Python module providing a bridge between Scikit-Learn’s Machine Learning methods and pandas-style DataFrames
Questions tagged [sklearn-pandas]
1336 questions
4
votes
2 answers
Shuffling multiple columns in a DataFrame
I have a DataFrame like this:
'a' 'b' 'c' 'd' 'e' 'f'
'hello.text' 1 2 'hello2.text' 2 10
'hello3.text' 5 8 'hello4.text' 8 15
Now I need to shuffle or…

Mahdi Asiyabi
- 79
- 1
- 1
- 8
4
votes
1 answer
Pyspark Pandas_UDF erroring with Invalid argument, not a string or column
I created a Pandas UDF, which will input a dataframe, predict and output a dataframe on Primary_Key and Predictions.
schema = StructType([StructField('primary_id', IntegerType()),
StructField('prediction',…

Pawan Kalyan
- 51
- 5
4
votes
2 answers
How to encode a pandas.DataFrame column containing lists using Sklearn.preprocessing
I have a pandas df and some of the columns are lists with data in them and I would like to encode the labels within the lists.
I get this error:
ValueError: Expected 2D array, got 1D array instead:
from sklearn.preprocessing import…
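One possible approach, sketched under assumptions: the excerpt does not show which encoder was used, so this swaps in `MultiLabelBinarizer`, which accepts a column of lists directly and produces one indicator column per distinct label. The column name `tags` and the sample values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical frame: each cell of "tags" holds a list of labels.
df = pd.DataFrame({"tags": [["a", "b"], ["b"], ["c", "a"]]})

# MultiLabelBinarizer takes an iterable of label collections,
# so the 1D column of lists can be passed as-is.
mlb = MultiLabelBinarizer()
encoded = pd.DataFrame(
    mlb.fit_transform(df["tags"]),
    columns=mlb.classes_,  # distinct labels, sorted
    index=df.index,
)
```

If integer codes inside each list are wanted rather than indicator columns, a different encoder would be needed; this sketch only covers the indicator case.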

raceee
- 477
- 5
- 14
4
votes
3 answers
How does iloc[:,1:] work? Can anyone explain the [:,1:] params?
What is the meaning of the lines below? I am especially confused about how iloc[:,1:] works, and also data[:,:1].
data = np.asarray(train_df_mv_norm.iloc[:,1:])
X, Y = data[:,1:],data[:,:1]
Here train_df_mv_norm is a dataframe --
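A small sketch of what those two lines do. The frame contents below are a made-up stand-in for `train_df_mv_norm`; only the slicing is taken from the question. In `[rows, columns]` indexing, `:` means "all rows", `1:` means "from position 1 to the end", and `:1` means "up to but not including position 1".

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for train_df_mv_norm (4 columns, 3 rows).
train_df_mv_norm = pd.DataFrame({
    "id": [10, 11, 12],
    "target": [0.1, 0.2, 0.3],
    "f1": [1.0, 2.0, 3.0],
    "f2": [4.0, 5.0, 6.0],
})

# iloc[:, 1:] keeps every row and drops the first column ("id").
data = np.asarray(train_df_mv_norm.iloc[:, 1:])

# data[:, 1:] is everything after the first remaining column (f1, f2);
# data[:, :1] is only the first remaining column (target), kept 2D.
X, Y = data[:, 1:], data[:, :1]
```

Note that `data[:, :1]` has shape `(n, 1)`, whereas `data[:, 0]` would give a flat array of shape `(n,)`.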

Abhishek
- 1,543
- 3
- 13
- 29
4
votes
2 answers
Pandas - Counting rows in a df to discover the survival rate each day
Hello, guys!
I have a dfA (Table A) containing the number of days that some products have been available (days_survived). I need to count the number of products that were available each day in total (Table B). I mean, I need to count rows in dfA…
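A minimal sketch of one reading of this, since the excerpt is cut off: it assumes a product with `days_survived = k` counts as available on days 1 through k. The sample values and the output column names `day` and `products_available` are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical Table A.
dfA = pd.DataFrame({
    "product": ["a", "b", "c", "d"],
    "days_survived": [1, 3, 3, 5],
})

# For each day d, count the rows whose days_survived is at least d.
days = np.arange(1, dfA["days_survived"].max() + 1)
counts = [int((dfA["days_survived"] >= d).sum()) for d in days]

# Hypothetical Table B.
dfB = pd.DataFrame({"day": days, "products_available": counts})
```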

Thaise
- 1,043
- 3
- 16
- 28
4
votes
2 answers
Too many _coef values for LogisticRegression in Pipeline
I'm making use of the sklearn-pandas DataFrameMapper in a sklearn Pipeline. In order to evaluate feature contribution in a feature union pipeline, I like to measure the coefficients of the estimator (Logistic Regression). For the following code…

Christopher
- 2,120
- 7
- 31
- 58
4
votes
1 answer
Text classification for logistic regression with pipelines
I am trying to use LogisticRegression for text classification. I am using FeatureUnion for the features of the DataFrame and then cross_val_score to test the accuracy of the classifier. However, I don't know how to include the feature with the free…

Paul K
- 123
- 7
4
votes
3 answers
statsmodels OLS giving a TypeError in Python
I am trying to fit a set of features to statsmodels' OLS linear regression model.
I am adding a few features at a time. With the first two features, it works fine. But when I keep adding new features it gives me an error.
Traceback (most recent call…

akalanka
- 553
- 7
- 21
4
votes
2 answers
stratified sample with replacement in python
I have a Pandas DataFrame. I am trying to create a sample DataFrame with replacement and also stratify it.
This lets me sample with replacement:
df_test = df.sample(n=100, replace=True, random_state=42, axis=0)
However, I am not sure how to also stratify. …
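One way this could be sketched, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available) and a hypothetical stratum column named `label`: sample within each group using a common fraction, so group proportions are preserved.

```python
import pandas as pd

# Hypothetical frame with an imbalanced stratum column.
df = pd.DataFrame({
    "label": ["a"] * 6 + ["b"] * 3 + ["c"] * 1,
    "value": range(10),
})

n_total = 100
frac = n_total / len(df)  # same fraction for every stratum

# Sampling per group with replacement keeps the label proportions.
# frac > 1 is only allowed because replace=True.
df_test = df.groupby("label").sample(frac=frac, replace=True, random_state=42)
```

Because each group's size is rounded separately, the total can differ slightly from `n_total` when the group sizes do not divide evenly.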

pythonsandpandas
- 41
- 3
4
votes
1 answer
How to view cluster centroids for each iteration of n_init using sklearn's KMeans
I am currently trying to view the created centroids(cluster centers) for each iteration of KMeans that is determined from each iteration of n_init. As of now I am able to view the final results but I would like to see these at each iteration so I am…
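A possible workaround, offered as a sketch rather than a way to inspect the internals of a single `fit` call: run `KMeans` several times with `n_init=1` and a different `random_state` each time, and record the centroids of every run. The data and the number of runs here are made up, and these runs are not guaranteed to match the initializations that one call with a larger `n_init` would use.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical data: two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.5, size=(20, 2)),
    rng.normal(5.0, 0.5, size=(20, 2)),
])

# One independent initialization per seed; keep inertia and centroids.
centers_per_run = []
for seed in range(3):
    km = KMeans(n_clusters=2, n_init=1, random_state=seed).fit(X)
    centers_per_run.append((km.inertia_, km.cluster_centers_))

# The run with the lowest inertia plays the role of the final result.
best_inertia, best_centers = min(centers_per_run, key=lambda run: run[0])
```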

Tired_GradStudent
- 43
- 4
4
votes
1 answer
Linear fit to pandas.datetime64 values?
I have a dataframe with two columns (age, date) indicating the age of a person and the current date. I want to approximate the date of birth from that data. I thought to fit a linear model and find the interception with the, but it does not work out…
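A rough sketch of one way to do this, assuming the goal is the date at which the fitted age reaches zero: convert the dates to a plain number (days since the earliest date), fit a line, and convert the zero crossing back to a timestamp. The sample rows are invented.

```python
import numpy as np
import pandas as pd

# Hypothetical data: age in years observed on three dates.
df = pd.DataFrame({
    "date": pd.to_datetime(["2010-01-01", "2011-01-01", "2012-01-01"]),
    "age": [10.0, 11.0, 12.0],
})

# datetime64 values cannot be fed to a linear fit directly,
# so express each date as days elapsed since the first date.
origin = df["date"].min()
x = (df["date"] - origin).dt.days.to_numpy(dtype=float)

# Fit age = slope * x + intercept.
slope, intercept = np.polyfit(x, df["age"].to_numpy(), 1)

# Age is zero where x = -intercept / slope; map that back to a date.
birth = origin + pd.to_timedelta(-intercept / slope, unit="D")
```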

Soerendip
- 7,684
- 15
- 61
- 128
4
votes
1 answer
How to groupby() aggregate on multiple columns and rename the multi-index in Pandas 0.21+?
Code
import pandas as pd
df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
'B': range(5),
'C': range(5)})
df1 = df.groupby('A').B.agg({'B': ['count','nunique'],'C': ['sum','median']})
df1.columns = ["_".join(x) for x…
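A sketch of one way this can work on pandas 0.21 and later, using the same sample frame: call `agg` on the grouped frame rather than on the single column `B`, so the dict keys select columns instead of renaming results, then flatten the resulting two-level columns.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 2, 2],
                   'B': range(5),
                   'C': range(5)})

# Aggregate on the grouped frame: dict keys pick the columns,
# the lists name the functions applied to each.
df1 = df.groupby('A').agg({'B': ['count', 'nunique'],
                           'C': ['sum', 'median']})

# Columns are now a MultiIndex such as ('B', 'count');
# join the two levels into flat names.
df1.columns = ["_".join(col) for col in df1.columns]
```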

GeorgeOfTheRF
- 8,244
- 23
- 57
- 80
4
votes
1 answer
python scipy spearman correlations
I am trying to obtain the column names from the dataframe (df) and associate them to the resulting array produced by the spearmanr correlation function. I need to associate both the column names (a-j) back to the correlation value (spearman) and…
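One possible sketch, assuming a frame with more than two numeric columns (in that case `spearmanr` returns square matrices rather than scalars): wrap the returned arrays in DataFrames that reuse the original column names. The three columns below are invented for illustration.

```python
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical frame; b rises with a, c falls with a.
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
})

# With a 2D input of more than two columns, rho and pval are matrices
# whose row/column order follows the input columns.
rho, pval = spearmanr(df.to_numpy())

# Re-attach the column names on both axes.
rho_df = pd.DataFrame(rho, index=df.columns, columns=df.columns)
pval_df = pd.DataFrame(pval, index=df.columns, columns=df.columns)
```

If only the coefficients are needed, `df.corr(method="spearman")` gives a labeled matrix directly.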

Kyle
- 387
- 1
- 5
- 13
4
votes
1 answer
scikit-learn : ValueError: not enough values to unpack (expected 2, got 1)
There is a check_array function for calculating mean absolute percentage error (MAPE) in the recent version of sklearn but it doesn't seem to work the same way as the previous version.
import numpy as np
from sklearn.utils import check_array
def…
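The asker's function is cut off above, so the following is only a guess at the general shape: `check_array` validates a single array and returns a single array, so it is called once per input rather than unpacked into two names. The function name `mape` and the `ensure_2d=False` choice are assumptions for this sketch.

```python
import numpy as np
from sklearn.utils import check_array

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (sketch)."""
    # check_array returns one validated array per call;
    # ensure_2d=False lets plain 1D inputs through.
    y_true = check_array(y_true, ensure_2d=False).astype(float)
    y_pred = check_array(y_pred, ensure_2d=False).astype(float)
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

result = mape([100, 200], [110, 180])
```

This sketch does not guard against zeros in `y_true`, which would make the ratio undefined.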

Desta Haileselassie Hagos
- 23,140
- 7
- 48
- 53
4
votes
2 answers
Constrain the sum of coefficients with scikit-learn linear model
I am doing a LassoCV with 1000 coefs. Statsmodels did not seem to be able to handle this many coefs. So I am using scikit-learn. Statsmodels allowed for .fit_constrained("coef1 + coef2...=1"). This constrained the sum of the coefs to = 1. I need to do…

TChi
- 383
- 1
- 6
- 14