Questions tagged [cross-validation]

Cross-Validation is a method of evaluating and comparing predictive systems in statistics and machine learning.

Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model.

In typical cross-validation, the training and validation sets cross over in successive rounds so that each data point gets a chance to be validated against. The basic form of cross-validation is k-fold cross-validation.

Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation.
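As a concrete illustration, here is a minimal k-fold sketch with scikit-learn (the dataset and model are placeholders):

    # Minimal k-fold cross-validation sketch; dataset and model are placeholders.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = load_iris(return_X_y=True)
    model = LogisticRegression(max_iter=1000)
    # With cv=5, each of the 5 folds serves as the validation set exactly once.
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean(), scores.std())
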

2604 questions
23 votes · 5 answers

Cross Validation in Keras

I'm implementing a Multilayer Perceptron in Keras and using scikit-learn to perform cross-validation. For this, I was inspired by the code found in the issue Cross Validation in Keras: from sklearn.cross_validation import StratifiedKFold def…
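A minimal sketch of one common approach, not necessarily the asker's code: loop over StratifiedKFold splits and rebuild the Keras model each round so no weights carry over (note that sklearn.cross_validation has since been replaced by sklearn.model_selection; the data below is a random placeholder):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold  # modern home of StratifiedKFold
    from tensorflow import keras

    X = np.random.rand(200, 10)          # placeholder data
    y = np.random.randint(0, 2, 200)

    def build_model(n_features):
        # Small MLP, rebuilt from scratch for every fold.
        model = keras.Sequential([
            keras.layers.Input(shape=(n_features,)),
            keras.layers.Dense(32, activation="relu"),
            keras.layers.Dense(1, activation="sigmoid"),
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy",
                      metrics=["accuracy"])
        return model

    scores = []
    for train_idx, val_idx in StratifiedKFold(n_splits=5, shuffle=True).split(X, y):
        model = build_model(X.shape[1])  # fresh weights each round
        model.fit(X[train_idx], y[train_idx], epochs=10, verbose=0)
        scores.append(model.evaluate(X[val_idx], y[val_idx], verbose=0)[1])
    print(np.mean(scores))
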
23 votes · 3 answers

Cross-validation in LightGBM

How are we supposed to use the dictionary output from lightgbm.cv to improve our predictions? Here's an example - we train our cv model using the code below: cv_mod = lgb.cv(params, d_train, 500, …
Nlind (331 rep · 1 gold · 3 silver · 4 bronze)
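A hedged sketch of the usual pattern: run lgb.cv with early stopping to choose the number of boosting rounds, then retrain on all the data with that count. The data is a placeholder, and the metric key name in the returned dictionary varies across LightGBM versions, hence the endswith lookup:

    import lightgbm as lgb
    import numpy as np

    X = np.random.rand(500, 10)          # placeholder data
    y = np.random.randint(0, 2, 500)
    d_train = lgb.Dataset(X, label=y)
    params = {"objective": "binary", "metric": "binary_logloss", "verbosity": -1}

    cv_mod = lgb.cv(params, d_train, num_boost_round=500,
                    callbacks=[lgb.early_stopping(stopping_rounds=25)])
    # The mean-metric key is e.g. 'valid binary_logloss-mean' in recent versions.
    mean_key = next(k for k in cv_mod if k.endswith("-mean"))
    best_rounds = len(cv_mod[mean_key])  # rounds surviving early stopping
    final_model = lgb.train(params, d_train, num_boost_round=best_rounds)
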
23 votes · 4 answers

Sklearn preprocessing - PolynomialFeatures - How to keep column names/headers of the output array / dataframe

TLDR: How to get headers for the output numpy array from the sklearn.preprocessing.PolynomialFeatures() function? Let's say I have the following code... import pandas as pd import numpy as np from sklearn import preprocessing as pp a =…
Afflatus (2,302 rep · 5 gold · 25 silver · 40 bronze)
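A minimal sketch, assuming scikit-learn 1.0 or later, where get_feature_names_out() maps the expanded array back to readable column names:

    import pandas as pd
    from sklearn.preprocessing import PolynomialFeatures

    a = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [4.0, 5.0, 6.0]})  # toy frame
    poly = PolynomialFeatures(degree=2)
    out = pd.DataFrame(poly.fit_transform(a),
                       columns=poly.get_feature_names_out(a.columns))
    print(list(out.columns))  # ['1', 'x1', 'x2', 'x1^2', 'x1 x2', 'x2^2']
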
22 votes · 1 answer

Using sklearn cross_val_score and kfolds to fit and help predict model

I'm trying to understand k-fold cross-validation from the sklearn Python module. I understand the basic flow: instantiate a model, e.g. model = LogisticRegression(); fit the model, e.g. model.fit(xtrain, ytrain); predict, e.g.…
hselbie (1,749 rep · 9 gold · 24 silver · 40 bronze)
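A minimal sketch of how the pieces usually fit together: cross_val_score fits internal clones of the model purely for scoring, and a separate fit() on the full training data produces the model actually used for predictions (the dataset is a placeholder):

    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold, cross_val_score

    X, y = load_breast_cancer(return_X_y=True)
    model = LogisticRegression(max_iter=5000)
    cv = KFold(n_splits=5, shuffle=True, random_state=0)
    print(cross_val_score(model, X, y, cv=cv))  # per-fold evaluation only

    model.fit(X, y)               # fit once on everything for real predictions
    preds = model.predict(X)
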
22 votes · 4 answers

Understanding Python xgboost cv

I would like to use the xgboost cv function to find the best parameters for my training data set. I am confused by the API. How do I find the best parameters? Is this similar to the sklearn grid_search cross-validation function? How can I find which…
kilojoules (9,768 rep · 18 gold · 77 silver · 149 bronze)
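A hedged sketch: xgb.cv evaluates a single parameter set across folds (it is not a grid search like sklearn's GridSearchCV); the length of the returned DataFrame after early stopping gives a reasonable boosting-round count for those parameters. Placeholder data:

    import numpy as np
    import xgboost as xgb

    X = np.random.rand(500, 10)
    y = np.random.randint(0, 2, 500)
    dtrain = xgb.DMatrix(X, label=y)
    params = {"objective": "binary:logistic", "max_depth": 4, "eta": 0.1}

    cv_df = xgb.cv(params, dtrain, num_boost_round=500, nfold=5,
                   early_stopping_rounds=25, metrics="logloss")
    # Rows = boosting rounds kept; the last row holds the best mean validation loss.
    print(len(cv_df), cv_df["test-logloss-mean"].iloc[-1])
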
22 votes · 2 answers

How to cross-validate a RandomForest model?

I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same or do I have to perform cross validation manually?
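Spark ML does ship such a utility. A hedged PySpark sketch, assuming a prepared DataFrame df with 'features' and 'label' columns:

    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import BinaryClassificationEvaluator
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

    rf = RandomForestClassifier(labelCol="label", featuresCol="features")
    grid = ParamGridBuilder().addGrid(rf.numTrees, [20, 50]).build()
    cv = CrossValidator(estimator=rf, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=5)
    cv_model = cv.fit(df)   # df is assumed to exist with the columns above
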
21 votes · 2 answers

scikit-learn GridSearchCV with multiple repetitions

I'm trying to get the best set of parameters for an SVR model. I'd like to use the GridSearchCV over different values of C. However, from the previous test, I noticed that the split into the Training/Test set highly influences the overall…
Titus Pullo (3,751 rep · 15 gold · 45 silver · 65 bronze)
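A hedged sketch of one way to damp that sensitivity: pass RepeatedKFold as the cv argument so the whole k-fold split is repeated with different shuffles (placeholder data):

    import numpy as np
    from sklearn.model_selection import GridSearchCV, RepeatedKFold
    from sklearn.svm import SVR

    X = np.random.rand(100, 5)
    y = np.random.rand(100)

    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
    search = GridSearchCV(SVR(), {"C": [0.1, 1, 10]}, cv=cv)
    search.fit(X, y)
    print(search.best_params_)   # averaged over 50 train/test partitions
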
20 votes · 3 answers

What does KFold in python exactly do?

I am looking at this tutorial: https://www.dataquest.io/mission/74/getting-started-with-kaggle I got to part 9, making predictions. There, some data in a dataframe called titanic is divided into folds using: # Generate cross…
user (2,015 rep · 6 gold · 22 silver · 39 bronze)
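A minimal sketch of what KFold actually yields: arrays of row indices, not data, with every index landing in the test fold exactly once:

    import numpy as np
    from sklearn.model_selection import KFold

    X = np.arange(12).reshape(6, 2)   # six toy rows
    for train_idx, test_idx in KFold(n_splits=3).split(X):
        print("train:", train_idx, "test:", test_idx)
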
19 votes · 1 answer

(Python - sklearn) How to pass parameters to the custom ModelTransformer class by gridsearchcv

Below is my pipeline, and it seems I can't pass parameters to my models using the ModelTransformer class, which I took from this link (http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html). The error message…
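A hedged sketch of the general mechanism the question depends on: GridSearchCV reaches parameters of nested pipeline steps through double-underscore names ('<step name>__<parameter>'); the steps here are standard stand-ins, not the asker's ModelTransformer:

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())])
    # 'model__alpha' routes the value to the alpha parameter of the 'model' step.
    grid = GridSearchCV(pipe, {"model__alpha": [0.1, 1.0, 10.0]}, cv=3)
    grid.fit(np.random.rand(60, 4), np.random.rand(60))  # placeholder data
    print(grid.best_params_)
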
18 votes · 2 answers

Cross-validation for grouped time-series (panel) data

I work with panel data: I observe a number of units (e.g. people) over time; for each unit, I have records for the same fixed time intervals. When splitting the data into train and test sets, we need to make sure that both sets are disjoint and…
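A hedged sketch of one manual approach (there is no single canonical utility for this): cut the panel on time boundaries so every unit contributes to both sets but no time period is split across train and test:

    import pandas as pd

    # Toy panel: one row per (unit, time) pair.
    panel = pd.DataFrame({
        "unit": [1, 1, 1, 2, 2, 2],
        "time": [1, 2, 3, 1, 2, 3],
        "y":    [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    })
    for cutoff in sorted(panel["time"].unique())[1:]:
        train = panel[panel["time"] < cutoff]   # strictly earlier periods
        test = panel[panel["time"] == cutoff]   # the held-out period
        print(f"cutoff={cutoff}: {len(train)} train rows, {len(test)} test rows")
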
18 votes · 2 answers

Cross-validation in sklearn: do I need to call fit() as well as cross_val_score()?

I would like to use k-fold cross validation while learning a model. So far I am doing it like this: # splitting dataset into training and test sets X_train, X_test, y_train, y_test = train_test_split(dataset_1, df1['label'], test_size=0.25,…
torayeff (9,296 rep · 19 gold · 69 silver · 103 bronze)
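A minimal sketch of the usual answer's shape: cross_val_score fits fresh clones internally, so no extra fit() is needed for the scores themselves; fit() only matters for the final model you keep:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)

    clf = DecisionTreeClassifier()
    print(cross_val_score(clf, X_train, y_train, cv=5))  # clf itself stays unfitted
    clf.fit(X_train, y_train)        # now fit the model you will actually use
    print(clf.score(X_test, y_test))
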
18 votes · 1 answer

ValueError: n_splits=10 cannot be greater than the number of members in each class

I am trying to run the following code: from sklearn.model_selection import StratifiedKFold X = ["hey", "join now", "hello", "join today", "join us now", "not today", "join this trial", " hey hey", " no", "hola", "bye", "join today", "no","join…
SFC (733 rep · 2 gold · 11 silver · 22 bronze)
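A hedged sketch that reproduces the constraint: StratifiedKFold needs at least n_splits samples of every class, so either lower n_splits to the rarest class's count or collect more examples of that class (toy labels below):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold

    X = np.zeros((7, 1))                   # toy features
    y = np.array([0, 0, 0, 0, 1, 1, 1])    # rarest class has only 3 members

    # StratifiedKFold(n_splits=10).split(X, y) would raise this exact ValueError.
    n_splits = min(10, np.bincount(y).min())
    for train_idx, test_idx in StratifiedKFold(n_splits=n_splits).split(X, y):
        print(test_idx)
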
18 votes · 1 answer

sklearn cross_val_score gives lower accuracy than manual cross validation

I'm working on a text classification problem, which I've set up like so (I've left out the data processing steps for concision, but they'll produce a dataframe called data with columns X and y): import sklearn.model_selection as ms from…
Empiromancer (3,778 rep · 1 gold · 22 silver · 53 bronze)
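A hedged sketch of the usual diagnosis: give the manual loop and cross_val_score the identical splitter and scorer; if the folds and metric match, any remaining gap usually points to preprocessing fitted outside the loop (data leakage). Toy text data below:

    import numpy as np
    import sklearn.model_selection as ms
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    X = np.array(["good movie", "bad movie", "great film", "awful film"] * 10)
    y = np.array([1, 0, 1, 0] * 10)

    cv = ms.StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    pipe = make_pipeline(CountVectorizer(), MultinomialNB())  # vectorizer refit per fold

    auto = ms.cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    manual = [pipe.fit(X[tr], y[tr]).score(X[te], y[te])
              for tr, te in cv.split(X, y)]
    print(auto.mean(), np.mean(manual))   # identical folds -> identical means
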
17 votes · 2 answers

Sklearn custom transformers: difference between using FunctionTransformer and subclassing TransformerMixin

In order to do proper CV, it is advisable to use pipelines so that the same transformations can be applied to each fold in the CV. I can define custom transformations using either sklearn.preprocessing.FunctionTransformer or by subclassing…
artemis (581 rep · 1 gold · 4 silver · 13 bronze)
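A minimal sketch of the practical difference: FunctionTransformer wraps a stateless function, while a TransformerMixin subclass can learn per-fold state in fit() (the MeanCenterer below is a hypothetical example, not sklearn API):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.preprocessing import FunctionTransformer

    log_tf = FunctionTransformer(np.log1p)   # stateless: nothing learned in fit

    class MeanCenterer(BaseEstimator, TransformerMixin):
        def fit(self, X, y=None):
            self.means_ = np.asarray(X).mean(axis=0)  # state learned per fold
            return self

        def transform(self, X):
            return np.asarray(X) - self.means_

    X = np.random.rand(10, 3)                # placeholder data
    print(log_tf.fit_transform(X).shape, MeanCenterer().fit_transform(X).shape)
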
17 votes · 2 answers

How to compute precision, recall and F1 score of an imbalanced dataset with k-fold cross-validation?

I have an imbalanced dataset for a binary classification problem. I have built a Random Forest Classifier and used k-fold cross-validation with 10 folds. kfold = model_selection.KFold(n_splits=10,…
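A hedged sketch of one way to get all three metrics per fold in a single pass via cross_validate's multi-scorer interface; StratifiedKFold keeps the class ratio in every fold, which matters for imbalanced labels (placeholder data):

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    X = np.random.rand(200, 8)                        # placeholder features
    y = np.random.choice([0, 1], 200, p=[0.8, 0.2])   # imbalanced labels

    kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    res = cross_validate(RandomForestClassifier(), X, y, cv=kfold,
                         scoring=["precision", "recall", "f1"])
    print(res["test_precision"].mean(),
          res["test_recall"].mean(),
          res["test_f1"].mean())
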