Logistic regression training and test data

Question

I am a beginner in R and am having trouble with something that feels basic. I have a data set with 1319 rows, and I want to set up training data from observations 1 to 1000 and test data from observations 1001 to 1319.

In my class notes, the professor set this up by building a Boolean vector from the 'Year' variable in her data. For example:

train=(Year<2005)

That returns a vector of TRUE/FALSE values.

I understand that, and I could set up a Boolean vector if I were subsetting my data by a variable. Instead, I have to split strictly by row number, which I do not understand how to accomplish. I tried

train=(data$nrow < 1001)

But I got logical(0) as a result.

Can anyone point me in the right direction?

fmarm · Answer 1 · 2019-10-30T01:13:11.120


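You get logical(0) because nrow is not a column of your data frame: data$nrow is NULL, and comparing NULL with a number returns an empty logical vector.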
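If you want a Boolean vector like the one in your class notes, one possible sketch is to compare the row positions, rather than a column, against 1000. The data frame below is a made-up stand-in for your 1319 rows (its columns x and y are assumptions, not your real variables):

```r
# Hypothetical stand-in for the 1319-row data set
data <- data.frame(x = rnorm(1319), y = rnorm(1319))

# There is no column called "nrow", so data$nrow is NULL and the
# comparison yields a zero-length logical vector, printed as logical(0)
is.null(data$nrow)
length(data$nrow < 1001)

# Boolean vector: TRUE for rows 1 to 1000, FALSE for rows 1001 to 1319
train <- seq_len(nrow(data)) <= 1000

train_data <- data[train, ]   # first 1000 rows
test_data  <- data[!train, ]  # remaining 319 rows
```

Here !train flips every TRUE/FALSE, so data[!train, ] picks exactly the rows that data[train, ] leaves out.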

You can subset your data frame using row numbers:

train = 1:1000 # vector with integers from 1 to 1000
test = 1001:nrow(data)
train_data = data[train,]
test_data = data[test,]

But be careful: unless the rows of your data frame are in completely random order, you probably want 1000 randomly chosen rows rather than the first 1000. You can do this using

train = sample(1:nrow(data),1000)

You can then get your train_data and test_data using

train_data = data[train,]
test_data = data[setdiff(1:nrow(data),train),]

The setdiff function returns all row numbers that were not selected in train.
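As a minimal sketch of the random split, assuming a made-up data frame and an arbitrary seed of 42 (neither comes from the question), you could fix the seed so the same rows are drawn on every run. Negative indexing is shown as an alternative to setdiff:

```r
# Hypothetical stand-in for the 1319-row data set
data <- data.frame(x = rnorm(1319), y = rnorm(1319))

set.seed(42)  # arbitrary seed, only there to make the split reproducible
train <- sample(nrow(data), 1000)  # 1000 distinct row numbers

train_data <- data[train, ]
test_data  <- data[-train, ]  # drop the training rows, keep the rest

# Same rows as the setdiff version
identical(test_data, data[setdiff(1:nrow(data), train), ])
```

Because sample draws without replacement by default, no row number appears twice in train, and the two sets do not overlap.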

edited Oct 30 '19 at 01:13

answered Oct 30 '19 at 00:37

fmarm


How would I set up the test data, then, using [!train,] for the second part? This just subsets the data rather than setting up the training/testing sets conditionally on whether a row is in the first 1000 or not. – Oct 30 '19 at 00:53

Nick · Answer 2 · 2019-10-30T01:27:52.087

The issue with splitting your data set by row position is that it can introduce bias into your training and testing sets, particularly for ordered data.

# Create a data set
data <- data.frame(year = sample(seq(2000, 2019, by = 1), 1000, replace = TRUE),
                   data = sample(seq(0, 1, by = 0.01), 1000, replace = TRUE))

nrow(data)
[1] 1000

If you really want to take the first n rows then you can try:

first.n.rows <- data[1:1000, ]

The caret package provides a more reliable way to split your data into training and testing sets.

First create the partition rule:

library(caret)
inTrain <- createDataPartition(y = data$year,
                           p = 0.8, list = FALSE)

Note the argument y = data$year: it tells caret to stratify the random sample on the variable year, so the training and testing sets have a similar distribution of year and the split does not depend on row order.

The p argument tells caret how much of the original data should be partitioned to the training set, in this case 80%.

Then apply the partition to the data set:

# Create the training set
train <- data[inTrain,]

# Create the testing set
test <- data[-inTrain,]

nrow(train) + nrow(test)
[1] 1000
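Once train and test exist, however they were produced, they could be used for a logistic regression roughly as follows. This is only a sketch: the data are simulated, the columns x and y are hypothetical, the split takes the first 1000 of 1319 rows with base R instead of caret, and the 0.5 cutoff is an arbitrary choice.

```r
# Simulated data: a numeric predictor x and a 0/1 outcome y (both hypothetical)
set.seed(1)
n <- 1319
x <- rnorm(n)
y <- rbinom(n, 1, plogis(0.5 * x))
data <- data.frame(x = x, y = y)

# First 1000 rows for training, the rest for testing
train <- data[1:1000, ]
test  <- data[1001:n, ]

# Fit on the training set only
fit <- glm(y ~ x, data = train, family = binomial)

# Predicted probabilities for the held-out rows
prob <- predict(fit, newdata = test, type = "response")

# Classify with a 0.5 cutoff and compute test-set accuracy
pred <- ifelse(prob > 0.5, 1, 0)
mean(pred == test$y)
```

The point of the split is that the model never sees the test rows during fitting, so the accuracy is measured on rows it has not seen.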

Hey, so this is for a class assignment where the instructions say to use the first 1000 rows for training and the rest for the testing data. Also, there is no year variable; I was just trying to explain how the professor accomplished this with her own data set. – , Oct 30 '19 at 01:14
