
I am looking for a robust way to partition a dataset without using the sample() function, and hope to get some feedback.

In fact, I'd ideally like to get rid of the randomness inherent in using sample():

samp <- data.frame(qldat)  # convert zoo time-series object to a data.frame
ind <- sample(2, nrow(samp), replace = TRUE, prob = c(0.8, 0.2))  # split the data
# series between training and test sets
tsamp <- samp[ind == 1, ]  # training set
vsamp <- samp[ind == 2, ]  # test set

After some research, I figured out that subset() could help, but it would involve a bit of hard-coding against the dataset. By hard-coding I mean that, for an 80:20 (%) split using nrow(samp), it's possible to subset the data from row 1 to row 0.8 * nrow(samp), for instance (see the sketch below), acknowledging that it might not be a very efficient solution.
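For instance, a hard-coded version of that split could look like the sketch below (it assumes the rows of samp are already in the order the split should follow; n_train is just an illustrative name):

n_train <- round(0.8 * nrow(samp))                     # number of rows in the 80% slice
tsamp <- subset(samp, seq_len(nrow(samp)) <= n_train)  # training set: rows 1 to n_train
vsamp <- subset(samp, seq_len(nrow(samp)) > n_train)   # test set: the remaining rows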

I've also tried createDataPartition(), but it didn't match my expectations since samp doesn't hold any categorical data I could rely on for the split (e.g. createDataPartition(y = samp$categoricaldata, p = 0.8, list = FALSE)).

PS: What I like about the ind <- line is the inclusion of prob = c(0.8, 0.2), so the slice is sorted out automatically. Hence, any similar idea that splits tsamp and vsamp without randomness would be much appreciated.

Best,

owner
  • If not random, how do you want to determine which part of the data is training and which is test? Do you always want to do this by the sequence in which rows appear in the data? What happens if a 'clear' split cannot be done (like probs of 0.2 and 0.8 and 101 observations)? – Heroka Dec 20 '15 at 18:27
  • @Heroka: by using `nrow()`. If a particular column is indexed over, say, `1:10`, then take the first split as `1:8` (i.e. `1:round(10 * 0.8)`) for `tsamp`. Hope it helps. – owner Dec 20 '15 at 18:44

1 Answer


Is this what you are looking for?

n <- nrow(samp)
train_i <- 1:round(0.8 * n)     # indices of the first 80% of rows
test_i <- round(0.8 * n + 1):n  # indices of the remaining 20%
train <- samp[train_i, ]        # training set
test <- samp[test_i, ]          # test set
mtoto
  • Thanks, you've definitely got my point. Your suggestion does work and I will accept it as the answer, although, as indicated in the thread, I would rather not hard-code the dataset using `0.8 * n`, for instance. Maybe there is no other alternative... – owner Dec 20 '15 at 19:05
  • do you want the split proportions to be random or what exactly are you looking for? – mtoto Dec 20 '15 at 19:19
  • So far I'm happy with the static weights, but I just thought there could be a different way to replicate it within a single statement, like in `ind <-` (see the sketch after these comments). – owner Dec 20 '15 at 19:28
  • Can you describe a setting in which the above approach doesn't work? – mtoto Dec 20 '15 at 19:34
  • "All roads lead[ing] to rome", your approach works, no worries. Mine edited in the post via `<-ind` works too, and I had a slight preference towards the inclusion of the weight vector. Unfortunately and as expected, my training and test sets were randomly populated due to `sample()` and I had been looking for a similar syntax (if existing) without `sample()`. Hope it's clear. – owner Dec 20 '15 at 20:19
  • Won't this include one sample in both sets? – Heroka Dec 20 '15 at 22:41
  • @Heroka: see `tsamp` and `vsamp` as complements. – owner Dec 21 '15 at 06:44
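For the single-statement idea raised in these comments, one possible sketch is to build the group index deterministically from the same weight vector used for prob = c(0.8, 0.2); it assumes the rows of samp are already in the order the split should follow:

w <- c(0.8, 0.2)                                            # split weights, as in prob = c(0.8, 0.2)
n <- nrow(samp)
ind <- rep(seq_along(w), diff(c(0, round(cumsum(w) * n))))  # 1s then 2s, no randomness
tsamp <- samp[ind == 1, ]                                   # training set: the first ~80% of rows
vsamp <- samp[ind == 2, ]                                   # test set: the remaining ~20% of rows

The group sizes always sum to nrow(samp) (e.g. 81 and 20 for the 101-observation case mentioned above), and the same line generalizes to more than two groups, e.g. w <- c(0.6, 0.2, 0.2).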