
I want to split my data into training and test sets in R, but with the choose() function rather than sample().

My dataset (a CSV file) has 58 rows and 28 columns, and I want to run 5-fold or 10-fold cross-validation (CV) on it.

How should I write the code for this task?

I've tried:

set.seed(1)
smp_size=choose(58,5, name_dataset) # which is totally wrong but ... 
# I haven't figured out yet how to take 5 subsets from 58 observations
# each time I do a 5/10 -fold  CV

train_ind=sample(seq_len(nrow(name_dataset)),size=smp_size) # I think sample here is wrong too
train=name_dataset[train_ind,]
test=name_dataset[-train_ind,]
jpmarinier
pkr
  • Why do you want to use the `choose()` function rather than `sample()`? It might help to answer the question if we understand the rationale – rw2 Mar 15 '22 at 12:06
  • I think the choose function takes every possible 5-subset of the 58 obs each time, and that's why I need it: I want EVERY possible combination of 5 subs. sample() simply chooses only 5 out of 58 obs. I may be wrong on this and maybe I have simply misunderstood what each function does. I am new to R, sorry! – pkr Mar 15 '22 at 12:15
  • I think I understand. But there are more than 4.5 million possible combinations of 5 samples from 58. Surely you don't want to test all of them? – rw2 Mar 15 '22 at 12:40

1 Answer


I don't know what you mean by every possible combination of 5-subsets; that is an incredibly large number of possibilities. I assume you mean that you want to split your dataset into 5 subsets that together contain all of its rows. I would probably do something like this. We first make a vector of group labels running from 1 to k, repeated until there is one label per row of the dataset. We then shuffle those labels randomly and split the dataset by them.

library(tidyverse)

set.seed(3465)
test_data <- tibble(A = runif(58),
                    B = runif(58))


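# Split `dat` into k randomly assigned, near-equal folds (returns a list of tibbles)
k_split <- function(dat, k, seed = 1){
  set.seed(seed)  # make the random assignment reproducible
  # fold labels 1..k, recycled to one label per row
  grp <- rep(1:k, length.out = nrow(dat))
  dat |>
    mutate(grp = sample(grp, nrow(dat), replace = FALSE)) |>  # shuffle the labels
    group_split(grp) |>                                        # one tibble per fold
    map(\(d) select(d, -grp))                                  # drop the helper column
}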

k_split(test_data, 5)
#> [[1]]
#> # A tibble: 12 x 2
#>        A      B
#>    <dbl>  <dbl>
#>  1 0.476 0.468 
#>  2 0.636 0.639 
#>  3 0.334 0.0269
#>  4 0.668 0.220 
#>  5 0.398 0.919 
#>  6 0.343 0.748 
#>  7 0.799 0.526 
#>  8 0.710 0.759 
#>  9 0.737 0.927 
#> 10 0.819 0.441 
#> 11 0.852 0.656 
#> 12 0.416 0.541 
#> 
#> [[2]]
#> # A tibble: 12 x 2
#>         A      B
#>     <dbl>  <dbl>
#>  1 0.0107 0.905 
#>  2 0.109  0.539 
#>  3 0.715  0.778 
#>  4 0.523  0.416 
#>  5 0.609  0.357 
#>  6 0.152  0.0972
#>  7 0.919  0.450 
#>  8 0.866  0.510 
#>  9 0.0347 0.0890
#> 10 0.862  0.465 
#> 11 0.364  0.765 
#> 12 0.789  0.601 
#> 
#> [[3]]
#> # A tibble: 12 x 2
#>         A      B
#>     <dbl>  <dbl>
#>  1 0.580  0.228 
#>  2 0.201  0.0418
#>  3 0.0359 0.417 
#>  4 0.521  0.758 
#>  5 0.534  0.974 
#>  6 0.580  0.563 
#>  7 0.844  0.781 
#>  8 0.756  0.271 
#>  9 0.211  0.533 
#> 10 0.851  0.764 
#> 11 0.885  0.150 
#> 12 0.262  0.371 
#> 
#> [[4]]
#> # A tibble: 11 x 2
#>         A     B
#>     <dbl> <dbl>
#>  1 0.556  0.313
#>  2 0.353  0.821
#>  3 0.0959 0.861
#>  4 0.759  0.261
#>  5 0.207  0.772
#>  6 0.668  0.527
#>  7 0.150  0.788
#>  8 0.0939 0.257
#>  9 0.0913 0.817
#> 10 0.294  0.790
#> 11 0.0224 0.253
#> 
#> [[5]]
#> # A tibble: 11 x 2
#>          A      B
#>      <dbl>  <dbl>
#>  1 0.0893  0.665 
#>  2 0.966   0.142 
#>  3 0.672   0.0849
#>  4 0.641   0.155 
#>  5 0.490   0.187 
#>  6 0.00394 0.295 
#>  7 0.126   0.813 
#>  8 0.202   0.474 
#>  9 0.0740  0.107 
#> 10 0.412   0.709 
#> 11 0.509   0.253
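
To get from folds like these to the train/test split the question asks about, each fold can serve as the test set once while the remaining rows form the training set. Below is a minimal base-R sketch of that loop, not a definitive implementation. The data frame built here is a made-up stand-in for `name_dataset` with hypothetical columns `x` and `y`, and the `lm()` model and RMSE metric are assumptions chosen only for illustration. The first line also shows that `choose()` only counts combinations, which is why it cannot be used to select rows.

```r
# choose() only counts combinations; it does not pick any rows.
n_combinations <- choose(58, 5)  # number of distinct 5-row subsets of 58 rows

# Hypothetical stand-in for the 58-row dataset from the question.
set.seed(1)
name_dataset <- data.frame(x = runif(58))
name_dataset$y <- 2 * name_dataset$x + rnorm(58, sd = 0.1)

k <- 5
n <- nrow(name_dataset)

# Assign every row to exactly one of k folds, in random order.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Each fold is the test set once; the other k - 1 folds are the training set.
rmse <- numeric(k)
for (i in seq_len(k)) {
  test_ind <- which(fold_id == i)
  train <- name_dataset[-test_ind, ]
  test  <- name_dataset[test_ind, ]

  fit  <- lm(y ~ x, data = train)           # illustrative model (an assumption)
  pred <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$y - pred)^2))  # test error for this fold
}

cv_error <- mean(rmse)  # average test error across the k folds
```

Setting `k <- 10` gives 10-fold CV with the same code. With 58 rows the folds cannot all be the same size, so some test sets hold one row more than others.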
AndS.
  • It's safe to say that I can't understand anything from this code. I am new to this and I'm trying my best, but this is super confusing. I will just keep looking for my answer, or I'll find something close to what I'm looking for. Thank you for your time!! – pkr Mar 15 '22 at 12:55
  • What is confusing about this to you? I would be happy to explain. I defined a function to split a dataset into k-folds. Is that not what you are looking for? – AndS. Mar 15 '22 at 13:04