
I have the following dataset of dimension 22784 x 18:

head(MS.DATA.IN.NUM.ZeroVar)
  X    x1        x2        x3        x4        x5        x6        x7        x8        x9
1 1 15512 0.4608690 0.0492522 0.2264698 0.1498266 0.7528365 0.0100567 0.5797286 0.0032513
2 2  1550 0.4709677 0.0025806 0.1374194 0.0963415 0.8625806 0.0000000 0.6951424 0.0050251
3 3  4741 0.4853406 0.0002109 0.1894115 0.1356557 0.8569922 0.0000000 0.6835836 0.0041429
4 4   467 0.4989293 0.0000000 0.1006424 0.0854701 0.9079229 0.0000000 0.7804878 0.0060976
5 5   310 0.4741935 0.6806452 0.2258065 0.1288344 0.8967742 0.0000000 0.7563025 0.0084034
6 6   461 0.4750542 0.0867679 0.1301518 0.0950413 0.9240781 0.0000000 0.7926829 0.0000000
        x10       x11       x12       x13       x14       x15       x16    x17
1 0.0759118 0.6253178 0.0366129 0.9913769 0.2601165 0.0522456 0.7740586 130600
2 0.0435511 0.0642633 0.0033501 0.9949749 0.2852665 0.0606061 0.1428571  40500
3 0.0279648 0.0657958 0.0000000 0.9974107 0.3154330 0.0651163 0.6875000  28700
4 0.0182927 0.0574713 0.0000000 1.0000000 0.1494253 0.1395349 1.0000000  28500
5 0.0168067 0.0775194 0.6722689 0.9915966 0.1472868 0.0000000 0.0000000  24100
6 0.0060976 0.0888889 0.0548780 0.9939024 0.2722222 0.2941176 0.5000000  14999

I just want a basic sampling approach based on dataset size (instances/records):

I would like to create a function in which:

1: I set a size threshold, say 10000. If the dataset has <= 10000 rows, the full dataset (the population) is taken for analysis.

2: If the size is > 10000 and < 50000, the dataset is sampled down to, say, 15000 rows.

3: If the size is > 50000, the sample size should be curtailed to 20000.

I presume an if..else condition will be needed. Can it be done using the apply family and dplyr functions?
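The three rules above can be sketched directly with if/else. This is a minimal sketch, not a definitive implementation: the function name sample_by_size, the seed argument, and the toy data frames are my own choices, and the thresholds are taken from the question as stated.

```r
# Sketch of the size-based sampler described above (assumed name sample_by_size)
sample_by_size <- function(df, seed = 1) {
  n <- nrow(df)
  set.seed(seed)            # reproducible sampling
  if (n <= 10000) {
    df                      # rule 1: take the full dataset (population)
  } else if (n <= 50000) {
    df[sample(n, 15000), ]  # rule 2: fixed sample of 15000 rows
  } else {
    df[sample(n, 20000), ]  # rule 3: cap the sample at 20000 rows
  }
}

small <- data.frame(x = seq_len(5000))
big   <- data.frame(x = seq_len(60000))
nrow(sample_by_size(small))  # 5000: taken whole
nrow(sample_by_size(big))    # 20000
```

Note that rule 2 only works without replacement when the dataset has at least 15000 rows; for sizes between 10000 and 15000 you would need to cap the sample at nrow(df).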

Nishant

2 Answers


I think cut will be helpful here in determining the group and then sampling an appropriate number of rows:

# example data:
dat <- data.frame(row=seq_len(10000),id=seq_len(10000))
# sample away!
dat[sample(seq_len(nrow(dat)), c(nrow(dat),1.5e4,2e4)[cut(nrow(dat), c(0,1e4,5e4,Inf))]),]
thelatemail
  • There is a problem here: this gives an error if the dataset size is, say, 12000. By the code, the sample size would be 15000, but the sample size cannot be greater than the population size. – Nishant Jan 27 '17 at 09:04
  • sample_data <- data[sample(seq_len(nrow(data)), c(nrow(data), ifelse(nrow(data) < 1.5e4, nrow(data), 1.5e4), 2e4)[cut(nrow(data), c(0, 1e4, 5e4, Inf))]), ] — this should now work. – Nishant Jan 27 '17 at 09:09
  • @Nishant - I don't think the ifelse is necessary. You probably just need to adjust the cutoffs in the cut call so they align with the relative sizes of your samples. – thelatemail Jan 27 '17 at 09:15
0

This is my favorite way of generally splitting a dataset.

# proportions for each partition, named with the partition labels
spec <- c(train = 0.7, test = 0.3)
# assign each row a shuffled partition label, then split on it
division <- function(df, spec) sample(cut(seq(nrow(df)), nrow(df) * cumsum(c(0, spec)), labels = names(spec)))
dat <- split(MS.DATA.IN.NUM.ZeroVar, division(MS.DATA.IN.NUM.ZeroVar, spec))

And then you can access the sets with dat$train and dat$test

In this case you would just set your spec to ifelse(nrow(MS.DATA.IN.NUM.ZeroVar) <= 10000, 1, ifelse(nrow(MS.DATA.IN.NUM.ZeroVar) > 50000, 0.4, 0.3))
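A toy run of the split idiom from this answer, using the spec and division definitions above on a small assumed data frame, shows the 70/30 partition:

```r
# spec names become the partition labels; cut assigns rows to cumulative
# proportion brackets, and sample() shuffles which rows land in which bracket
spec <- c(train = 0.7, test = 0.3)
division <- function(df, spec)
  sample(cut(seq(nrow(df)), nrow(df) * cumsum(c(0, spec)), labels = names(spec)))

toy <- data.frame(x = seq_len(1000))  # assumed 1000-row example
set.seed(1)
parts <- split(toy, division(toy, spec))
sapply(parts, nrow)  # train 700, test 300
```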

Jean
  • Great way to sample. The only issue is that since you are using proportions, for size = 30000 I would get 18000 as the sample size (prob = 0.3), but the requirement was a fixed size of 15000 for anything between 10000 and 50000, and so on. It just needs a small tweak, I suppose. – Nishant Jan 27 '17 at 06:13