0

I am a new user to R and am trying to create multiple subsamples of a data frame. I have my data assigned to 4 stratum (STRATUM = 1, 2, 3, 4), and want to randomly keep only a specified number of rows in each stratum. To achieve this, I import my data, sort by the stratification value, then assign a random number to each row. I want to keep my original random number assignments since I need to use them again in future analyses, so I save a .csv with these values. Next, I subset the data by their stratum, and then specify the number of records that I want to retain in each stratum. Finally, I rejoin the data and save as a new .csv. The code works, however, I want to repeat this process 100 times. In each case I want to save the .csv with random numbers assigned, as well as the final .csv of randomly selected plots. I am unsure of how to get this block of code to repeat 100x, and also how to assign a unique file name for each iteration. Any help would be much appreciated.

DataFiles <- "//Documents/flownData_JR.csv"
PlotsFlown <- read.table (file = DataFiles, header = TRUE, sep = ",")
#Sort the data by the stratification
FlownStratSort <- PlotsFlown[order(PlotsFlown$STRATUM),]
#Create a new column with a random number (no duplicates)
FlownStratSort$RAND_NUM <- sample(137, size = nrow(FlownStratSort), replace = FALSE)
#Sort by the stratum, then random number
FLOWNRAND <- FlownStratSort[order(FlownStratSort$STRATUM,FlownStratSort$RAND_NUM),]
#Save a csv file with the random numbers
write.table(FLOWNRAND, file = "//Documents/RANDNUM1_JR.csv", sep = ",", row.names = FALSE, col.names = TRUE)
#Subset the data by stratum
FLOWNRAND1 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='1'),]
FLOWNRAND2 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='2'),]
FLOWNRAND3 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='3'),]
FLOWNRAND4 <- FLOWNRAND[which(FLOWNRAND$STRATUM=='4'),]
#Remove data from each stratum, specifying the number of records we want to retain
FLOWNRAND1 <- FLOWNRAND1[1:34, ]
FLOWNRAND2 <- FLOWNRAND2[1:21, ]
FLOWNRAND3 <- FLOWNRAND3[1:7, ]
FLOWNRAND4 <- FLOWNRAND4[1:7, ]
#Rejoin the data
FLOWNRAND_uneven <- rbind(FLOWNRAND1, FLOWNRAND2, FLOWNRAND3, FLOWNRAND4)
#Save the table with plots removed from each stratum flown in 2017
write.table(FLOWNRAND_uneven, file = "//Documents/Flown_RAND_uneven_JR.csv", sep = ",", row.names = FALSE, col.names = TRUE)

1 Answers1

0

Here's a data.table solution if you just need to know which rows are in each set.

library(data.table)
df <- data.table(dat = runif(100), 
                 stratum = sample(1:4, 100, replace = T))

# Gets specified number randomly from each strata
get_strata <- function(df, n, i){
  # Subset data frame to randomly chosen w/in strata
  # replace stratum with var name
  f <- df[df[, .I[sample(.N, n)], by = stratum]$V1]

  # Save as CSV, replace path
  write.csv(f, file = paste0("path/df_", i), 
            row.names = F, col.names = T)
}

for (i in 1:100){
  # replace 10 with number needed
  get_strata(df, 10, i)
}
Alex Gold
  • 335
  • 1
  • 9
  • As an end result I want a .csv that has all of my original data columns, but only 34 rows from stratum 1, 21 rows from stratum 2, 7 rows from stratum 3, and 7 rows from stratum 4. I want these rows within each stratum to be randomly selected, so that with repetition I am getting a different subset of rows in the stratum each time. I want to repeat this process 100x, generating 100 .csv files. – Jennifer Rodgers Jun 02 '17 at 15:41