I have a very large folder of images (train_dir) and a CSV file containing the class label for each of those images (train_df). Because the data is huge, I'd like to take only a sample of the images (say 25%) together with their labels from train_df. How would I do this in R?

My "train_dir" folder has around 150,000 images ('1.png', '2.png', ...), and my CSV file looks something like this (image: CSV file - train_df).

How should I go about writing an R script that does this?

1 Answer

Something along the lines of the following code will:

  1. Get a random subset of the row numbers, to serve as an index into train_df.
  2. Subset train_df to get a sample of PNG filenames. Since column "id" is a factor, convert it to character.
  3. Apply a PNG-reading function to each filename. Here I have used png::readPNG, but others can be used in the same way.

The code then becomes the following.

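perc <- 0.25                 # fraction of the images to keep
n <- nrow(train_df)
i <- sample(n, n*perc)       # random row numbers, without replacement
# "id" is a factor, so convert the sampled file names to character
png_filenames <- as.character(train_df[i, "id"])

# Read each sampled file; if "id" holds bare file names, prepend the folder path
png_files <- lapply(png_filenames, function(x){
  png::readPNG(x, native = TRUE)
})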
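
Since the question also asks to keep the labels with the sampled images, here is a minimal sketch of one way to do that. It assumes "id" holds bare file names such as '1.png' and that train_dir is a character path to the image folder. The toy data frame, the "label" column, and the names train_sample and path are made up for illustration.

```r
# Toy stand-ins (assumptions): in practice train_df is read from the CSV
# and train_dir is the path to the real image folder.
train_dir <- "train_dir"
train_df <- data.frame(
  id    = paste0(1:20, ".png"),
  label = rep(c("a", "b"), 10),
  stringsAsFactors = TRUE
)

set.seed(2020)                    # makes the sample reproducible
perc <- 0.25
n <- nrow(train_df)
i <- sample(n, round(n * perc))   # row numbers of the sampled images

# Subset whole rows so file names and labels stay together
train_sample <- train_df[i, ]
train_sample$path <- file.path(train_dir, as.character(train_sample$id))

# Read only the files that exist on disk (needs the png package when any do)
found <- file.exists(train_sample$path)
png_files <- lapply(train_sample$path[found], function(p) {
  png::readPNG(p, native = TRUE)
})
```

Here train_sample has one row per sampled image, so its "label" column lines up with the entries of png_files as long as every sampled file was found.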
Rui Barradas
  • Thank you for your time but that's not exactly what I need. I just need to sample 1000 images to work with from the total 150,000 images in my 'train_dir' folder. – Horseman1901 May 02 '20 at 17:05
  • @Horseman1901 That's even simpler: instead of `perc <- 0.25`, set `m <- 1000` and use it where the code has `n*perc`. – Rui Barradas May 02 '20 at 17:40
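
For the fixed-count case raised in the comments, a rough sketch that samples file names directly from the folder and then looks up their labels might look like the following. The temporary folder, the empty placeholder files, the "label" column, and the names all_files, picked, and train_sample are assumptions made for the demo. With the real data, train_dir would point at the actual folder and m would be 1000.

```r
# Hypothetical stand-in for the real folder: a temp dir with empty .png files
train_dir <- file.path(tempdir(), "train_dir_demo")
dir.create(train_dir, showWarnings = FALSE)
file.create(file.path(train_dir, paste0(1:50, ".png")))
train_df <- data.frame(id = paste0(1:50, ".png"),
                       label = rep(c("a", "b"), 25))

m <- 10   # 1000 in the question; kept small for the toy folder

# List the PNG files in the folder and draw m of them without replacement
all_files <- list.files(train_dir, pattern = "\\.png$", full.names = TRUE)
picked <- sample(all_files, min(m, length(all_files)))  # min() guards m > n

# Look up the label row for each picked file by its file name
train_sample <- train_df[match(basename(picked), train_df$id), ]
```

Sampling the file names rather than reading all images first means only the chosen files ever need to be loaded.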