0

I am working with trajectory files with 10-minute longitude/latitude values. However, I only want to load the hourly values into R.

My current code loads all files into R:

beussel.trajectories <- lapply(beusselstr.trajectory_files, read.table,
                               sep = "", header = TRUE, skip = 1, fill = TRUE,
                               fileEncoding = "latin1")

beusselstr.trajectory_files is a folder with 329 files, and I have 10 other folders with the same number of files. The files contain 10-minute values going backwards in time over 4 days, so each file has 21318 rows.

I uploaded a file for one day here https://www.dropbox.com/s/tp2d9lr7xawuvr4/Beusselstr_001_020710.txt

Loading all of them takes incredibly long, and since my computer has rather little memory, R keeps giving me errors that it cannot allocate a vector of 125 kB.

Is this actually possible, or will I have to use something else, like Fortran, to prepare my files?

ulrich_k
  • You will have to either import everything and subset, or you can use for example `sqldf` and use SQL query to subset desired rows. – Roman Luštrik May 08 '14 at 07:42
  • What platform are you running? If linux-based, take a look at `system`. You can manipulate the data in a shell before bringing it into R. – Rich Scriven May 08 '14 at 07:51

3 Answers

2

If you only want to read every sixth row, you could skip the first six rows (header included) and read in only one row, then skip the first twelve rows and read in one row, and so on. You can do this with a nested `lapply`:

  beussel.trajectories <- lapply(beusselstr.trajectory_files, function(x) {
      do.call(rbind, lapply(seq(6, 21318, by = 6), function(y) {
          read.table(x, sep = "", header = FALSE, fileEncoding = "latin1",
                     skip = y, nrow = 1)
      }))
  })

You can add the column names later.
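
One way to attach them afterwards (a sketch, assuming the column names sit on the second line of each file, as in the original `skip = 1, header = TRUE` call, and that there is one name per column):

# read the header line once (line 2 of the first file) and reuse it for every data frame
hdr <- scan(beusselstr.trajectory_files[1], what = "character",
            skip = 1, nlines = 1, quiet = TRUE)
beussel.trajectories <- lapply(beussel.trajectories, setNames, hdr)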

Here is a reproducible example. It is slow, but if you first save the file on your computer it will run a lot faster.

df <- lapply('http://statland.org/R/RC/heartatk4R.txt', function(x) {
    do.call(rbind, lapply(seq(6, 600, by = 6), function(y) {
        read.table(x, header = FALSE, colClasses = c("character"), skip = y, nrows = 1)
    }))
})
head(df[[1]])

#  V1    V2 V3  V4 V5        V6   V7  V8
#1  6 41091  M 121  0 6378.6400 0009 084
#2 12 41091  M 121  0 3930.6300 0005 072
#3 18 41091  M 121  0 4989.7100 0005 056
#4 24 41071  M 121  0 6297.7200 0006 073
#5 30 41011  M 122  0 9862.9000 0002 038
#6 36 41041  F 121  0 2584.1000 0009 081
Jonas Tundo
  • I should have been more specific about what the files look like; I added some more info. The files are 21318 rows long and have 10-minute values over 4 days. I think the skip command would only work if I told it to skip lines 1-5, then 7-11, and so on. – ulrich_k May 08 '14 at 07:41
  • Check the help file `?read.table`: http://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html. It should be `nrows`; it was a typo, but `nrow` seems to work as well. – Jonas Tundo May 08 '14 at 08:19
  • Okay but wouldn't that mean that it only loads the first row of every file? Because that seems to have happened. The resulting list contains only the first row of every file. – ulrich_k May 08 '14 at 08:23
  • No, it skips a certain amount of rows and then only loads the desired row. – Jonas Tundo May 08 '14 at 08:29
  • Thanks a lot for your help! I think I will have to prepare the files with unix though. The process stops after a while giving me the error Error: cannot allocate vector of size 8.0 Mb – ulrich_k May 08 '14 at 08:49
  • 1
    JT85: `R` allows you to enter any shortened version of argument names so long as the entry is unambiguous. That's why `nr` or `nrow` is the same as `nrows` here. – Carl Witthoft May 08 '14 at 11:10
1

This question is somewhat of a duplicate; see here.

Anyway, a faster approach would be to read each file in its entirety with data.table::fread() and then subset the results.

library(plyr)
library(dplyr)
library(data.table)
library(microbenchmark)

# create a sample file
my_large_file <- 
    data_frame(
        var_every_six = rep(1:6, length.out = 250e3),
        num_var = rnorm(250e3),
        char_var = rep(letters, length.out = 250e3),
        char_var2 = rep(state.abb, length.out = 250e3)  
    )

# export to csv
write.csv(my_large_file, 'my_large_file.csv', row.names = F)

# compare reading and subsetting with fread()/data.table method with read.table() method
microbenchmark(
    {
        my_csv = fread('my_large_file.csv')
        my_csv_sub = my_csv[my_csv$var_every_six == 6, ]    
    }, 
    {   
        my_csv = read.csv('my_large_file.csv')
        my_csv_sub = my_csv[my_csv$var_every_six == 6, ]        
    }
)

# Unit: milliseconds
                                                                                                # expr
    # {     my_csv = fread("my_large_file.csv")     my_csv_sub = my_csv[my_csv$var_every_six == 6, ] }
 # {     my_csv = read.csv("my_large_file.csv")     my_csv_sub = my_csv[my_csv$var_every_six == 6, ] }
      # min       lq     mean   median       uq      max neval
 # 201.7967 210.1511 220.8584 216.2771 225.5729 291.8267   100
 # 574.0465 629.9268 655.1687 654.8691 682.1258 809.9981   100

The data.table approach is 3x faster, even with a super simple .csv. The performance gains from fread() typically increase with larger and more complicated files.
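
Applied to the trajectory files, the same fread-then-subset idea might look roughly like this (a sketch: the keep-every-sixth-row rule and the skip = 1 header position are taken from the question, so adjust them to the actual file layout):

library(data.table)

beussel.trajectories <- lapply(beusselstr.trajectory_files, function(f) {
    # read the whole file quickly, then keep every sixth row (the hourly values)
    dt <- fread(f, skip = 1, header = TRUE, fill = TRUE, encoding = "Latin-1")
    dt[seq(6, nrow(dt), by = 6), ]
})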

Taylor White
0

Maybe preprocess the files first using a suitable tool. For example, if using Linux:

awk 'NR%6==2' /input/file  

> write(1:100, 'test.txt', sep = '\n')
> system("awk 'NR%6==2' test.txt > test1.txt")
> readLines('test1.txt')
 [1] "2"  "8"  "14" "20" "26" "32" "38" "44" "50" "56" "62" "68" "74" "80" "86"
[16] "92" "98"
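
If you would rather not write intermediate files, the same filter can be run through a pipe so that R only ever sees every sixth line (a sketch; adjust the NR%6 offset to whichever rows are the hourly ones, and note that the awk output no longer contains a header):

beussel.trajectories <- lapply(beusselstr.trajectory_files, function(f) {
    # stream every sixth line of the file straight into read.table
    read.table(pipe(paste("awk 'NR%6==2'", shQuote(f))),
               sep = "", header = FALSE, fill = TRUE)
})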
jdharrison