I have around 15 GB of zipped data in 30-minute packages. Unzipping and reading them with either unzip and readr or fread works just fine, but the RAM requirements don't allow me to read in as many files as I wish, so I've tried the disk.frame package. In principle this also works fine, but I noticed that around line 4000 of every file read in, the columns get jumbled.

Here is the code I use:

library(stringr)
library(tidyverse)
library(data.table)
library(compare)
library(disk.frame)

setup_disk.frame(future_backend=future::sequential) # tried to set sequential to avoid the problem
options(future.globals.maxSize = Inf)
   
dat_list<-list.files(pattern = ".*1936.data")

name_vec<-name_vec[c(1:6, 9:48, 51, 53:63)] # columns to drop

# the dat variable works perfectly but is memory constrained

#dat<-dat_list %>% map_df(fread, skip=7, drop=name_vec,
#                         data.table=getOption("datatable.fread.datatable", T))
#Date_Time<-paste(dat$Date, str_sub(dat$Time, 1,8))
#dat<-dat[,-c(1:2)]
#dat<-cbind.data.frame(date=lubridate::ymd_hms(Date_Time, tz="UTC"), dat)

# disk frame trial

outdir="D:/***/Test"
test<-csv_to_disk.frame(infile=dat_list, outdir = outdir, 
                        skip=7, .progress=T, drop = name_vec, header=T, overwrite = T,
                        inmapfn = function(chunk){
                          chunk[, Date := lubridate::ymd_hms(paste(Date, str_sub(Time, 1,8)))] 
                          # trying to create a combined date_time variable from date analogue to the Date_Time variable above
                        }, 
                        data.table=getOption("datatable.fread.datatable", T))

test<-data.frame(test[,-2])
#dat<-data.frame(dat)
#compare(dat, test)

After reading the data, the data frame looks like this for dat:

structure(list(date = structure(c(1554727203, 1554727203, 1554727203, 
1554727203, 1554727203, 1554727203, 1554727203, 1554727204, 1554727204, 
1554727204, 1554727204, 1554727204, 1554727204, 1554727204, 1554727204, 
1554727204, 1554727204, 1554727205, 1554727205, 1554727205, 1554727205
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), `U (m/s)` = c(-3.59775, 
-3.89427, -3.78592, -3.93987, -4.14395, -4.22348, -4.27332, -4.34219, 
-4.46859, -4.71244, -4.39688, -4.39266, -4.04464, -4.23887, -4.43878, 
-4.46269, -4.55271, -4.45263, -4.50232, -4.35592, -4.07062), 
    `V (m/s)` = c(-1.49433, -1.79746, -1.69747, -1.41175, -1.80788, 
    -1.84414, -1.67488, -1.48056, -1.49211, -1.51781, -1.80034, 
    -1.86993, -1.82314, -1.54926, -1.37781, -1.51184, -1.41061, 
    -1.43523, -0.683048, -0.559152, -0.420025), `T (C)` = c(21.1527, 
    21.214, 21.195, 21.1651, 21.1972, 21.0915, 20.7849, 20.3886, 
    20.4152, 20.8369, 20.9407, 21.1197, 21.033, 20.7123, 20.8921, 
    21.0232, 21.1044, 21.157, 21.1208, 21.1468, 21.1597)), row.names = 3980:4000, class = "data.frame")

And like this for test:

structure(list(Date = structure(c(1554727203, 1554727203, 1554727203, 
1554727203, 1554727203, 1554727203, 1554727203, 1554727204, 1554728400, 
1554728400, 1554728400, 1554728400, 1554728400, 1554728400, 1554728400, 
1554728400, 1554728400, 1554728400, 1554728401, 1554728401, 1554728401
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), U..m.s. = c(-3.59775, 
-3.89427, -3.78592, -3.93987, -4.14395, -4.22348, -4.27332, -4.34219, 
-3.72044, -2.68918, -3.11362, -3.84935, -3.80292, -3.54106, -3.77755, 
-3.2498, -3.14659, -2.9482, -2.90917, -2.70361, -2.5597), V..m.s. = c(-1.49433, 
-1.79746, -1.69747, -1.41175, -1.80788, -1.84414, -1.67488, -1.48056, 
0.779225, 0.753698, 1.43587, 0.452789, 0.228636, -1.49971, -0.840048, 
-0.723638, -0.49741, -0.27166, -0.118487, -0.0760538, -0.107277
), T..C. = c(21.1527, 21.214, 21.195, 21.1651, 21.1972, 21.0915, 
20.7849, 20.3886, 21.8011, 21.7274, 21.7481, 21.7349, 21.7759, 
21.7998, 21.5799, 21.5692, 21.5885, 21.5234, 21.4854, 21.4857, 
21.5471)), row.names = 3980:4000, class = "data.frame")

Those are rows 3980:4000 in both cases, and as you can see if you test it, they diverge from row 3988 onward. Those are still values from the same original 30-minute package. I have no real idea why this happens. I thought it might be the "workers" or the "Time" column, but changing either didn't seem to do much. Any help would be immensely appreciated.
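
To pin down exactly where two results start to differ, a small check like the following can help. It is a minimal sketch on toy vectors: the helper `first_mismatch` and the vectors `a` and `b` are made up for illustration and only stand in for one numeric column of `dat` and `test`.

```r
# Hypothetical helper: index of the first position where two
# equal-length numeric vectors differ, or NA if they never do.
first_mismatch <- function(x, y, tol = 1e-8) {
  stopifnot(length(x) == length(y))
  idx <- which(abs(x - y) > tol)
  if (length(idx) == 0L) NA_integer_ else idx[1L]
}

# Toy data standing in for one column of `dat` and `test`
a <- c(-3.59775, -3.89427, -3.78592, -4.46859, -4.71244)
b <- c(-3.59775, -3.89427, -3.78592, -3.72044, -2.68918)

first_mismatch(a, b)
```

The index is relative to the vectors passed in, so when only a slice of the data is compared, the row number of the slice's first element has to be added back.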

PS: Windows 10, 8 GB RAM, R 4.0.2

D.J

1 Answer

I can't see anything wrong with the code. One caveat: you can't assume that the rows of a disk.frame are in the same order as in the input.

Are you able to add a unique id to each row? Then you can compare by id.
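
As a minimal sketch of that idea, assuming each row can carry a key that is unique across all files (here a made-up `id` column; the table names are illustrative too), both results can be sorted by the key before comparing, so that row order no longer matters. The example uses only data.table on toy data, not disk.frame itself.

```r
library(data.table)

# Toy reference table, standing in for the in-memory `dat`
ref <- data.table(id = 1:6, u = c(-3.6, -3.9, -3.8, -3.7, -2.7, -3.1))

# Same rows in a different order, standing in for the collected disk.frame
shuffled <- ref[c(4, 5, 6, 1, 2, 3)]

# A positional comparison reports a difference although the content is identical
same_by_position <- isTRUE(all.equal(ref, shuffled))

# Sort copies of both tables by the key, then compare again
same_by_id <- isTRUE(all.equal(setorder(copy(ref), id),
                               setorder(copy(shuffled), id)))
```

If the comparison by id succeeds while the positional one fails, the data are intact and only the row order differs.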

xiaodai
  • why can i not assume the same order of rows? – D.J Oct 03 '20 at 13:17
  • I suspect it's the skip 7 doing something wrong. hmmm. – xiaodai Oct 03 '20 at 13:19
  • if you collect(df) is the nrow the same as you expected? – xiaodai Oct 03 '20 at 13:19
  • yes. it has the same dimensions and basically everything is the same as with `dat` where everything IS correct. the skip=7 only skips some metadata and shouldn't be a problem (with data-table it isn't at least) – D.J Oct 03 '20 at 13:21
  • Concerning the unique id: the date_time variable should work as one, although only if milliseconds are included, which makes the use of lubridate impossible (for me at least). – D.J Oct 03 '20 at 13:23
  • disk.frame cannot guarantee to read files in the same order as input due to parallel processing. – xiaodai Oct 03 '20 at 13:23
  • 1) I'm sorry to turn this into a discussion in the comments! 2) I thought so; that is why I set `setup_disk.frame(future_backend=future::sequential)`. I assume this doesn't work as I expect it to. – D.J Oct 03 '20 at 13:24
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/222448/discussion-between-xiaodai-and-d-j). – xiaodai Oct 03 '20 at 13:25
  • not sure why your code snippet works and mine doesn't as they look basically the same but it does work so... thank you very much! – D.J Oct 03 '20 at 14:08
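
On the unique-id point raised in the comments: if the `Time` strings carry fractional seconds, truncating them to eight characters makes several rows share one timestamp. A minimal sketch, assuming a layout like `HH:MM:SS.fff` (the sample values below are invented, not taken from the real files), shows that base R's `%OS` conversion keeps the fraction on input, so the combined date-time could serve as a key:

```r
# Invented sample values; the real files may use a different layout
date <- rep("2019-04-08", 4)
time <- c("12:40:03.000", "12:40:03.250", "12:40:03.500", "12:40:03.750")

# Truncating to whole seconds, as str_sub(Time, 1, 8) does, collapses the rows
truncated <- as.POSIXct(paste(date, substr(time, 1, 8)),
                        format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# %OS reads the seconds including their fractional part
full <- as.POSIXct(paste(date, time),
                   format = "%Y-%m-%d %H:%M:%OS", tz = "UTC")

# Compare the underlying numeric values to check for repeated keys
has_dupes_truncated <- anyDuplicated(as.numeric(truncated)) > 0
has_dupes_full      <- anyDuplicated(as.numeric(full)) > 0
```

Note that R prints POSIXct values without the fraction by default; setting `options(digits.secs = 3)` makes it visible.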