I have around 15 GB of zipped data in 30-minute packages. Unzipping and reading them with either unzip and readr or fread works just fine, but the RAM requirements don't allow me to read in as many files as I would like, so I tried the disk.frame package. In principle this also works fine, but I noticed that around line 4000 of every file that is read in, the columns get jumbled.
Here is the code I use:
library(stringr)
library(tidyverse)
library(data.table)
library(compare)
library(disk.frame)
setup_disk.frame(future_backend=future::sequential) # tried to set sequential to avoid the problem
options(future.globals.maxSize = Inf)
dat_list<-list.files(pattern = ".*1936.data")
name_vec<-name_vec[c(1:6, 9:48, 51, 53:63)] # columns to drop
# the dat variable works perfectly but is memory constrained
#dat<-dat_list %>% map_df(fread, skip=7, drop=name_vec, data.table=getOption("datatable.fread.datatable", T))
#Date_Time<-paste(dat$Date, str_sub(dat$Time, 1,8))
#dat<-dat[,-c(1:2)]
#dat<-cbind.data.frame(date=lubridate::ymd_hms(Date_Time, tz="UTC"), dat)
# disk frame trial
outdir="D:/***/Test"
test<-csv_to_disk.frame(infile=dat_list, outdir = outdir,
                        skip=7, .progress=TRUE, drop = name_vec, header=TRUE, overwrite = TRUE,
                        inmapfn = function(chunk){
                          # trying to create a combined date_time variable from date analogue to the Date_Time variable above
                          chunk[, Date := lubridate::ymd_hms(paste(Date, str_sub(Time, 1,8)))]
                        },
                        data.table=getOption("datatable.fread.datatable", TRUE))
test<-data.frame(test[,-2])
#dat<-data.frame(dat)
#compare(dat, test)
After reading the data, the data frame looks like this for dat:
structure(list(date = structure(c(1554727203, 1554727203, 1554727203,
1554727203, 1554727203, 1554727203, 1554727203, 1554727204, 1554727204,
1554727204, 1554727204, 1554727204, 1554727204, 1554727204, 1554727204,
1554727204, 1554727204, 1554727205, 1554727205, 1554727205, 1554727205
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), `U (m/s)` = c(-3.59775,
-3.89427, -3.78592, -3.93987, -4.14395, -4.22348, -4.27332, -4.34219,
-4.46859, -4.71244, -4.39688, -4.39266, -4.04464, -4.23887, -4.43878,
-4.46269, -4.55271, -4.45263, -4.50232, -4.35592, -4.07062),
`V (m/s)` = c(-1.49433, -1.79746, -1.69747, -1.41175, -1.80788,
-1.84414, -1.67488, -1.48056, -1.49211, -1.51781, -1.80034,
-1.86993, -1.82314, -1.54926, -1.37781, -1.51184, -1.41061,
-1.43523, -0.683048, -0.559152, -0.420025), `T (C)` = c(21.1527,
21.214, 21.195, 21.1651, 21.1972, 21.0915, 20.7849, 20.3886,
20.4152, 20.8369, 20.9407, 21.1197, 21.033, 20.7123, 20.8921,
21.0232, 21.1044, 21.157, 21.1208, 21.1468, 21.1597)), row.names = 3980:4000, class = "data.frame")
And like this for test:
structure(list(Date = structure(c(1554727203, 1554727203, 1554727203,
1554727203, 1554727203, 1554727203, 1554727203, 1554727204, 1554728400,
1554728400, 1554728400, 1554728400, 1554728400, 1554728400, 1554728400,
1554728400, 1554728400, 1554728400, 1554728401, 1554728401, 1554728401
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), U..m.s. = c(-3.59775,
-3.89427, -3.78592, -3.93987, -4.14395, -4.22348, -4.27332, -4.34219,
-3.72044, -2.68918, -3.11362, -3.84935, -3.80292, -3.54106, -3.77755,
-3.2498, -3.14659, -2.9482, -2.90917, -2.70361, -2.5597), V..m.s. = c(-1.49433,
-1.79746, -1.69747, -1.41175, -1.80788, -1.84414, -1.67488, -1.48056,
0.779225, 0.753698, 1.43587, 0.452789, 0.228636, -1.49971, -0.840048,
-0.723638, -0.49741, -0.27166, -0.118487, -0.0760538, -0.107277
), T..C. = c(21.1527, 21.214, 21.195, 21.1651, 21.1972, 21.0915,
20.7849, 20.3886, 21.8011, 21.7274, 21.7481, 21.7349, 21.7759,
21.7998, 21.5799, 21.5692, 21.5885, 21.5234, 21.4854, 21.4857,
21.5471)), row.names = 3980:4000, class = "data.frame")
Those are rows 3980:4000 of each data frame and, as you can see if you test it, they diverge from row 3988 onward. The diverging values still come from the same original 30-minute package. I have no real idea why this happens. I thought it might be the "workers" or the "Time" column, but changing either didn't seem to make a difference. Any help would be immensely appreciated.
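To pin down exactly where the two results first disagree, something along the following lines can be used. This is only a minimal sketch in base R: the helper `first_divergence` and the toy data frames `a` and `b` are hypothetical stand-ins for `dat` and `test`, not part of the code above. It compares columns by position, because the column names differ between the two objects, and it assumes both data frames have the same dimensions and the same row order.

```r
# Hypothetical helper: returns the position of the first row at which two
# data frames of identical dimensions differ, or NA if they never differ.
first_divergence <- function(x, y) {
  stopifnot(nrow(x) == nrow(y), ncol(x) == ncol(y))
  # Compare column by column (by position, not by name) and flag every row
  # in which at least one column differs.
  differs <- Reduce(`|`, Map(function(cx, cy) cx != cy, x, y))
  which(differs)[1]
}

# Toy stand-ins for dat and test (values are made up for illustration).
a <- data.frame(date = as.POSIXct(c(1, 2, 3, 4), origin = "1970-01-01", tz = "UTC"),
                u    = c(0.1, 0.2, 0.3, 0.4))
b <- data.frame(Date = as.POSIXct(c(1, 2, 9, 10), origin = "1970-01-01", tz = "UTC"),
                U    = c(0.1, 0.2, 0.7, 0.8))

idx <- first_divergence(a, b)
print(idx)
```

The result is a position within the compared rows, so when only a subset such as rows 3980:4000 is compared, the first row number of the subset has to be added back. For the two excerpts shown above, the first eight rows agree in every column and the ninth differs, which corresponds to row 3988.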
PS: Windows 10, 8 GB RAM, R 4.0.2