I have a large data file (11 million observations) and there are columns for ID, year, month, time period (and variables, like speed that I'm interested in). I'd like to perform calculations for each of these and summarize the results in a new CSV, such that I'd results and formatted with rows for each unique ID/year/month/hour.
I was able to accomplish this with a series of nested loops, which worked fine when the file was smaller (a few thousands observations). I've been trying to find a better method with apply functions, but can't get the same structure. I'm using groupby to create a few new columns before loops, which runs quickly, but doesn't give me a summary output csv.
results = NULL
data.calc = NULL
tmp = NULL
PERIOD = 5:9
YEAR = 2014:2017
LINK = 1:5
MONTH = 1:12
for(link in LINK,
for (year in YEAR){
for (month in MONTH){
for (period in PERIOD){
data.calc = filter(data,
LinkID_Int == link,
Year==year,
MONTH==month,
Period==period
)
#Speed
spd.5 = quantile(data.calc$speed, 0.05)
spd.20 = quantile(data.calc$speed, 0.20)
spd.50 = quantile(data.calc$speed, 0.50)
spd.85 = quantile(data.calc$speed, 0.85)
spd.SD = sd(data.calc$speed)
tmp = tibble(link,
year,
month,
period,
spd.5, spd.20, spd.50, spd.85,
spd.SD,
)
results = rbind(results, tmp)
}
}
}
}
write.csv(results, file="C:/Users/...", row.names = FALSE)
This code works, but runs for hours with few results. I like the logic of for loops, meaning it's easy for me to read and understand what's happening, but I've seen plenty of posts that there are faster ways to go about this. I have about 30 actual calculations running in the loops, across several different variables.
I really appreciate any guidance on this.