
I have a large data file (11 million observations) with columns for ID, year, month, and time period (plus variables I'm interested in, like speed). I'd like to perform calculations for each of these groups and summarize the results in a new CSV, formatted with a row for each unique ID/year/month/hour combination.

I was able to accomplish this with a series of nested loops, which worked fine when the file was smaller (a few thousand observations). I've been trying to find a better method with apply functions, but can't get the same output structure. I'm using group_by to create a few new columns before the loops, which runs quickly, but it doesn't give me a summary output CSV.

results = NULL
data.calc = NULL
tmp = NULL
PERIOD = 5:9
YEAR = 2014:2017
LINK = 1:5
MONTH = 1:12

for (link in LINK){
  for (year in YEAR){
    for (month in MONTH){
      for (period in PERIOD){

      data.calc = filter(data, 
        LinkID_Int == link,
        Year==year, 
        MONTH==month,
        Period==period
        )

      #Speed
      spd.5 = quantile(data.calc$speed, 0.05)
      spd.20 = quantile(data.calc$speed, 0.20)
      spd.50 = quantile(data.calc$speed, 0.50)
      spd.85 = quantile(data.calc$speed, 0.85)
      spd.SD = sd(data.calc$speed)

      tmp = tibble(link,
                   year,
                   month,
                   period,
                   spd.5, spd.20, spd.50, spd.85,
                   spd.SD)

      results = rbind(results, tmp)

    }
  }
}
}

write.csv(results, file="C:/Users/...", row.names = FALSE)

This code works, but it runs for hours with few results. I like the logic of for loops, meaning it's easy for me to read and understand what's happening, but I've seen plenty of posts saying there are faster ways to go about this. I have about 30 actual calculations running in the loops, across several different variables.

I really appreciate any guidance on this.

Cody

2 Answers


A lot of your slow-down, I think, is because you repeatedly filter your data (which is time-consuming with 11M rows). Since you're already using dplyr (for dplyr::filter), I suggest a "tidy" way of doing this. Since we don't have your data, I'll demonstrate with mtcars:

library(dplyr)
mtcars %>%
  group_by(gear, vs, am) %>%
  summarize_at(vars(disp), .funs = list(~n(), ~mean(.), ~sd(.), q50 = ~quantile(.,0.5)))
# # A tibble: 7 x 7
# # Groups:   gear, vs [6]
#    gear    vs    am     n  mean    sd   q50
#   <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
# 1     3     0     0    12 358.   71.8 355  
# 2     3     1     0     3 201.   72.0 225  
# 3     4     0     1     2 160     0   160  
# 4     4     1     0     4 156.   14.0 157. 
# 5     4     1     1     6  88.9  20.4  78.8
# 6     5     0     1     4 229.  114.  223  
# 7     5     1     1     1  95.1 NaN    95.1

You can see how some columns are automatically named after the function, and one (q50) whose name I overrode. This is "just another frame" that can be exported (e.g., to CSV).
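For example, you could assign the result to a variable and then save it with write.csv exactly as in the question (the file name here is just a placeholder):

out <- mtcars %>%
  group_by(gear, vs, am) %>%
  summarize_at(vars(disp), .funs = list(~n(), ~mean(.), ~sd(.), q50 = ~quantile(., 0.5)))

write.csv(out, file = "summary.csv", row.names = FALSE)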

If you have more than one variable over which you want summary statistics, just include them in your call to vars, and the column names break out appropriately:

mtcars %>%
  group_by(gear, vs, am) %>%
  summarize_at(vars(mpg, disp), .funs = list(~n(), ~mean(.), ~sd(.), q50 = ~quantile(.,0.5)))
# # A tibble: 7 x 11
# # Groups:   gear, vs [6]
#    gear    vs    am mpg_n disp_n mpg_mean disp_mean mpg_sd disp_sd mpg_q50 disp_q50
#   <dbl> <dbl> <dbl> <int>  <int>    <dbl>     <dbl>  <dbl>   <dbl>   <dbl>    <dbl>
# 1     3     0     0    12     12     15.0     358.    2.77    71.8    15.2    355  
# 2     3     1     0     3      3     20.3     201.    1.93    72.0    21.4    225  
# 3     4     0     1     2      2     21       160     0        0      21      160  
# 4     4     1     0     4      4     21.0     156.    3.07    14.0    21      157. 
# 5     4     1     1     6      6     28.0      88.9   5.12    20.4    28.8     78.8
# 6     5     0     1     4      4     19.1     229.    5.02   114.     17.8    223  
# 7     5     1     1     1      1     30.4      95.1 NaN      NaN      30.4     95.1
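
As an aside (not part of the original answer): in dplyr 1.0 and later, summarize_at is superseded by across(). A roughly equivalent call, with the same grouping and statistics (the column order differs slightly), would be:

mtcars %>%
  group_by(gear, vs, am) %>%
  summarize(across(c(mpg, disp),
                   list(n = ~n(), mean = mean, sd = sd, q50 = ~quantile(.x, 0.5))),
            .groups = "drop")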

One more "BTW": iteratively building results using rbind(results, tmp) works fine for a few iterations, but it gets really slow. Because: each time you rbind, it makes a complete copy of all of the data in both. If results is 1M rows before the call to rbind, then while the row-binding is going on, you have (at least) 2M rows (1M rows, two copies) in memory at a time. While doing this once or twice is generally not a problem, you can see how doing this hundreds or thousands of times (depending on the number of factors you have) can be problematic.

Better practices include:

  • pre-allocating your output list with something like:

    out <- vector("list", prod(length(LINK), length(YEAR), length(MONTH), length(PERIOD))
    ind <- 0L
    for (...) {
      for (...) {
        for (...) {
          for (...) {
            tmp <- (do-stuff-here)
            ind <- ind + 1L
            out[[ind]] <- tmp
          }
        }
      }
    }
    out <- do.call(rbind, out)
    
  • do the work within lapply and assign the output to out; it takes a little more thought to collapse four nested for loops into a single lapply, but a sketch follows this list
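
A sketch of that idea (mine, not the original answer's), assuming the column names from the question (LinkID_Int, Year, MONTH, Period, speed) and that dplyr is loaded: build every link/year/month/period combination with expand.grid and iterate over its rows.

combos <- expand.grid(link = LINK, year = YEAR, month = MONTH, period = PERIOD)
out <- lapply(seq_len(nrow(combos)), function(i) {
  data.calc <- filter(data,
                      LinkID_Int == combos$link[i],
                      Year   == combos$year[i],
                      MONTH  == combos$month[i],
                      Period == combos$period[i])
  # (do-stuff-here); return one row per combination, e.g. just the median speed
  tibble(link   = combos$link[i],
         year   = combos$year[i],
         month  = combos$month[i],
         period = combos$period[i],
         spd.50 = quantile(data.calc$speed, 0.50))
})
results <- do.call(rbind, out)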

I still argue that nested for loops with filtering on every pass is a bad place to start. Even if you remove the inefficiency of copying data on each iterative rbind, you still have the unnecessary overhead of repeated filtering.

But if you must, then consider filtering once per loop level:

out <- vector("list", prod(...)) # as above
ind <- 0L
for (lk in LINK) {
  dat_l <- mydat[mydat$LinkID_Int == lk, , drop = FALSE]
  for (yr in YEAR) {
    dat_y <- dat_l[dat_l$Year == yr, , drop = FALSE]
    for (mh in MONTH) {
      dat_m <- dat_y[dat_y$Month == mh, , drop = FALSE]
      for (pd in PERIOD) {
        data.calc <- dat_m[dat_m$Period == pd, , drop = FALSE]
        tmp <- {do-stuff-here}
        ind <- ind + 1L
        out[[ ind ]] <- tmp
      }
    }
  }
}

In this case, at least each inner-loop is filtering on much less data. This is still inefficient, but is slightly less so.

(I still think the dplyr solution above is more readable, likely much faster, more maintainable, and more extensible.)

r2evans
    The base alternative is: ```aggregate(speed ~ period + year + link + month, data = DF, FUN = function(x) c(mn = mean(x),n = length(x)))```. I like your answer. – Cole Aug 03 '19 at 00:21
  • Thank you - I am using dplyr and tried to learn summarize as you suggested, but could you provide some more information about how I can export the summary to csv? And yes, I have multiple variables with different calculations to perform across the filtered data. I had also been running some new calculations inside the loops (like count rows of each filtered segment) but noticed I can pull outside loop, groupby (and mutate onto the larger data set). Still learning. – Cody Aug 03 '19 at 12:17
  • Export to CSV? What's wrong with your use of `write.csv`? – r2evans Aug 03 '19 at 20:45
  • When I use summarize, what am I writing to csv? It's probably a basic question, I just don't know how to see the summary content in csv, I'm only familiar with rbind method, and couldn't find a simple answer. – Cody Aug 03 '19 at 21:27
  • Nothing about `summarize` is doing anything with files. What I didn't do in this example is *assign the result to a new variable* (e.g., `out <- mtcars %>% group_by(...) %>% summarize_at(...)`). Once you've done that, just save it separately with `write.csv`. – r2evans Aug 03 '19 at 23:39
  • Asking another way: is there anything about `write.csv(results, file="C:/Users/...", row.names = FALSE)` (in your question) that fails or does not do what you expect? – r2evans Aug 03 '19 at 23:50
  • Ok, I think I understand now. The summarise_at works really quickly, and seems to write to the csv if I follow the data like `data %>% group_by(LinkID_Int, Year, Month, Period) %>% summarise_at(vars(Volume,speed), .funs = list(~n(), ~mean(.), q50 = ~quantile(.,0.5), q90 = ~quantile(.,0.9), q10 = ~quantile(.,0.1)) )%>% write.csv(., file="C:/Users...)` – Cody Aug 05 '19 at 02:50
  • Yes, you can do it like that: saving in the `%>%` pipeline. That is fine as long as you don't intend to look at or work with the data anymore. If you want to verify the data before or after saving to a file, though, you should probably save it to a temporary variable, even if you look at it once and then `rm` the var. But that's completely dependent on your immediate workflow needs. Glad it makes sense! – r2evans Aug 05 '19 at 14:42

Always avoid running rbind in a loop, as it leads to excessive copying in memory. See Circle 2, "Growing Objects", of Patrick Burns' The R Inferno.

Since you require inline grouped aggregation, consider base R's ave, which returns a vector of the same length as its input and so can be assigned to new columns:

results <- transform(data, 
      spd.5 = ave(speed, LinkID_Int, Year, MONTH, Period, FUN=function(x) quantile(x, 0.05)),
      spd.20 = ave(speed, LinkID_Int, Year, MONTH, Period, FUN=function(x) quantile(x, 0.2)),
      spd.50 = ave(speed, LinkID_Int, Year, MONTH, Period, FUN=function(x) quantile(x, 0.5)),
      spd.85 = ave(speed, LinkID_Int, Year, MONTH, Period, FUN=function(x) quantile(x, 0.85)),
      spd.SD = ave(speed, LinkID_Int, Year, MONTH, Period, FUN=sd)
)
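
Note that transform()/ave() keeps all 11 million rows, repeating the group statistics on every row. If you want one row per group from this route, one option (a sketch of mine, assuming the column names above) is to de-duplicate the key and summary columns:

summary_rows <- unique(results[c("LinkID_Int", "Year", "MONTH", "Period",
                                 "spd.5", "spd.20", "spd.50", "spd.85", "spd.SD")])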

For a full grouped aggregation of your data, consider base R's aggregate:

agg_raw <- aggregate(speed ~ LinkID_Int + Year + MONTH + Period, data = data,
                     FUN = function(x) c(spd.5 = unname(quantile(x, 0.05)),
                                         spd.20 = unname(quantile(x, 0.2)),
                                         spd.50 = unname(quantile(x, 0.5)),
                                         spd.85 = unname(quantile(x, 0.85)),
                                         spd.SD = sd(x))
       )

# aggregate() returns the multi-statistic results as a matrix column named "speed";
# do.call(data.frame, ...) flattens it into regular columns (speed.spd.5, speed.spd.20, ...)
results <- do.call(data.frame, agg_raw)
colnames(results) <- gsub("speed.", "", colnames(results), fixed = TRUE)
Parfait
  • Thank you - I copied this in and then added the same write csv command, but ended up with all 11 million data points again. Am I missing a step to see a summary set of data (percentiles by each period, of each month, of each year, of each link)? I had run group_by and mutate to tack calculations onto the larger data, but got stuck on how to summarize at the end. – Cody Aug 03 '19 at 12:24
  • See update with extension for aggregation. Your posted code really only added quantile columns by group(s). – Parfait Aug 03 '19 at 13:54
  • +1 I particularly like the reference to the R Inferno ... I have read it in the past and keep forgetting to reference it in times like this. – r2evans Aug 04 '19 at 00:50
  • Indeed. I actually wanted to reference a StackOverflow answer, but none in R on this issue is available. Python pandas does have good answers pointing out the quadratic copying of running `append`/`concat` (the counterpart to `rbind`) in a loop: [@unutbu](https://stackoverflow.com/a/36489724/1422451) and [@Alexander](https://stackoverflow.com/a/37009561/1422451) – Parfait Aug 04 '19 at 01:03