
I have two questions. First, what resources do you recommend reading to improve data-manipulation skills? I've been working with larger datasets and have been struggling to adapt; I feel like I'm hitting a brick wall and don't know where to look (many online resources get too complicated without first building a foundation).

Second, here is the specific issue I'm trying to solve. I have a data frame with millions of rows that I'm trying to simplify so I can analyze a trend; a dput example is below. I want to isolate each ID and grab the minimum Value for each year (some IDs have years that aren't available for others). After simplifying the data, I want to add a percent-change column. Since this is a 20+ year time series, I'm fine with ignoring months at this point: the minimum value for one year compared to the minimum for another year should yield a reasonable percent change.

Thanks!

Input:

structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L), .Label = c("a", "b"), class = "factor"), Date = structure(c(1L, 
2L, 3L, 4L, 5L, 6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 5L, 
6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 15L, 16L), .Label = c("2/21/2009", 
"2/22/2009", "2/23/2009", "2/24/2009", "2/25/2009", "2/26/2009", 
"3/2/2011", "3/3/2011", "3/4/2011", "3/5/2010", "3/5/2011", "3/6/2010", 
"3/6/2011", "3/7/2010", "3/7/2011", "3/8/2011"), class = "factor"), 
    Year = c(2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L, 
    2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2009L, 2009L, 
    2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L, 
    2011L), Value = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 
    20, 21, 22, 5, 6, 7, 8, 8, 9, 10, 11, 12, 15, 23, 25, 27)), .Names = c("ID", 
"Date", "Year", "Value"), class = "data.frame", row.names = c(NA, 
-26L))

Expected output:

structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a", 
"b"), class = "factor"), Date = structure(c(1L, 4L, 5L, 2L, 4L, 
3L), .Label = c("2/21/2009", "2/25/2009", "3/2/2011", "3/5/2010", 
"3/6/2011"), class = "factor"), Year = c(2009L, 2010L, 2011L, 
2009L, 2010L, 2011L), Value = c(10, 16, 5, 6, 8, 10), Percent.Increase = c(NA, 
0.6, -0.6875, NA, 0.333333333, 0.25)), .Names = c("ID", "Date", 
"Year", "Value", "Percent.Increase"), class = "data.frame", row.names = c(NA, 
-6L))
sammyramz
  • As far as what to read goes, the data.table vignettes are a good place to start: https://github.com/Rdatatable/data.table/wiki/Getting-started For guidance on how to think about organizing data, I'd recommend Hadley's article https://www.jstatsoft.org/article/view/v059i10 even though it does not use data.table syntax. – Frank Jul 10 '16 at 01:34

2 Answers


After grouping by 'ID' and 'Year', we slice out the row with the minimum 'Value' within each group; then, grouping by 'ID' alone, we create 'Percent.Increase' by subtracting the lag of 'Value' from 'Value' and dividing by the lag of 'Value'.

library(dplyr)

res <- df1 %>%
    group_by(ID, Year) %>%
    slice(which.min(Value)) %>%
    group_by(ID) %>%
    mutate(Percent.Increase = (Value - lag(Value)) / lag(Value))
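
Note that slice(which.min(Value)) keeps only the first row when several rows tie for the yearly minimum. If you want to keep all tied rows instead, a filter-based variant should work (a sketch, assuming the same df1 input as above):

```r
library(dplyr)

# Keep every row tied for the yearly minimum, not just the first one.
res_ties <- df1 %>%
    group_by(ID, Year) %>%
    filter(Value == min(Value)) %>%
    group_by(ID) %>%
    mutate(Percent.Increase = (Value - lag(Value)) / lag(Value))
```

On the sample data there are no ties, so both versions give the same result.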
akrun
  • It is insane how easy you make this. Thank you @akrun! Do you recommend a particular resource/method to learn dplyr? – sammyramz Jul 09 '16 at 19:03
  • @sammyramz I think the best way to understand would be to practice, make mistakes, learn from that, and of course read the official documentation. – akrun Jul 09 '16 at 19:06

Until a HAVING clause is implemented in data.table, this seems to be a pretty efficient way:

dt[dt[, .I[which.min(Value)],, .(ID, Year)]$V1
   ][, Percent_Increase := {
       tmp <- shift(Value)
       (Value-tmp)/tmp
   }, .(ID)]
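
For readability, the same result can usually be obtained with .SD instead of the .I subset trick, at some cost in speed (a sketch, assuming dt is the data.table version of the input):

```r
library(data.table)

# More readable, usually slower alternative: take the minimum-Value row
# per (ID, Year) group via .SD, then compute the percent change per ID.
r2 <- dt[, .SD[which.min(Value)], keyby = .(ID, Year)
   ][, Percent_Increase := (Value - shift(Value)) / shift(Value), by = ID]
```

The .I approach above avoids materializing .SD for every group, which is why it tends to win on large data.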

Timing check on 5e7 rows:

library(dplyr)
library(data.table)
N = 5e7
set.seed(1)
df = data.frame(ID = sample(2L, N, TRUE), 
                Date = sample(16L, N, TRUE), 
                Year = sample(2009:2011, N, TRUE), 
                Value = sample(N/10, N, TRUE))
dt = as.data.table(df)
system.time(
    res <- df %>%
        group_by(ID, Year) %>%
        slice(which.min(Value)) %>% 
        group_by(ID) %>%
        mutate(Percent_Increase = (Value-lag(Value))/lag(Value))    
)
#   user  system elapsed 
#  1.676   2.176   3.847
system.time(
    r <- dt[dt[, .I[which.min(Value)],, .(ID, Year)]$V1,
            ][, Percent_Increase := {
                tmp <- shift(Value)
                (Value-tmp)/tmp
            }, .(ID)]
)
#   user  system elapsed 
#  0.940   0.460   1.334
all.equal(r, as.data.table(res), ignore.col.order = TRUE, check.attributes = FALSE, ignore.row.order = TRUE)
#[1] TRUE
jangorecki