I have two questions: What resources do you recommend reading to improve data manipulation capabilities? I've been working with larger datasets and have been struggling to adapt--I feel like I'm hitting a brick wall and don't know where to look (many online resources get too complicated without building foundation).
For example, I am trying to solve this issue. I have a df with millions of rows and I am trying to simplify it and analyze a trend. I have a dput example. I am trying to isolate each ID and grab the minimum value for a given year. (Some IDs have years not available for others). After simplifying that data, I am trying to add a percent change column. Given this is a 20+ year time series, I am ok with ignoring months at this point, as minimum value for a year compared to a minimum to another year should yield a reasonable percent change.
Thanks!
Input:
structure(list(ID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), .Label = c("a", "b"), class = "factor"), Date = structure(c(1L,
2L, 3L, 4L, 5L, 6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 5L,
6L, 10L, 12L, 14L, 7L, 8L, 9L, 11L, 13L, 15L, 16L), .Label = c("2/21/2009",
"2/22/2009", "2/23/2009", "2/24/2009", "2/25/2009", "2/26/2009",
"3/2/2011", "3/3/2011", "3/4/2011", "3/5/2010", "3/5/2011", "3/6/2010",
"3/6/2011", "3/7/2010", "3/7/2011", "3/8/2011"), class = "factor"),
Year = c(2009L, 2009L, 2009L, 2009L, 2009L, 2009L, 2010L,
2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2009L, 2009L,
2010L, 2010L, 2010L, 2011L, 2011L, 2011L, 2011L, 2011L, 2011L,
2011L), Value = c(10, 11, 12, 13, 14, 15, 16, 17, 18, 19,
20, 21, 22, 5, 6, 7, 8, 8, 9, 10, 11, 12, 15, 23, 25, 27)), .Names = c("ID",
"Date", "Year", "Value"), class = "data.frame", row.names = c(NA,
-26L))
Expected output:
structure(list(ID = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("a",
"b"), class = "factor"), Date = structure(c(1L, 4L, 5L, 2L, 4L,
3L), .Label = c("2/21/2009", "2/25/2009", "3/2/2011", "3/5/2010",
"3/6/2011"), class = "factor"), Year = c(2009L, 2010L, 2011L,
2009L, 2010L, 2011L), Value = c(10, 16, 5, 6, 8, 10), Percent.Increase = c(NA,
0.6, -0.6875, NA, 0.333333333, 0.25)), .Names = c("ID", "Date",
"Year", "Value", "Percent.Increase"), class = "data.frame", row.names = c(NA,
-6L))