I have a dataset (MN_Census) that has information for all census tracts for the following years: 1990, 2000, 2010, and 2020. The variable ID that identifies the census tract is "GISJOIN". My dataset looks like this:
GISJOIN | GEOGYEAR | STATE | STATEA | COUNTY | COUNTYA | TRACTA | CL8AA | CL9AA | DATAYEAR |
---|---|---|---|---|---|---|---|---|---|
G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 1954.45 | 0 | 1990 |
G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 2045.99 | 0 | 1990 |
G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2361.32 | 0 | 2000 |
G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 2862.97 | 0 | 2000 |
G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2327 | 0 | 2010 |
G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 3262 | 0 | 2010 |
G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2244 | NA | 2020 |
G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 3120 | NA | 2020 |
I'm trying to use either join/merge/rbind to add foreach census tract, the years in between. This is, I want to add the following dataset foreach of the census tract:
YearTransaction |
---|
1990 |
1991 |
1992 |
1993 |
1994 |
1995 |
1996 |
1997 |
1998 |
1999 |
2000 |
2001 |
2002 |
2003 |
2004 |
So at the end, my desired dataset will have all years (from 1990 to 2020) foreach one of the census tract. I'm trying using "multidplyr
" to make a partition by census tract ID (which is represented by the variable "GISJOIN"), and join each one of these partitions with the Years dataset. I'm using the following code:
library("parallel")
library("modeest")
library("multidplyr")
library("doParallel")
library("dtplyr")
detectCores()
cluster <- new_cluster(4)
base <- MN_Census %>% group_by(GISJOIN) %>% partition(cluster)
system.time(final <- dplyr::left_join(Years,base, by=c("YearTransaction"="DATAYEAR"),copy=TRUE) %>% collect())
final <- final %>% dplyr::ungroup()
However, I'm not getting the desired results. What I'm getting is a dataaset with all years (which is good), but these years are not duplicated foreach census tract (GISJOIN):
The desired dataset should look like this:
DATAYEAR | GISJOIN | GEOGYEAR | STATE | STATEA | COUNTY | COUNTYA | TRACTA | CL8AA | CL9AA |
---|---|---|---|---|---|---|---|---|---|
1990 | G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 1954.45 | 0 |
1991 | G2700010770100 | ||||||||
… | G2700010770100 | ||||||||
2000 | G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2361.32 | 0 |
2001 | G2700010770100 | ||||||||
… | G2700010770100 | ||||||||
2010 | G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2327 | 0 |
2011 | G2700010770100 | ||||||||
… | G2700010770100 | ||||||||
2020 | G2700010770100 | 2010 | Minnesota | 27 | Aitkin County | 1 | 770100 | 2244 | NA |
1990 | G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 2045.99 | 0 |
1991 | G2700010790502 | ||||||||
… | G2700010790502 | ||||||||
2000 | G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 2862.97 | 0 |
2001 | G2700010790502 | ||||||||
… | G2700010790502 | ||||||||
2010 | G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 3262 | 0 |
2011 | G2700010790502 | ||||||||
… | G2700010790502 | ||||||||
2020 | G2700010790502 | 2010 | Minnesota | 27 | Aitkin County | 1 | 790502 | 3120 | NA |
How can I fix this using multidplyr or any kind of parallel processing? (My original dataset is huge). Any help using dplyr is really appreciated it!