
Friends, hello! I have a dataframe of some 2 million employees from one industry and the firms they have worked for from 2000 to 2019. It looks something like this:

| ID | FIRM | NAICS | Q1 | Q2 | Q3 | Q4 |
|----|------|-------|----|----|----|----|
| A  | 001  | 100   | 1  | 1  | 1  | 1  |
| B  | 002  | 200   | 1  | 1  | 0  | 0  |
| B  | 003  | 100   | 0  | 0  | 1  | 1  |

... where NAICS is the industry code of the company where someone worked in a given quarter. For example, Person A stayed in the same industry all four quarters; Person B moved from Industry 200 to Industry 100 in Q3 by moving firms.

The frame I would like to end up with looks like this: 

| ID | Q1  | Q2  | Q3  | Q4  |
|----|-----|-----|-----|-----|
| A  | 100 | 100 | 100 | 100 |
| B  | 200 | 200 | 100 | 100 |

... so that I can track which industries people are coming from, and then build something akin to this NYT animated Sankey diagram.

Could you help me figure out how to get from the first to the second? My hunch is that I need to use tidyr::pivot_longer() and tidyr::pivot_wider(), which I understand replace gather() and spread().
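
For what it's worth, here is my rough understanding of the old-to-new correspondence (just a sketch of the function calls, using df for the dataframe above, not a working solution):

library(tidyr)

# gather() and its pivot_longer() replacement do the same reshape:
long_old <- gather(df, key = "quarter", value = "active", Q1:Q4)
long_new <- pivot_longer(df, cols = Q1:Q4, names_to = "quarter", values_to = "active")

# spread() and its pivot_wider() replacement go the other way:
wide_old <- spread(long_old, key = quarter, value = active)
wide_new <- pivot_wider(long_new, names_from = quarter, values_from = active)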

2 Answers


There are basically three steps here:

  • Make it long so that each quarter is its own row
  • Drop the rows where the firm was "inactive" in that quarter
  • Make it wide, spreading the "quarter" variable into columns

library(dplyr)
library(tidyr)

df %>%
  # step 1: one row per person-firm-quarter
  pivot_longer(
    cols = Q1:Q4,
    names_to = "quarter",
    values_to = "active"
  ) %>%
  # step 2: keep only the quarters in which that firm was active
  filter(active == 1) %>%
  # drop the columns we no longer need before widening
  select(-FIRM, -active) %>%
  # step 3: spread the quarters back out, filled with the NAICS code
  pivot_wider(
    names_from = quarter,
    values_from = NAICS
  )

Output:

# A tibble: 2 x 5
  ID       Q1    Q2    Q3    Q4
  <fct> <int> <int> <int> <int>
1 A       100   100   100   100
2 B       200   200   100   100
Marius
  • So helpful, thanks! And @dc37 is right; this crashes out with a larger dataframe (`Error: vector memory exhausted (limit reached?)`). – Mark Clayton Hand Feb 27 '20 at 02:26
  • 1
    Ah OK, I don't know exactly how efficient `pivot_longer`/`wider` are with larger datasets. I would think that R should be able to deal with an 8 miilion row (2 million x 4) dataset, so maybe the pivot functions are using a lot of memory while carrying out the pivot. – Marius Feb 27 '20 at 02:30
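
If memory is the limiting factor, one possible workaround (a sketch, not from the answer above; it assumes dplyr >= 1.0 for across() and at most one active firm per person per quarter) is to skip the pivot round-trip and compute each quarter's NAICS directly with a grouped summarise:

library(dplyr)

df %>%
  group_by(ID) %>%
  # for each quarter, take the NAICS of whichever row is active;
  # the [1] yields NA if no firm was active that quarter
  summarise(across(Q1:Q4, ~ NAICS[.x == 1][1]))

This never materialises the 8-million-row long table, though grouping 2 million IDs has a cost of its own.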

As your dataframe contains IDs for 2 million employees, a solution using the melt and dcast functions from the data.table package may be faster for you:

library(data.table)
setDT(dt)

# melt the four quarter columns into one row per person-quarter
Col <- paste0("Q", 1:4)
dt.m <- melt(dt, measure.vars = Col, variable.name = "quarter", value.name = "value")

# keep only the active quarters, then drop the flag
dt.m <- dt.m[value == 1]
dt.m[, value := NULL]

# spread the quarters back out, filled with the NAICS code
dt.m <- dcast(dt.m, ID ~ quarter, value.var = "NAICS")
dt.m

   ID  Q1  Q2  Q3  Q4
1:  A 100 100 100 100
2:  B 200 200 100 100

Reproducible Data

dt <- data.frame(ID = c("A","B","B"),
                 NAICS = c(100,200,100),
                 Q1 = c(1,1,0),
                 Q2 = c(1,1,0),
                 Q3 = c(1,0,1),
                 Q4 = c(1,0,1))
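
As a sanity check on the intermediate step: with this reproducible data, the long table after the melt and filter (but before dropping `value`) should look roughly like this:

   ID NAICS quarter value
1:  A   100      Q1     1
2:  B   200      Q1     1
3:  A   100      Q2     1
4:  B   200      Q2     1
5:  A   100      Q3     1
6:  B   100      Q3     1
7:  A   100      Q4     1
8:  B   100      Q4     1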
dc37
  • 1
    It remove the column `value` as we don't need anymore. It is equivalent to `select(-active)` in @Marius's answer. – dc37 Feb 27 '20 at 03:25
  • Thanks @dc37! I thought we had this one, but it turns out I get the same overload problem here as with the dplyr option: `Error: vector memory exhausted (limit reached?)`. Suggestions welcome; regardless, I'm grateful for both your and @Marius's answers. – Mark Clayton Hand Feb 27 '20 at 03:57
  • 1
    ... hard to find a good suggestion. It will depend how big is your file... I'm not enough an expert on manipulating very heavy file. Sorry for that. Maybe if you can find a way to work on a cluster, it will makes things easier. But it is well beyond my knowledge. Good luck – dc37 Feb 27 '20 at 04:45
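
For anyone else who hits the vector-memory error, one possible workaround (an untested sketch, not from either answer) is to run the melt/filter/dcast over batches of IDs, so the long intermediate never holds all 2 million employees at once. The batch size here is an arbitrary guess to tune against available memory:

library(data.table)
setDT(dt)

batch_size <- 100000                      # hypothetical; tune to your machine
ids <- unique(dt$ID)
batches <- split(ids, ceiling(seq_along(ids) / batch_size))

result <- rbindlist(lapply(batches, function(b) {
  # melt, filter, and cast one batch of employees at a time
  chunk <- melt(dt[ID %in% b], measure.vars = paste0("Q", 1:4),
                variable.name = "quarter", value.name = "value")
  dcast(chunk[value == 1], ID ~ quarter, value.var = "NAICS")
}), use.names = TRUE, fill = TRUE)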