R - pivoting duplicate rows into multiple column with unknown number of columns

Question

I have a data frame like this:

Created with this:

companies = c("ABC Ltd", "ABC Ltd", "ABC Ltd", "Derwent plc", "Derwent plc")
sic = c("12345", "24155", "31231", "55346", "34234")

df = data.frame(companies, sic)

As you can see, the companies column is duplicated due to the SIC code.

I want to pivot wider so that each SIC code has its own column and that it is only 1 company per row.

Something like the following where I don't know how many columns there might be (i.e. there could be some companies with 20 sic codes).

I have tried pivoting it using pivot_wider but I cannot get it to do what I need it to do.

Any help is much appreciated.

`data.table` : `dcast(setDT(df), companies~rowid(companies), value.var = 'sic')` — Ronak Shah, Jul 14 '20 at 10:25

score 4 · Accepted Answer · answered Jul 14 '20 at 10:09

4

With the packages dplyr and tidyr you can use

library(dplyr)
library(tidyr)

df %>% 
  group_by(companies) %>% 
  mutate(row_n = row_number()) %>% 
  pivot_wider(companies, names_from = row_n, values_from = sic, names_glue = "sic.{row_n}")

Output

# A tibble: 2 x 4
# Groups:   companies [2]
#   companies   sic.1 sic.2 sic.3
#   <chr>       <chr> <chr> <chr>
# 1 ABC Ltd     12345 24155 31231
# 2 Derwent plc 55346 34234 NA

answered Jul 14 '20 at 10:09

Ric S

9,073
3
25
51

Perfect. Just what I needed. Thank you so much 'Ric S' for your quick response :)... that 'names_glue' is really nifty! – Nicholas Jul 14 '20 at 10:10
1

Yeah it is a nice functionality that helps you renaming the columns as you like without renaming them in an explicit step – Ric S Jul 14 '20 at 10:11

GKi · Answer 2 · 2020-07-14T10:12:14.957

2

You can split sic by companies, call [ with 1:max(lengths(x)) and rbind the result.

x <- split(df$sic, df$companies)
do.call(rbind, lapply(x, "[", 1:max(lengths(x))))
#            [,1]    [,2]    [,3]   
#ABC Ltd     "12345" "24155" "31231"
#Derwent plc "55346" "34234" NA

edited Jul 14 '20 at 10:12

answered Jul 14 '20 at 10:10

GKi

37,245
2
26
48

1

Thank you GKi, super helpful!.. I appreciate your quick response – Nicholas Jul 14 '20 at 10:12

score 1 · Answer 3 · answered Jul 14 '20 at 10:13

You have problems because there's not yet a "time" variable that differentiates the measures for each ID. You could use ave to make one and use reshape.

res <- reshape(transform(df, t=ave(companies, companies, FUN=seq)), 
        idvar="companies", timevar="t", direction="wide")
res
#     companies sic.1 sic.2 sic.3
# 1     ABC Ltd 12345 24155 31231
# 4 Derwent plc 55346 34234  <NA>

However, you may want to reconsider your data on which measurements of the IDs correspond to each other!

Thank you jay.sf, appreciate your quick response. I take your point. I have never seen 'ave' been used before, learnt a lot today! — Nicholas, Jul 14 '20 at 10:16

R - pivoting duplicate rows into multiple column with unknown number of columns

3 Answers3

Linked

Related