0

I want to bring data to long format and integrate information from multiple columns.

Example data: Let's say we observe 4 products (id 1:4) in an online shop and the review comments (comment*) of different customers. One product (id = 1) has only one review comment, while another product (id = 4) has 4 comments. For each comment, we also observe if this comment quotes the comment of another user (1 if yes, and 0 otherwise).

data = data.frame(id = c(1,2,3,4), n_comments = c(2,1,3,4),
                   comment1 = c("consetetur sadipscing", "Lorem ipsum", "dolor sit ame", "nonumy eirmod "), comment1_quote = c(1,0,0,1),
                   comment2 = c("clita kasd gubergren", NA, "sanctus est", "consetetur sadipscing"), comment2_quote = c(0,NA,0,0),
                   comment3 = c(NA, NA, "invidunt ut labore", "ea rebum"), comment3_quote = c(NA,NA,1,0),
                   comment4 = c(NA, NA, NA, "dolores et ea rebum"), comment4_quote = c(NA,NA,NA,1))

data
  id n_comments              comment1 comment1_quote              comment2 comment2_quote           comment3 comment3_quote            comment4 comment4_quote
1  1          2 consetetur sadipscing              1  clita kasd gubergren              0               <NA>             NA                <NA>             NA
2  2          1           Lorem ipsum              0                  <NA>             NA               <NA>             NA                <NA>             NA
3  3          3         dolor sit ame              0           sanctus est              0 invidunt ut labore              1                <NA>             NA
4  4          4        nonumy eirmod               1 consetetur sadipscing              0           ea rebum              0 dolores et ea rebum              1

Now we want to bring this data to long format by

  1. having one row for each comment on each product
  2. adding the information if the comment referred to a quote
  3. keeping the aggregate number of comments that does not vary for one product

Here is the target data:

target_data = data.frame(id = c(1,1,2,3,3,3,4,4,4,4), n_comments = c(2,2,1,3,3,3,4,4,4,4),
                   comment = c("consetetur sadipscing", "Lorem ipsum", "dolor sit ame", "nonumy eirmod ","clita kasd gubergren", "sanctus est", "consetetur sadipscing",
                   "invidunt ut labore", "ea rebum", "dolores et ea rebum"),
                   quote = c(1,0,0,1,0,0,0,1,0,1))
 
target_data
   id n_comments               comment quote
1   1          2 consetetur sadipscing     1
2   1          2           Lorem ipsum     0
3   2          1         dolor sit ame     0
4   3          3        nonumy eirmod      1
5   3          3  clita kasd gubergren     0
6   3          3           sanctus est     0
7   4          4 consetetur sadipscing     0
8   4          4    invidunt ut labore     1
9   4          4              ea rebum     0
10  4          4   dolores et ea rebum     1

Here is what I've tried, but does not work:

trial_da = data %>%  tidyr::pivot_longer(cols = starts_with('comment'), values_to = "comment", values_drop_na = TRUE)
Fehler: Can't combine `comment1` <character> and `comment1_quote` <double>.
Run `rlang::last_error()` to see where the error occurred.

trial_da
Fehler: Objekt 'trial_da' not found

This is happening because the "quote" column also starts with "comment". However, I'm not sure how to solve this.

Scijens
  • 541
  • 2
  • 11

2 Answers2

2

After a bit of renaming of column names you can use tidyr::pivot_longer -

names(data) <- sub('comment\\d+_|\\d+', '', names(data))

tidyr::pivot_longer(data, 
                    cols = -c(id, n_comments), 
                    names_to = '.value',
                    names_pattern = '(comment|quote)', 
                    values_drop_na = TRUE)

#      id n_comments comment                 quote
#   <dbl>      <dbl> <chr>                   <dbl>
# 1     1          2 "consetetur sadipscing"     1
# 2     1          2 "clita kasd gubergren"      0
# 3     2          1 "Lorem ipsum"               0
# 4     3          3 "dolor sit ame"             0
# 5     3          3 "sanctus est"               0
# 6     3          3 "invidunt ut labore"        1
# 7     4          4 "nonumy eirmod "            1
# 8     4          4 "consetetur sadipscing"     0
# 9     4          4 "ea rebum"                  0
#10     4          4 "dolores et ea rebum"       1
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • + 1, Ronak, in `names_pattern` you mentioned `|` does that mean those columns will be pivot longer in the output? i.e. had there been another column like `'(comment|quote|col)'`, the output will have additional `col` column in the pivoted output? – Karthik S Jun 24 '21 at 15:04
  • 1
    Yes, provided `col` is present in the column name. Here I renamed the columns to keep only `comment` and `quote` in the column name. – Ronak Shah Jun 25 '21 at 01:22
1

data.table approach

library(data.table)
ans <- setorder(
  melt(setDT(data), 
       id.vars = c("id", "n_comments"), 
       measure.vars = patterns(comment = "comment[0-9]+$", 
                               quote = ".*_quote"),
       na.rm = TRUE), id)
#    id n_comments variable               comment quote
# 1:  1          2        1 consetetur sadipscing     1
# 2:  1          2        2  clita kasd gubergren     0
# 3:  2          1        1           Lorem ipsum     0
# 4:  3          3        1         dolor sit ame     0
# 5:  3          3        2           sanctus est     0
# 6:  3          3        3    invidunt ut labore     1
# 7:  4          4        1        nonumy eirmod      1
# 8:  4          4        2 consetetur sadipscing     0
# 9:  4          4        3              ea rebum     0
#10:  4          4        4   dolores et ea rebum     1

You can drop the variable column witn ans[, variable := NULL] if desired.

Wimpel
  • 26,031
  • 1
  • 20
  • 37