3

I am working with the R programming language.

I have a dataset that looks something like this:

x = c("GROUP", "A", "B", "C")
date_1 = c("CLASS 1", 20, 60, 82)
date_1_1 = c("CLASS 2", 37, 22, 8)
date_2 = c("CLASS 1", 15,100,76)
date_2_1 = c("CLASS 2", 84, 18,88)

my_data = data.frame(x,  date_1, date_1_1, date_2, date_2_1)

      x  date_1 date_1_1  date_2 date_2_1
1 GROUP CLASS 1  CLASS 2 CLASS 1  CLASS 2
2     A      20       37      15       84
3     B      60       22     100       18
4     C      82        8      76       88
    

I am trying to restructure the data so it looks like this:

  • note : in the real excel data, date_1 is the same date as date_1_1 and date_2 is the same as date_2_1 ... R wont accept the same names, so I called them differently

enter image description here

Currently, I am manually doing this in Excel using different "tranpose" functions - but I am wondering if there is a way to do this in R (possibly using the DPLYR library).

I have been trying to read different tutorial websites online (Pivoting), but so far nothing seems to match the problem I am trying to work on.

Can someone please show me how to do this?

Thanks!

stats_noob
  • 5,401
  • 4
  • 27
  • 83
  • 5
    Since you're intentionally corrupting your numbers into strings by including `"CLASS 1"` (etc) as a value in each column, what should happen to those strings? It seems you expect to be able to use that row of _data_ to define what the column name should look like. – r2evans Dec 08 '22 at 21:56
  • 2
    It would be better for both your question and anyone trying to answer if you provide the data as it really is, rather than force an additional header into the first row as @r2evans suggests. – Peter Dec 08 '22 at 21:57
  • You can use the `check.names = FALSE` argument to `data.frame` if you're trying to force duplicate column names. – neilfws Dec 08 '22 at 22:12
  • you are looking for `pivot_longer` functionality here – GWD Dec 08 '22 at 22:19
  • or `data.table::melt` with its extension of `measure.vars=patterns(..,..)` – r2evans Dec 08 '22 at 22:43

2 Answers2

1

Made assumptions about your data because of the duplicate column names. For example, if the Column header pattern is CLASS_ClassNum_Date

df<-data.frame(GROUP = c("A", "B", "C"),
               CLASS_1_1 = c(20, 60, 82),
               CLASS_2_1 = c(37, 22, 8),
               CLASS_1_2 = c(15,100,76),
               CLASS_2_2 = c(84, 18,88))
library(tidyr)
pivot_longer(df, -GROUP, 
             names_pattern = "(CLASS_.*)_(.*)", 
             names_to = c(".value", "Date"))
  GROUP Date  CLASS_1 CLASS_2
  <chr> <chr>   <dbl>   <dbl>
1 A     1          20      37
2 A     2          15      84
3 B     1          60      22
4 B     2         100      18
5 C     1          82       8
6 C     2          76      88

Edit: Substantially improved pivot_longer by using names_pattern= correctly

M.Viking
  • 5,067
  • 4
  • 17
  • 33
  • 1
    Try with `names_to = c(".value", "Date"))` to spare you the pivot_wider. (; – stefan Dec 08 '22 at 22:36
  • Thank you for your answer! Is there some way I can just write all the column names instead of using "names_pattern"? Thanks! – stats_noob Dec 08 '22 at 23:53
  • Yes, using the or (`|`) command separating names, what are an example of all these column names? Example using the `|` - https://stackoverflow.com/a/74736627/10276092 – M.Viking Dec 09 '22 at 00:24
1

There are lots of ways to achieve your desired outcome, but I don't believe there is an 'easy'/'simple' way. Here is one potential solution:

library(tidyverse)
library(vctrs)

x = c("GROUP", "A", "B", "C")
date_1 = c("CLASS 1", 20, 60, 82)
date_1_1 = c("CLASS 2", 37, 22, 8)
date_2 = c("CLASS 1", 15,100,76)
date_2_1 = c("CLASS 2", 84, 18,88)

my_data = data.frame(x,  date_1, date_1_1, date_2, date_2_1)

# Combine column names with the names in the first row
colnames(my_data) <- paste(my_data[1,], colnames(my_data), sep = "-")

my_data %>%
  filter(`GROUP-x` != "GROUP") %>% # remove first row (info now in column names)
  pivot_longer(everything(), # pivot the data
               names_to = c(".value", "Date"),
               names_sep = "-") %>%
  mutate(GROUP = vec_fill_missing(GROUP, # fill NAs in GROUP introduced by pivoting
                                  direction = "downup")) %>%
  filter(Date != "x") %>% # remove "unneeded" rows
  mutate(`CLASS 2` = vec_fill_missing(`CLASS 2`, # fill NAs again 
                                      direction = "downup")) %>%
  na.omit() %>% # remove any remaining NAs
  mutate(across(starts_with("CLASS"), ~as.numeric(.x)),
         Date = str_extract(Date, "\\d+")) %>%
  rename("date" = "Date", # rename the columns
         "group" = "GROUP",
         "count_class_1" = `CLASS 1`,
         "count_class_2" = `CLASS 2`) %>%
  arrange(date) # arrange by "date" to get your desired output
#> # A tibble: 6 × 4
#>   date  group count_class_1 count_class_2
#>   <chr> <chr>         <dbl>         <dbl>
#> 1 1     A                20            37
#> 2 1     B                60            84
#> 3 1     C                82            18
#> 4 2     A                15            37
#> 5 2     B               100            22
#> 6 2     C                76             8

Created on 2022-12-09 with reprex v2.0.2

jared_mamrot
  • 22,354
  • 4
  • 21
  • 46