
I have a quick question about reshaping a data frame whose rows are grouped by ID. Here is the schema of the df, plus 2 example rows that I wish to widen (in total I have >5000 rows):

   id                  solver   scoreA  scoreB  group   size 
   <chr>               <chr>    <dbl>   <dbl>   <chr>   <dbl>
 1 instance_1          s1       1        0.5    g1      1000                     
 2 instance_1          s2       100      50     g1      1000

... and this is what I want to end up with:

   id           solver.best  scoreA.s1  scoreA.s2  scoreB.s1   scoreB.s2  group   size 
   <chr>        <chr>        <dbl>      <dbl>      <dbl>       <dbl>      <chr>   <dbl>
 1 instance_1   s1           1          100        0.5         50         g1      1000                     

Appreciate your help. BR

gero
    If you have over 5000 rows and you want to do this, you will have around 5000 columns; this is almost surely the wrong way to go. What you are trying to accomplish should have an alternative route. – dvd280 Aug 24 '20 at 11:49
  • the ID always groups 2 instances... basically I want two rows to become flat, i.e. going from 5000 rows to 2500 – gero Aug 24 '20 at 11:59
  • You can do the math, what's the difference between 5 x 5000 and 2500 x 5000 ? – dvd280 Aug 24 '20 at 12:03
  • I do not see your point. I admit that spreading via tidyverse techniques may not lead to the preferred result, and with no grouping one will end up with a 2500 x 5000 df. Other than that, I don't see any fatal flaw in the idea of shrinking 2 rows into one once grouped by ID... – gero Aug 24 '20 at 12:07
    The purpose of asking you to share `dput` is that we can copy your data (or part of it) into our R environment and use it. When you update your post with an incomplete `dput` (which we cannot copy) it's as good as not sharing at all. If your data is very big you can share only the first few rows, e.g. `dput(head(df))` for the first 6 rows. – Ronak Shah Aug 24 '20 at 13:04
  • I know, but the problem is that the data is confidential and it is not easy for me to pseudonymize/mask it. To be honest, I think that the schema gives the idea -- after id: instance_1 comes instance_2 --> again pairwise. All the way up until 10000, so we have 5000 unique IDs. – gero Aug 24 '20 at 13:08

2 Answers


Maybe you can try the code below

# add a within-id counter Q (1, 2, ...) and reshape to wide
reshape(within(df, Q <- ave(seq(nrow(df)), id, FUN = seq_along)),
  direction = "wide",
  idvar = "id",
  timevar = "Q"
)

which gives

> reshape(within(df, Q <- ave(seq(nrow(df)), id, FUN = seq_along)), direction = "wide", idvar = "id", timevar = "Q")
          id solver.1 scoreA.1 scoreB.1 group.1 size.1 solver.2 scoreA.2
1 instance 1       s1        1      0.5      g1   1000       s2      100
  scoreB.2 group.2 size.2
1       50      g1   1000

Data

> dput(df)
structure(list(id = c("instance 1", "instance 1"), solver = c("s1", 
"s2"), scoreA = c(1L, 100L), scoreB = c(0.5, 50), group = c("g1",
"g1"), size = c(1000L, 1000L)), class = "data.frame", row.names = c("1",
"2"))
ThomasIsCoding
  • thanks, but it only returns NA values for me :/ ... I have more than just 2 rows, but there are always 2 instances grouped by ID... – gero Aug 24 '20 at 11:45
    @gero Did you use the data in my answer? Otherwise, please `dput()` your data and I will have a look at what happened – ThomasIsCoding Aug 24 '20 at 11:47
  • structure(list(id = c("instance 1", "instance 1", "instance 2", "instance 2", [... more instances/groups ...], solver = c("S1", "S2", [... more solver ...], scoreA = c(3.3818, 358.1937, ...., scoreB = c(1.3818, 100.1937, ...., group = c("g1", "g1", ...., size = c("1000", "1000", ....) – gero Aug 24 '20 at 11:53
    @gero Could you put your data in your post? It seems the data you showed in the comment is incomplete – ThomasIsCoding Aug 24 '20 at 12:13
  • The data is confidential. However, here is everything from dput() (find in edited post) – gero Aug 24 '20 at 12:29

While I would still like a handy best-practice solution (e.g. with the tidyverse), I want to share the practical approach I used, which works just fine conceptually :):

# create empty (wide) target df
wide_df <- data.frame(matrix(ncol = 8, nrow = 0))
colnames(wide_df) <- c("id", "best_solver", "scoreA_s1", "scoreA_s2",
                       "scoreB_s1", "scoreB_s2", "group", "size")


# traverse the grouped-by and arranged original (long) df pairwise
for(i in seq(2, nrow(long_df), by = 2)){
  idx <- (i-1):i                                  # the two rows of this id
  wide_df[i/2, "id"]          <- long_df[i, "id"]
  # best solver = the one with the lower scoreA within this pair
  # (note: which.min() is applied to the pair, and idx maps the result
  #  back to the row number in long_df)
  wide_df[i/2, "best_solver"] <- long_df[idx[which.min(long_df[idx, "scoreA"])],
                                         "solver"]
  wide_df[i/2, "scoreA_s1"]   <- long_df[i-1, "scoreA"]
  wide_df[i/2, "scoreA_s2"]   <- long_df[i, "scoreA"]
  wide_df[i/2, "scoreB_s1"]   <- long_df[i-1, "scoreB"]
  wide_df[i/2, "scoreB_s2"]   <- long_df[i, "scoreB"]
  wide_df[i/2, "group"]       <- long_df[i, "group"]
  wide_df[i/2, "size"]        <- long_df[i, "size"]
}
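For completeness, the loop can also be written without row-by-row assignment. A vectorized sketch, assuming (as the loop does) that the rows are arranged so that each id's s1 row immediately precedes its s2 row:

```r
# split the long df into the first (s1) and second (s2) row of each pair
odd  <- long_df[seq(1, nrow(long_df), by = 2), ]
even <- long_df[seq(2, nrow(long_df), by = 2), ]

wide_df <- data.frame(
  id          = even$id,
  # best solver = the one with the lower scoreA within the pair
  best_solver = ifelse(odd$scoreA <= even$scoreA, odd$solver, even$solver),
  scoreA_s1   = odd$scoreA,
  scoreA_s2   = even$scoreA,
  scoreB_s1   = odd$scoreB,
  scoreB_s2   = even$scoreB,
  group       = even$group,
  size        = even$size
)
```

This builds the whole wide df in one step, which should also be noticeably faster than growing wide_df inside a loop for 5000 pairs.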
gero