2

I have a data frame that has several hundred thousand rows and 6 columns. Each column contains IDs (there are about 500 unique IDs in total). I would like to convert this data frame into a large table/matrix with each unique ID having its own column, and having a -1, 0, or 1 for each entry corresponding to the following logic: 0 if the ID is not present, -1 if the ID is in the first 3 columns, and 1 if the ID is in the last 3 columns.

I can use the brute force approach, looping through each row, one-by-one, but I'm looking for a faster and more polished way to do this. My preference would be to use a dplyr solution, assuming one exists. I'm guessing there's also a nifty way to do this with data.table, or even just a simple base R approach. Any help would be appreciated!

Thanks in advance. Here's an example of what my data looks like:

df <- data.frame(matrix(c("XX001","XX002","XX003","XX007","XX008","XX009",
                          "XX001","XX004","XX005","XX006","XX010","XX008",
                          "XX003","XX002","XX005","XX008","XX009","XX010",
                          "XX002","XX005","XX003","XX009","XX007","XX010",
                          "XX001","XX002","XX004","XX007","XX009","XX006"),
                        nrow=5, ncol=6, byrow=1))
names(df) <- c("ID_X1","ID_X2","ID_X3","ID_Y1","ID_Y2","ID_Y3")
df

> df
  ID_X1 ID_X2 ID_X3 ID_Y1 ID_Y2 ID_Y3
1 XX001 XX002 XX003 XX007 XX008 XX009
2 XX001 XX004 XX005 XX006 XX010 XX008
3 XX003 XX002 XX005 XX008 XX009 XX010
4 XX002 XX005 XX003 XX009 XX007 XX010
5 XX001 XX002 XX004 XX007 XX009 XX006

And here's what I would like my output to look like:

> yay
  XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010  ... XX500
1    -1    -1    -1     0     0     0     1     1     1     0  ...     0
2    -1     0     0    -1    -1     1     0     1     0     1  ...     0
3     0    -1    -1     0    -1     0     0     1     1     1  ...     0
4     0    -1    -1     0    -1     0     1     0     1     1  ...     0
5    -1    -1     0    -1     0     1     1     0     1     0  ...     0
donkeyshark
  • 111
  • 1
  • 8

2 Answers2

2

Try this:

library(dplyr)
library(tidyr)
#Code
newdf <- df %>% mutate(id=row_number()) %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  mutate(Val=ifelse(row_number() %in% 1:3,-1,
                    ifelse(row_number() %in% ((n()-3):n()),1,0))) %>%
  select(-name) %>%
  pivot_wider(names_from = value,values_from=Val,names_sort = T,values_fill = 0) %>%
  ungroup() %>%
  select(-id)

Output:

# A tibble: 5 x 10
  XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1    -1    -1    -1     0     0     0     1     1     1     0
2    -1     0     0    -1    -1     1     0     1     0     1
3     0    -1    -1     0    -1     0     0     1     1     1
4     0    -1    -1     0    -1     0     1     0     1     1
5    -1    -1     0    -1     0     1     1     0     1     0

Update: As OP is having issues with duplicated values, here a possible sketch to tackle the task. First a dummy data:

df2
  ID_X1 ID_X2 ID_X3 ID_Y1 ID_Y2 ID_Y3
1 XX001 XX001 XX003 XX007 XX008 XX009
2 XX001 XX004 XX005 XX006 XX010 XX008
3 XX003 XX002 XX005 XX008 XX009 XX010
4 XX002 XX005 XX003 XX009 XX007 XX010
5 XX001 XX002 XX004 XX007 XX009 XX006

We can see the first row has a duplicated. So, we can create an index to make difference between duplicated values. Here the code:

#Code 2
newdf <- df2 %>% mutate(id=row_number()) %>%
  pivot_longer(-id) %>%
  group_by(id) %>%
  mutate(Val=ifelse(row_number() %in% 1:3,-1,
                    ifelse(row_number() %in% ((n()-3):n()),1,0))) %>%
  ungroup() %>%
  group_by(id,value) %>%
  mutate(value=paste0(value,'.',row_number())) %>%
  select(-name) %>%
  pivot_wider(names_from = value,values_from=Val,names_sort = T,values_fill = 0) %>%
  ungroup() %>%
  select(-id)

Output:

# A tibble: 5 x 11
  XX001.1 XX001.2 XX002.1 XX003.1 XX004.1 XX005.1 XX006.1 XX007.1 XX008.1 XX009.1 XX010.1
    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1      -1      -1       0      -1       0       0       0       1       1       1       0
2      -1       0       0       0      -1      -1       1       0       1       0       1
3       0       0      -1      -1       0      -1       0       0       1       1       1
4       0       0      -1      -1       0      -1       0       1       0       1       1
5      -1       0      -1       0      -1       0       1       1       0       1       0

In this way, the duplicated value has been kept.

Duck
  • 39,058
  • 13
  • 42
  • 84
  • This works great on my example dataset above, but when I try to use it on my real dataset, I'm running into problems and get the following error. Any idea why? Warning: Values are not uniquely identified; output will contain list-cols. * Use `values_fn = list` to suppress this warning. * Use `values_fn = length` to identify where the duplicates arise * Use `values_fn = {summary_fun}` to summarise duplicates Error: Can't convert to . Run `rlang::last_error()` to see where the error occurred. – donkeyshark Nov 24 '20 at 19:15
  • Perhaps of note is the fact that there are many duplicate rows in this dataset, which need to be maintained. When I run the code on a small subset of my data, it does work, but when I run it across the full 500k+ rows, it gives me the error above. – donkeyshark Nov 24 '20 at 19:51
  • @donkeyshark I see, maybe could you add a sample of data replicating the duplicated values so that I can adjust the code for you? – Duck Nov 24 '20 at 20:17
  • @donkeyshark Hi dear. I have added a little example with a duplicated value, please check and let me know if that could work! – Duck Nov 24 '20 at 21:00
  • Wow! This worked! Brilliant! And you figured it out without even being able to reproduce my error, haha. I can follow everything in your code except for the part where you define 'Val' using the row_number() function. How does this work? When I run through the code step-by-step, it seems like there should be a ton of rows, but your code only looks at rows 1:3, then 3:6 (I assume you meant to type n()-2 rather than n()-3 but it still works either way. Can you explain what is happening in this line? – donkeyshark Nov 24 '20 at 21:44
  • @donkeyshark Great! What you said is correct. That line of code replicates what you want. This is looking at the beginning and bottom and then assign the code you want. Certainly, yourself got the explanation for that. Hope that helped :) Also consider potentially accepting this answer if this was helpful! – Duck Nov 24 '20 at 21:47
  • I figured out what caused my error as a result of your code. There are a few lines in my data where the same ID is present on the left side and again on the right side of the same row. Your code treats those as 2 separate IDs. I think I need to go back into my data and see if that is an error or not. But either way, this is super helpful. Thanks again! – donkeyshark Nov 24 '20 at 21:58
  • @donkeyshark That is great! Also if you accepted the other answer, you could upvote mine, I solved your issue from two perspectives. – Duck Nov 24 '20 at 22:04
1

This is a vectorized solution:

id <- as.character(as.matrix(df)) %>% unique(.)
id <- id[order(id)]

match_id <- function(x) match(id,x)

yay <- as.data.frame(t(apply(df,1,match_id)))
names(yay) <- id
yay[yay<=3] <- -1
yay[yay>3] <- 1
yay[is.na(yay)] <- 0

Output:

yay
#   XX001 XX002 XX003 XX004 XX005 XX006 XX007 XX008 XX009 XX010
# 1    -1    -1    -1     0     0     0     1     1     1     0
# 2    -1     0     0    -1    -1     1     0     1     0     1
# 3     0    -1    -1     0    -1     0     0     1     1     1
# 4     0    -1    -1     0    -1     0     1     0     1     1
# 5    -1    -1     0    -1     0     1     1     0     1     0
Marcos Pérez
  • 1,260
  • 2
  • 7