3

Original data:

df <- structure(list(ID_client = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1_", "2_", "3_", "4_"), class = "factor"), Connected = c(1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), Year = c(2010L, 2010L, 2010L, 2010L, 2015L, 2015L, 2015L, 2015L)), class = "data.frame", row.names = c(NA, -8L))

Original data:

`ID_client Connected  Year
1_            1      2010
2_            1      2010
3_            1      2010
4_            0      2010
1_            1      2015
2_            0      2015
3_            1      2015
4_            0      2015`

My intention is to create the following data:

`Year ID_client    1_   2_   3_   4_
2010     1_       0    1    1    0
2010     2_       1    0    1    0
2010     3_       1    1    0    0
2010     4_       0    0    0    0
2015     1_       0    0    1    0
2015     2_       0    0    0    0
2015     3_       1    0    0    0
2015     4_       0    0    0    0`

In other words, a matrix that express that in, for instance, 2010 clients 1_, 2_, and 3_ were connected, while the other one was not. Importantly, I do not consider someone to be connected with herself.

I have tried the following code:

df %>%
  group_by(Year, Connected) %>%
  mutate(temp = rev(ID_client)) %>%
  pivot_wider(names_from = ID_client, 
          values_from = Connected, 
          values_fill = list(Connected = 0)) %>%
  arrange(Year, temp)

This code does not reproduce what I need. Instead, this is the result:

`Year ID_client    1_   2_   3_   4_
2010     1_       0    0    1    0
2010     2_       0    1    0    0
2010     3_       1    0    0    0
2010     4_       0    0    0    0
2015     1_       0    0    1    0
2015     2_       0    0    0    0
2015     3_       1    0    0    0
2015     4_       0    0    0    0`
camille
  • 16,432
  • 18
  • 38
  • 60
Miranda
  • 148
  • 13
  • What have you already tried? Please show some code, and create a [minimal, reproducible and verifiable example](https://stackoverflow.com/help/minimal-reproducible-example). – Optimistic Peach Sep 30 '19 at 01:21
  • It's something like a co-occurrence table - https://stackoverflow.com/questions/19977596/how-do-i-calculate-the-co-occurrence-in-the-table - but within each year. – thelatemail Sep 30 '19 at 02:19

2 Answers2

1

We can group_by Year and create a new column with ID_client values which has Connected == 1 in each group except for the current value. We complete the missing levels and then cast the data to wide format.

library(tidyverse)

df %>%
  group_by(Year) %>%
  mutate(temp = map(ID_client, ~setdiff(ID_client[Connected == 1], .x))) %>%
  unnest(cols = temp) %>%
  complete(temp = unique(ID_client), fill = list(Connected = 0)) %>%
  mutate(ID_client  = coalesce(as.character(ID_client), temp)) %>%
  pivot_wider(names_from = temp, 
              values_from = Connected, 
              values_fill = list(Connected = 0)) %>%
  arrange(Year, ID_client)

#   Year ID_client  `1_`  `2_`  `3_`  `4_`
#  <int> <chr>     <dbl> <dbl> <dbl> <dbl>
#1  2010 1_            0     1     1     0
#2  2010 2_            1     0     1     0
#3  2010 3_            1     1     0     0
#4  2010 4_            0     0     0     0
#5  2015 1_            0     0     1     0
#6  2015 2_            0     0     0     0
#7  2015 3_            1     0     0     0
#8  2015 4_            0     0     0     0
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
  • I do not think your code is reproducing what I wanted to do. When Client_ID is 1_, the column "1_" received the value 1. Instead, it should have been the column "2_". – Miranda Sep 30 '19 at 01:32
  • The issue with your answer is that it is dependent on this particular toy data. For instance, if the new data is df <- structure(list(ID_client = structure(c(1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L), .Label = c("1_", "2_", "3_", "4_"), class = "factor"), Connected = c(1L, 1L, 1L, 0L, 1L, 0L, 1L, 0L), Year = c(2010L, 2010L, 2010L, 2010L, 2015L, 2015L, 2015L, 2015L)), class = "data.frame", row.names = c(NA, -8L)) instead, it does not work. – Miranda Sep 30 '19 at 01:54
  • @Miranda ok..I get it now. I had misunderstood the question earlier. I have updated the answer. Can you check if it works now ? – Ronak Shah Sep 30 '19 at 02:43
  • Yes, it does. Thank you so much! – Miranda Sep 30 '19 at 03:24
1

You can use a self-join, i.e. an inner join of the data to itself. Join by the pieces of information that mark off a combination of clients: this would be the values in Year and Connected. Since your desired output has zeroes on its diagonal, filter to remove cases where the two IDs are the same.

As you can see, I haven't yet transitioned to the pivot_wider version of tidyr, but this should be adaptable. In spread, specify that unused factor levels shouldn't be dropped so you don't lose the ID 4.

library(dplyr)
library(tidyr)

inner_join(df, df, by = c("Year", "Connected")) %>%
  filter(Connected == 1, ID_client.x != ID_client.y) %>%
  spread(key = ID_client.y, value = Connected, fill = 0, drop = F) %>%
  arrange(Year) 
#>   ID_client.x Year 1_ 2_ 3_ 4_
#> 1          1_ 2010  0  1  1  0
#> 2          2_ 2010  1  0  1  0
#> 3          3_ 2010  1  1  0  0
#> 4          4_ 2010  0  0  0  0
#> 5          1_ 2015  0  0  1  0
#> 6          2_ 2015  0  0  0  0
#> 7          3_ 2015  1  0  0  0
#> 8          4_ 2015  0  0  0  0
camille
  • 16,432
  • 18
  • 38
  • 60