I'm trying to build a Sankey diagram using the networkD3 package in R.

I think I've set up the dataset in the correct way, starting from a table. Here is my code:

M <- data.frame(as.matrix( table(as.character(df$q4.1),as.character(df$q4.2))))

M <- filter(M, Freq !=0)

q4.1 and q4.2 are two categorical variables with the same categories. I'm interested in visualizing the flows going from the answers in q4.1 to the answers in q4.2.

nodes <- data.frame(
  name=c(as.character(M$Var1), as.character(M$Var2)) %>% unique()
)

The problem is that, because the two variables share the same category names, the nodes data frame ends up containing only one set of names.

Therefore, if somebody selected the same option in both questions, the link ends up with the same source and target (see the example below).

M$IDsource <- match(M$Var1, nodes$name)-1 

M$IDtarget <- match(M$Var2, nodes$name)-1

M

   Var1    Var2 Freq IDsource IDtarget
No idea No idea   16        7        7

As you can imagine, the resulting graph is odd, as people providing the same answers to both questions are shown as a circle returning to the same source.

Is renaming the categories in the second question the only way to solve the problem? Or what am I doing wrong?

Thanks for the support!

P.S. I already used the ggalluvial package with ggplot2 to create the graph I want. However, the result is not as nice as the plot you can get with the networkD3 package, and it can't be exported as an htmlwidget, so I would like to recreate the same graph with networkD3. Here is the code that worked with ggalluvial:

ggplot(data = M, aes(axis1 = Var1, axis2 = Var2, y = Freq)) +
  scale_x_discrete(limits = c("Next 6 months", "Next 12-18 Months"),
                   expand = c(0.1, 0.05)) +
  geom_alluvium() +
  geom_stratum() +
  geom_text(stat = "stratum", infer.label = TRUE)
CJ Yetman
Michela

1 Answer

In networkD3::sankeyNetwork, the zero-based index (row number minus one) of the nodes data frame is the key that connects the links data frame to the nodes data frame, not the 'name'. So you can have the same name more than once in the nodes data frame, but if the entries are meant to identify different nodes, they must be on separate rows.
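
To make that concrete, here is a minimal sketch with made-up data (the names `nodes_demo` and `links_demo` are hypothetical, not from the question). Two rows carry the same label, yet they are distinct nodes because the link refers to them by zero-based row index:

```r
# Hypothetical toy data: the same label appears twice, on separate rows.
nodes_demo <- data.frame(name = c("No idea", "No idea"),
                         stringsAsFactors = FALSE)

# The link refers to nodes by zero-based row index, not by name.
links_demo <- data.frame(source = 0, target = 1, value = 16)

# Look up the labels behind the indices (add 1 for R's one-based rows).
source_label <- nodes_demo$name[links_demo$source + 1]
target_label <- nodes_demo$name[links_demo$target + 1]

# Same label on both ends, but the source and target indices differ,
# so the link no longer loops back onto its own node.
same_label <- source_label == target_label
distinct_nodes <- links_demo$source != links_demo$target
```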

For instance, assuming you have data that looks something like this...

library(networkD3)
library(dplyr)

M <- expand.grid(Var1 = LETTERS[1:4], 
                 Var2 = LETTERS[1:4], 
                 stringsAsFactors = FALSE)

M$Freq <- sample(1:100, nrow(M))

M
#>    Var1 Var2 Freq
#> 1     A    A   81
#> 2     B    A   84
#> 3     C    A   42
#> 4     D    A   71
#> 5     A    B    9
#> 6     B    B   79
#> 7     C    B   82
#> 8     D    B   76
#> 9     A    C   41
#> 10    B    C   63
#> 11    C    C   95
#> 12    D    C   61
#> 13    A    D   33
#> 14    B    D    2
#> 15    C    D   13
#> 16    D    D   38

add some identifier to the values so you can distinguish which question they're from, for instance...

M$Var1 <- paste0(M$Var1, '_q41')
M$Var2 <- paste0(M$Var2, '_q42')

M
#>     Var1  Var2 Freq
#> 1  A_q41 A_q42    9
#> 2  B_q41 A_q42   86
#> 3  C_q41 A_q42   62
#> 4  D_q41 A_q42   26
#> 5  A_q41 B_q42   44
#> 6  B_q41 B_q42   93
#> 7  C_q41 B_q42   36
#> 8  D_q41 B_q42   51
#> 9  A_q41 C_q42    6
#> 10 B_q41 C_q42    5
#> 11 C_q41 C_q42   21
#> 12 D_q41 C_q42   83
#> 13 A_q41 D_q42   40
#> 14 B_q41 D_q42   77
#> 15 C_q41 D_q42   20
#> 16 D_q41 D_q42   85

do the same thing you've done to get a unique list of the nodes and then match the links data frame to them...

nodes <- data.frame(
  name=c(as.character(M$Var1), as.character(M$Var2)) %>% unique()
)

M$IDsource <- match(M$Var1, nodes$name)-1

M$IDtarget <- match(M$Var2, nodes$name)-1

nodes
#>    name
#> 1 A_q41
#> 2 B_q41
#> 3 C_q41
#> 4 D_q41
#> 5 A_q42
#> 6 B_q42
#> 7 C_q42
#> 8 D_q42

M
#>     Var1  Var2 Freq IDsource IDtarget
#> 1  A_q41 A_q42    9        0        4
#> 2  B_q41 A_q42   86        1        4
#> 3  C_q41 A_q42   62        2        4
#> 4  D_q41 A_q42   26        3        4
#> 5  A_q41 B_q42   44        0        5
#> 6  B_q41 B_q42   93        1        5
#> 7  C_q41 B_q42   36        2        5
#> 8  D_q41 B_q42   51        3        5
#> 9  A_q41 C_q42    6        0        6
#> 10 B_q41 C_q42    5        1        6
#> 11 C_q41 C_q42   21        2        6
#> 12 D_q41 C_q42   83        3        6
#> 13 A_q41 D_q42   40        0        7
#> 14 B_q41 D_q42   77        1        7
#> 15 C_q41 D_q42   20        2        7
#> 16 D_q41 D_q42   85        3        7

if you don't want the question suffix to be visible in the Sankey output, you can remove it now that you've already matched the right index...

nodes$name <- sub('_q4[1-2]$', '', nodes$name)

then print...

sankeyNetwork(Links = M, Nodes = nodes, Source = 'IDsource',
              Target = 'IDtarget', Value = 'Freq', NodeID = 'name')
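
If you'd rather not add and then strip a suffix, a possible alternative is to build the nodes data frame in two halves and shift the target indices past the source rows. This is only a sketch on made-up data; `M2`, `nodes2`, `src`, and `tgt` are hypothetical names, and it assumes a links table with the same Var1/Var2/Freq columns as above:

```r
# Made-up links table in the same shape as M (Var1, Var2, Freq).
M2 <- data.frame(Var1 = c("A", "B", "A"),
                 Var2 = c("A", "A", "B"),
                 Freq = c(5, 3, 2),
                 stringsAsFactors = FALSE)

src <- unique(M2$Var1)   # labels for the first question
tgt <- unique(M2$Var2)   # labels for the second question

# One row per node: source rows first, then target rows; labels may repeat.
nodes2 <- data.frame(name = c(src, tgt), stringsAsFactors = FALSE)

# Zero-based indices; target indices are offset by the number of source rows.
M2$IDsource <- match(M2$Var1, src) - 1
M2$IDtarget <- match(M2$Var2, tgt) - 1 + length(src)
```

With networkD3 loaded, `M2` and `nodes2` can then be passed to `sankeyNetwork()` in the same way as `M` and `nodes` above. The trade-off is that the offset logic is a little less obvious than a visible suffix, so the suffix approach may be easier to debug.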

CJ Yetman