I have a data set from an online card sorting activity. Participants were presented with a random subset of Cards (from a larger set) and asked to create Groups of Cards they felt were similar to one another. Participants were able to create as many Groups as they liked and name the Groups whatever they wanted.
An example data set is something like this:
Data <- structure(list(Subject = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L,
3L, 3L, 3L, 3L, 3L), Card = structure(c(1L, 2L, 3L, 4L, 5L, 6L,
7L, 8L, 9L, 10L, 2L, 3L, 5L, 7L, 9L, 10L, 11L, 12L, 13L, 14L,
1L, 3L, 4L, 5L, 6L, 7L, 8L, 12L, 13L, 14L), .Label = c("A", "B",
"C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N"), class = "factor"),
Group = structure(c(1L, 2L, 3L, 4L, 1L, 3L, 3L, 5L, 2L, 5L,
1L, 2L, 1L, 3L, 1L, 4L, 4L, 2L, 3L, 1L, 1L, 2L, 1L, 2L, 3L,
2L, 1L, 2L, 2L, 3L), .Label = c("Cat1", "Cat2", "Cat3", "Cat4",
"Cat5"), class = "factor")), .Names = c("Subject", "Card",
"Group"), class = "data.frame", row.names = c(NA, -30L))
From these data I'd like to create a similarity matrix showing how often each pair of Cards was grouped together, ideally as a proportion or percentage of total counts.
Something like these:
Count:
  A B C D E F G H I J K L M N
A - 0 0 1 1 0 0 1 0 0 0 0 0 0
B 0 - 0 0 1 0 0 0 2 0 0 0 0 1
C 0 0 - 0 1 1 2 0 0 0 0 2 1 0
D 1 0 0 - 0 0 0 1 0 0 0 0 0 0
E 1 1 1 0 - 0 1 0 1 0 0 1 1 1
F 0 0 1 0 0 - 1 0 0 0 0 0 0 1
G 0 0 2 0 1 1 - 0 0 0 0 1 2 0
H 1 0 0 1 0 0 0 - 0 1 0 0 0 0
I 0 2 0 0 1 0 0 0 - 0 0 0 0 1
J 0 0 0 0 0 0 0 1 0 - 1 0 0 0
K 0 0 0 0 0 0 0 0 0 1 - 0 0 0
L 0 0 2 0 1 0 1 0 0 0 0 - 1 0
M 0 0 1 0 1 0 2 0 0 0 0 1 - 0
N 0 1 0 0 1 1 0 0 1 0 0 0 0 -
Every subject named their Groups differently, so it's not possible to index by Group.
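To make the count I have in mind concrete, here is a minimal sketch of the calculation as I understand it. The function name `pair_counts` and the `toy` data frame are made up for illustration, and the sketch assumes each Subject has at most one row per Card.

```r
# Hypothetical helper: count how often each pair of Cards shares a Group.
# Assumes columns Subject, Card, Group and at most one row per Subject/Card.
pair_counts <- function(d) {
  # Group names only mean something within a Subject, so build a
  # Subject-specific key such as "1.Cat1" before cross-tabulating.
  key <- interaction(d$Subject, d$Group, drop = TRUE)
  incidence <- table(d$Card, key)   # Cards x (Subject, Group), entries 0/1
  counts <- tcrossprod(incidence)   # Cards x Cards co-occurrence counts
  diag(counts) <- NA                # a Card paired with itself is not of interest
  counts
}

# Made-up toy data, only for illustration
toy <- data.frame(
  Subject = c(1, 1, 1, 2, 2, 2),
  Card    = c("A", "B", "C", "A", "B", "D"),
  Group   = c("x", "x", "y", "p", "q", "p")
)
toy_counts <- pair_counts(toy)
print(toy_counts)
```

For the example data above, the call would be `pair_counts(Data)`.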
In addition to counts, I'd also like to generate a similarity matrix that reports, for each pair of Cards, the percentage of the participants presented with both Cards who grouped those two Cards together.
From the example data set, this would be the result:
     A   B   C   D   E   F   G   H   I   J   K   L   M   N
A    -   0   0  50  50   0   0  50   0   0   0   0   0   0
B    0   -   0   0  50   0   0   0 100   0   0   0   0 100
C    0   0   -   0  33  50  67   0   0   0   0 100  50   0
D   50   0   0   -   0   0   0  50   0   0   0   0   0   0
E   50  50  33   0   -   0  33   0  50   0   0  50  50  50
F    0   0  50   0   0   -  50   0   0   0   0   0   0 100
G    0   0  67   0  33  50   -   0   0   0   0  50 100   0
H   50   0   0  50   0   0   0   -   0 100   0   0   0   0
I    0 100   0   0  50   0   0   0   -   0   0   0   0 100
J    0   0   0   0   0   0   0 100   0   - 100   0   0   0
K    0   0   0   0   0   0   0   0   0 100   -   0   0   0
L    0   0 100   0  50   0  50   0   0   0   0   -  50   0
M    0   0  50   0  50   0 100   0   0   0   0  50   -   0
N    0 100   0   0  50 100   0   0 100   0   0   0   0   -
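In case the definition of the percentage is unclear, this is a rough sketch of what I mean: the number of Subjects who grouped a pair together, divided by the number of Subjects who were shown both Cards. The names `pair_percent` and `toy` are made up for illustration, and it assumes at most one row per Subject and Card. Note that in this sketch a pair that was never shown to the same Subject comes out as `NA` rather than 0.

```r
# Hypothetical helper: percentage of Subjects shown both Cards who
# placed them in the same Group.
# Assumes columns Subject, Card, Group and at most one row per Subject/Card.
pair_percent <- function(d) {
  # Subject-specific group key, since Group names differ between Subjects
  key <- interaction(d$Subject, d$Group, drop = TRUE)
  together  <- tcrossprod(table(d$Card, key))        # times grouped together
  presented <- tcrossprod(table(d$Card, d$Subject))  # times shown to the same Subject
  pct <- round(100 * together / presented)
  pct[presented == 0] <- NA   # pair never shown together: percentage undefined
  diag(pct) <- NA             # a Card paired with itself is not of interest
  pct
}

# Made-up toy data, only for illustration
toy <- data.frame(
  Subject = c(1, 1, 1, 2, 2, 2),
  Card    = c("A", "B", "C", "A", "B", "D"),
  Group   = c("x", "x", "y", "p", "q", "p")
)
toy_pct <- pair_percent(toy)
print(toy_pct)
```

For the example data above, the call would be `pair_percent(Data)`.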
Any suggestions would be greatly appreciated!
Edit: While the answer below works for the example data, it doesn't seem to work for my actual data, posted here: https://www.dropbox.com/s/mhqwyok0nmvt3g9/Sim_Example.csv?dl=0
For example, in those data I manually count 22 pairings of "Aircraft" and "Airport", which would be ~55%. But the answer below yields a count of 12 and 60%.