I am attempting to aggregate my data to find correlations/patterns, and want to discover how and where the data may correlate. Specifically, I want to identify how many times a pair of ids (here called 'item') appears together. Is there a way to count how many times each pair of ids appears with the same value in another column?
This is for a larger data.frame that has already been cleaned and aggregated for this particular inquiry. I have tried multiple aggregation, summation, and filter functions from packages like 'data.table', 'dplyr', and 'tidyverse', but cannot quite get what I am looking for.
Below is a minimal reproducible example:
set.seed(1234)
random.people <- c("Bob", "Tim", "Jackie", "Angie", "Christopher")
number <- sample(12345:12350, 2000, replace = TRUE)
item <- sample(random.people, 2000, replace = TRUE)
# cbind() builds a character matrix first, so both columns end up as character
sample_data <- data.frame(cbind(number, item), stringsAsFactors = FALSE)
Using the examples here, I expected the output to identify all the combinations where names were aggregated to a number and to show the n (value), with results resembling something like:
Pair         value
Bob, Tim         2
Bob, Jackie      4
Bob, Angie       0
This output (what I am hoping to get) would tell me that, across the entire data frame, there are 2 times that Bob and Tim have the same number, and 4 times that Bob and Jackie do.
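To pin down the counting I am after, here is a tiny hand-checkable example (hypothetical data, not my real data frame): a pair "appears together" once for each distinct number both names share.

```r
# Hypothetical mini data set to illustrate the desired count
mini <- data.frame(
  number = c(1, 1, 2, 2, 3, 3),
  item   = c("Bob", "Tim", "Bob", "Jackie", "Bob", "Jackie"),
  stringsAsFactors = FALSE
)
# Bob & Tim share only number 1        -> count 1
# Bob & Jackie share numbers 2 and 3   -> count 2
length(intersect(mini$number[mini$item == "Bob"],
                 mini$number[mini$item == "Tim"]))     # 1
length(intersect(mini$number[mini$item == "Bob"],
                 mini$number[mini$item == "Jackie"]))  # 2
```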
but the actual output is:
Error: Each row of output must be identified by a unique combination of keys.
Keys are shared for 2000 rows:
* 9, 23, 37, 164, 170, 180, 211...
Update: I thought of a somewhat creative(?) solution, but hope someone can help with expediting it. I can locate all the numbers (column 1) that are shared between two names using the following:
x1 <- sample_data %>% dplyr::filter(item == "Bob")
x2 <- sample_data %>% dplyr::filter(item == "Tim")
Bob <- x1[, 1]
Tim <- x2[, 1]
Reduce(intersect, list(Bob, Tim))
output:
[1] "12345" "12348" "12350" "12346" "12349" "12347"
Like I said, this is very time consuming: it would require creating a plethora of vectors (one for each name) and intersecting every combination of them.