1

I am attempting to aggregate my data to find correlations/patterns, and want to discover how and where data may correlate. Specifically, I want to identify how many times an id (here called 'item') appear together. Is there a way to find how many times each (id) appear together in a row?

This is for a larger data.frame that has already been cleaned and aggregated based on this particular inquiry. In the past, I have tried to apply multiple aggregation, summation and filter functions from packages like 'data.table','dplyr', and 'tidyverse' but cannot quite get what I am looking for.

In section 3(Show some code) I have provided a minimal reproducible example:

set.seed(1234)
random.people<-c("Bob","Tim","Jackie","Angie","Christopher")
number=sample(12345:12350,2000,replace = T)
item=sample(random.people,2000,replace=T)

sample_data <- data.frame(cbind(number,item), stringsAsFactors = FALSE)

Using the examples here,I expected the output to ID all the combinations where names were aggregated to a number and show the n (value) - expecting results to resemble something like:

Pair       value
Bob, Tim     2
Bob, Jackie  4
Bob, Angie   0

This output (what I am hoping to get) would tell me that in the entire df, there are 2 times that Bob and Tim and 4 times that Bob and Jackie both have the same number.

but the actual output is:

Error: Each row of output must be identified by a unique combination of keys.

Keys are shared for 2000 rows:
* 9, 23, 37, 164, 170, 180, 211...

Update: I thought of a..creative(?) solution - but hope someone can help with expedting it. I can locate all the numbers (column1) that are shared between two names using the following:

x1<-sample_data %>% dplyr::filter(item=="Bob")
x2<-sample_data %>% dplyr::filter(item=="Tim")
Bob<-x1[,1]
Tim<-x2[,1]
Reduce(intersect, list(Bob,Tim))

output:

[1] "12345" "12348" "12350" "12346" "12349" "12347"

Like I said, this is very time consuming and would require creating a plethora of vectors and intersecting each(e.g. 1 vector for each name) and multiple combinations.

OctoCatKnows
  • 399
  • 3
  • 17
  • Yes, separated I think. Just want to: id the number that is shared and the item (names) that share it. – OctoCatKnows May 29 '19 at 22:16
  • The other answers from your link would have likely worked if you had added the ```distinct``` to your pipeline. Your error shows that it had issues spreading the data because there were duplicates. – Cole May 31 '19 at 10:19

2 Answers2

3
set.seed(1234)
random.people<-c("Bob","Tim","Jackie","Angie","Christopher")
number=sample(12345:22350,2000,replace = T) # I edited ur number here.
item=sample(random.people,2000,replace=T)

sample_data <- data.frame(cbind(number,item), stringsAsFactors = FALSE)

library(tidyverse)
sample_data %>%
  # find out unique rows
  distinct() %>%
  # nest the data frame into nested tibble, so now you have
  # a "data" column, which is a list of small data frames.
  group_nest(number) %>%
  # Here we use purrr::map to modify the list column. We want each 
  # combination counts only once despite the order, so we use sort. 
  mutate(data = map_chr(data, ~paste(sort(.x$item), collapse = ", "))) %>%
  # the last two steps just count the numbers
  group_by(data) %>%
  count()

# A tibble: 21 x 2
# Groups:   data [21]
   data                         n
   <chr>                    <int>
 1 Angie                      336
 2 Angie, Bob                   8
 3 Angie, Bob, Christopher      2
 4 Angie, Bob, Jackie           1
 5 Angie, Christopher          16
 6 Angie, Jackie                9
 7 Angie, Tim                  10
 8 Bob                        331
 9 Bob, Christopher            12
10 Bob, Christopher, Jackie     1
# … with 11 more rows

One possible solution

Hao
  • 7,476
  • 1
  • 38
  • 59
  • Hello and thanks for helping me out here. I am having trouble understanding the output. It appears as a df with one obs: Unsure how to decipher this:? Classes ‘grouped_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1 obs. of 2 variables: – OctoCatKnows May 29 '19 at 21:42
  • @BuffsGrad16 Isn't this what you are looking for? I just modified your example and included that in my updated answer. You can check it out. – Hao May 29 '19 at 23:54
  • any idea of an alternative to dplyr::group_nest? I was hoping this would work by my company is slow to pull down updates to packages and the version (dplyr) we have doe snot have the function (also, recreating the function fails) – OctoCatKnows May 30 '19 at 11:07
  • 1
    @BuffsGrad16 `%>% group_by(number) %>% nest() %>%` – Hao May 30 '19 at 14:31
1

Here's a base R solution which relies on table -> aggregate and potentially an inefficient way to paste the names together using apply.

tab_data <-  data.frame(unclass(table(unique(sample_data))))
#table results in columns c(Angie.1, Bob.1, ...) - this makes it look better
names(tab_data) = sort(random.people) 

library(network)
plot.network.default(as.network(tab_data))

tab_data$n <- 1

agg_data <- aggregate(n~., data = tab_data, FUN = length)
agg_data$Pair <- apply(agg_data[, -length(agg_data)], 1, function(x) paste(names(x[x!=0]), collapse = ', '))


agg_data[order(agg_data$Pair), c('Pair', 'n') ]

                            Pair   n
1                          Angie 336
3                     Angie, Bob   8
7        Angie, Bob, Christopher   2
11            Angie, Bob, Jackie   1
5             Angie, Christopher  16
9                  Angie, Jackie   9
15                    Angie, Tim  10
2                            Bob 331
6               Bob, Christopher  12
... truncated ...

As far as performance, on this relatively small data set, it's around 9x faster than the dplyr solution:

Unit: milliseconds
           expr     min       lq     mean   median       uq      max neval
  base_solution  9.4795  9.65215 10.80984  9.87625 10.32125  46.8230   100
 dplyr_solution 78.6070 81.72155 86.47891 83.96435 86.40495 200.7784   100

Data

set.seed(1234)
random.people<-c("Bob","Tim","Jackie","Angie","Christopher")
number=sample(12345:22350,2000,replace = T) # I edited ur number here.
item=sample(random.people,2000,replace=T)

sample_data <- data.frame(number,item, n = 1L, stringsAsFactors = FALSE)
Cole
  • 11,130
  • 1
  • 9
  • 24
  • Thanks @Cole. I am going to give this a try later today. – OctoCatKnows Jun 01 '19 at 14:54
  • Wow, I like this a lot - it is clean. Is it possible to make this in to a network matrix you think? I am having a heck of a time with the data b/c it is too big and the network plot (nodes,edges) are so high it is unreadable. I have tried to apply it just to the grouped output (such as that seen in your agg_data) but I am unable to form edges as the output for nodes is two (Pair, N) – OctoCatKnows Jun 02 '19 at 15:00
  • You may want to ask another question, but is ```network::plot.network.default(network::as.network(tab_data))``` closer to your desired result? I tested the call prior to the ```tab_data$n <- 1``` line and I had a network graph with 5 main people as expected. However, The graph looked pretty complicated. – Cole Jun 02 '19 at 20:36
  • Hmm, odd - not getting anything back (other than 'Error in plot.network.default: no layout function for mode fruchtermanreingold' – OctoCatKnows Jun 03 '19 at 10:42
  • I edited the code to add the call - see lines 5 and 6. – Cole Jun 03 '19 at 10:54