I have a large data frame, where I use combn to get the unique combinations between each pair of variables. So the output looks as follows ( with a small example of only 7 variables). The problem I have is that I wish to identify which features have more than say 50% of their correlations exceeding 0.5; however, the output from combn has the variable occuring in both columns. E.g. from the below Var2 occurs 5 times in total ( 4 in Variable_list1 and once in Variable_list2), but Var1 occurs 7 times in Variable_list1.
structure(list(Variable_list1 = c("Var1", "Var1", "Var1", "Var1",
"Var1", "Var1", "Var1", "Var2", "Var2", "Var2", "Var2", "Var3",
"Var3", "Var3", "Var3", "Var3", "Var4", "Var4", "Var4", "Var4",
"Var5", "Var5", "Var5", "Var6", "Var6", "Var7"), Variable_list2 = c("Var2",
"Var3", "Var4", "Var5", "Var6", "Var7", "Var8", "Var4", "Var5",
"Var6", "Var7", "Var4", "Var5", "Var6", "Var7", "Var8", "Var5",
"Var6", "Var7", "Var8", "Var6", "Var7", "Var8", "Var7", "Var8",
"Var8"), Coefficient = c("0.771428571428571", "0.839285714285714",
"0.839285714285714", "0.807142857142857", "0.775", "0.807142857142857",
"0.739285714285714", "0.964285714285714", "0.982142857142857",
"0.935714285714286", "0.953571428571429", "0.925", "0.946428571428571",
"0.957142857142857", "0.975", "0.921428571428571", "0.985714285714286",
"0.921428571428571", "0.935714285714286", "0.964285714285714",
"0.932142857142857", "0.957142857142857", "0.978571428571428",
"0.982142857142857", "0.95", "0.960714285714286")), row.names = c(1L,
2L, 3L, 4L, 5L, 6L, 7L, 9L, 10L, 11L, 12L, 14L, 15L, 16L, 17L,
18L, 19L, 20L, 21L, 22L, 23L, 24L, 25L, 26L, 27L, 28L), class = "data.frame")
I'm not sure how to approach this ( I understand you can use permutations to get the full list of combinations where the feature is occuring in the same number in Variable_list1, but I eliminate some rows anyway, so regardless the feature is occuring in different numbers in list 1 and 2).
I can use table(data$Variable_list1) and table(data$Variable_list2) and look at the number of times each feature is occurring, considering those variables unique to each column and also where the variable occurs in both. But how would I then introduce this as a new column into the results data frame above, to then calculate which features exceed a correlation coefficient threshold in more than 50% of their correlations, where each feature has different numbers of total correlations?
Any advice would be appreciated.
EDIT: an potential (inefficient) approach.
- subsection out the correlations exceeding threshold, 2)count the number of times the variable occurs across both columns of variables and save, 3)merge this with the number of total correlations these features have and get percentage, 4)keep the features > 50%