3

I am working with five Excel columns (A, B, C, D, E) of words ("Aaa", "Aab", ...) and I want to find the words that match exactly across all the columns (in R).

A   B   C   D   E  
Aaa Aaa Baa Aaa Ass
Aab Ccc Aaa Baa Aaa
Ccc Abc Ccc Ccc Ccc
... ... ... ... ... 

I created a vector for each column and then tried a for loop with `if` and the `grep` function:

    for(i in A_vector) {
        if(grep("i", B_vector))
            if(grep("i", C_vector))
                if(grep("i", D_vector))
                    if(grep("i", E_vector))
                        print(i)
    }

However, this only gives me the words in the first vector, A_vector.
In the end I would like a vector with the words ("Aaa", "Bbb", ...) that appear in all 5 columns. I do not need the position of each match within the vectors, just the words that are common to all of them.

 Result
    [1] "Aaa"
    [2] "Ccc"
    [n]  ...

Thank you in advance!

  • Hi J L Carballo, welcome to Stack Overflow. I think you can achieve this with a straightforward comparison: `index = which(data$A == data$B & data$A == data$C & data$A == data$D & data$A == data$E)`. This gives you the index of every row that contains the same string in all columns. With `data[index, ]` you get all the rows with identical strings; with `data$A[index]` you get a vector of the strings that are equal across all columns. – TinglTanglBob Sep 12 '18 at 17:32
  • `grep` is excellent for matching patterns with regular expressions. For exact matches use `==` or `%in%`, depending on whether you need element-wise matching or not. However, for finding "elements in common", `intersect` is probably an even better bet. If your input is a data frame named `dd`, I think you're looking for `Reduce(intersect, dd)` – Gregor Thomas Sep 12 '18 at 17:56
  • Suggested duplicate: [find elements in common for at least 2 vectors](https://stackoverflow.com/q/26175561/903061) – Gregor Thomas Sep 12 '18 at 18:01
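
As a rough illustration of the `intersect` / `%in%` suggestions in the comments: in the question's loop, `grep("i", B_vector)` searches for the literal letter "i" rather than the value of the loop variable `i`. The sketch below avoids `grep` altogether. The data frame `dd` is a hypothetical stand-in built from the first three rows of the question's table, and the names `common` and `common_in` are made up for the example.

```r
# Hypothetical data: the first three rows of the question's table
dd <- data.frame(
  A = c("Aaa", "Aab", "Ccc"),
  B = c("Aaa", "Ccc", "Abc"),
  C = c("Baa", "Aaa", "Ccc"),
  D = c("Aaa", "Baa", "Ccc"),
  E = c("Ass", "Aaa", "Ccc"),
  stringsAsFactors = FALSE
)

# Fold intersect() over the columns: ((((A n B) n C) n D) n E)
common <- Reduce(intersect, dd)
print(common)
# [1] "Aaa" "Ccc"

# The same result with exact matching via %in%, starting from column A
in_all <- dd$A %in% dd$B & dd$A %in% dd$C & dd$A %in% dd$D & dd$A %in% dd$E
common_in <- unique(dd$A[in_all])
print(common_in)
# [1] "Aaa" "Ccc"
```

If the columns have different lengths, a plain list of character vectors should work the same way in place of the data frame, e.g. `Reduce(intersect, list(A_vector, B_vector, C_vector, D_vector, E_vector))`.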

3 Answers

2

You are asking to find the elements that different lists have in common, not just duplicates in general. In the example below, Aaa, Ccc and Ddd are duplicated within a single list, but the only element shared between two different lists is Xxx. `intersect()` will find it, with two nested `lapply` calls.

    A = list("Aaa", "Aaa", "Ccc", "Ccc")
    B = list("Ddd", "Ddd", "Ddd", "Eee")
    C = list("Fff", "Ggg", "Hhh", "Iii", "Jjj")
    D = list("Kkk", "Lll", "Mmm", "Nnn", "Xxx")
    E = list("Ppp", "Qqq", "Rrr", "Xxx")
    Mylist <- list(A, B, C, D, E)

    dupes <- unlist(lapply(Mylist, function(x) lapply(Mylist, function(y) intersect(x,y))))

    unique(dupes[duplicated(dupes)])

    [1] "Xxx"

To see where the intersections are, this will tell you that your 4th list has 1 element in common with your 5th list:

    sapply(seq_along(Mylist), function(x)
      sapply(seq_along(Mylist), function(y)
        length(intersect(unlist(Mylist[x]), unlist(Mylist[y])))))

         [,1] [,2] [,3] [,4] [,5]
    [1,]    2    0    0    0    0
    [2,]    0    2    0    0    0
    [3,]    0    0    5    0    0
    [4,]    0    0    0    5    1
    [5,]    0    0    0    1    4
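
The matrix only gives counts. To list the shared words themselves for each pair, something along these lines might help. It is only a sketch: it assumes the data are stored as named character vectors rather than lists, and `shared` and `nonempty` are names invented for the example.

```r
# Hypothetical data: the same words as above, as named character vectors
Mylist <- list(
  A = c("Aaa", "Aaa", "Ccc", "Ccc"),
  B = c("Ddd", "Ddd", "Ddd", "Eee"),
  C = c("Fff", "Ggg", "Hhh", "Iii", "Jjj"),
  D = c("Kkk", "Lll", "Mmm", "Nnn", "Xxx"),
  E = c("Ppp", "Qqq", "Rrr", "Xxx")
)

# Intersect every unordered pair of vectors; p holds the two names of a pair
shared <- combn(names(Mylist), 2,
                FUN = function(p) intersect(Mylist[[p[1]]], Mylist[[p[2]]]),
                simplify = FALSE)

# Label each result with its pair, e.g. "A-B"
names(shared) <- combn(names(Mylist), 2, FUN = paste, collapse = "-")

# Keep only the pairs that actually share something
nonempty <- shared[lengths(shared) > 0]
print(nonempty)
# $`D-E`
# [1] "Xxx"
```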
Anonymous coward
-1

You could try something, though a little convoluted, using data.table:

    library(data.table)

    setDT(data)

    data[, unlist(lapply(.SD, intersect, y = unique(A))), A][, .N, A][N == (ncol(data) - 1), A]
C-x C-c
-1

Here is the edited answer based on your explanation that you want to find all the matches between at least two of the columns:

    Mylist <- list(A = c("Aaa","Aab","Ccc","Ddd"),
                   B = c("Aaa","Ccc","Abc","Abd"),
                   C = c("Baa","Aaa","Ccc","Abb","Ddd"),
                   D = c("Aaa","Baa","Ccc","CBB","Baa"),
                   E = c("Ass","Aaa","Ccc","Gef"))
    CharVec <- unlist(Mylist)
    unique(CharVec[duplicated(CharVec)])
Shirin Yavari
  • I'm working with 5 variables, each 150 to 650 words long. The words are not repeated within the same variable. Using either `Reduce(intersect, dd)` or `Reduce(df, intersect)` I get the error "argument "init" is missing, with no default". I also tried passing the first variable as `.init`, but without success. – J L Carballo Sep 13 '18 at 13:12
  • Commands and functions are case-sensitive. You are trying `Reduce` instead of `reduce`. You'll also need `library(purrr)`. @shirin this finds matches between all columns. OP is looking for matches between at least 2 columns. – Anonymous coward Sep 13 '18 at 15:40
  • @JLCarballo I edited my answer after your comment, please take a look and see if that's what you need – Shirin Yavari Sep 14 '18 at 17:30
  • Thank you @shirin! It works well this way and I obtain all the matches. I also checked with Excel: there is no match across all the columns, but there are some between two or three. – J L Carballo Sep 17 '18 at 12:13
  • @JLCarballo this does not do what you are looking for. This ungroups all columns and finds any duplicates. For example `Mylist <-list(A=c("Aaa","Aaa","Ccc","Ccc"), B=c("Ddd","Ddd","Ddd","Eee"), C=c("Fff","Ggg","Hhh","Iii","Jjj"))`, this solution would return `Aaa, Ccc, and Ddd`, which are not duplicated across columns. – Anonymous coward Sep 18 '18 at 14:23
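
One possible way around the caveat in the last comment is to drop repeats within each column before counting, so that a word is counted once per column it appears in. This is a sketch, not a tested solution for the real data. Columns A, B and C are taken from the comment above; column D is a hypothetical addition so that something is actually shared. The variable names are made up for the example.

```r
# A, B, C from the comment above; D is a hypothetical extra column
Mylist <- list(
  A = c("Aaa", "Aaa", "Ccc", "Ccc"),
  B = c("Ddd", "Ddd", "Ddd", "Eee"),
  C = c("Fff", "Ggg", "Hhh", "Iii", "Jjj"),
  D = c("Eee", "Jjj")
)

# Naive version: also reports words repeated within a single column
CharVec <- unlist(Mylist)
naive <- unique(CharVec[duplicated(CharVec)])
print(naive)
# [1] "Aaa" "Ccc" "Ddd" "Eee" "Jjj"

# Deduplicate each column first, then count in how many columns each word occurs
per_column <- lapply(Mylist, unique)
counts <- table(unlist(per_column))

# Words present in at least two different columns
in_two_or_more <- names(counts)[as.vector(counts) >= 2]
print(in_two_or_more)
# [1] "Eee" "Jjj"

# Words present in every column (none in this example)
in_all <- names(counts)[as.vector(counts) == length(Mylist)]
```

Changing the threshold from `>= 2` to `== length(Mylist)` switches between "shared by at least two columns" and "common to all columns".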