finding similar strings in each row of two different data frame

Question

I would like to check two data set. one data has many columns (this example has two columns df1) and one data has one column (df2)

At first, I want to check the first column of df1 each row with all part of df2 if any similar part is found, then the row number of df1 and df2 is written

for example Column 1 of df1 has two similar part of the row to df2 Q9Y6Q9 in row 3 of df1 with Q9Y6Q9 in row 4 of df2 . so the output is 3-4 , the same for others

I think your grammar is throwing me off and perhaps English is not your native language, but I'm not quite understanding. However, I think this involves using a `for` loop to loop through your two data frames, row by row and an `if` statement inside your `for` loop that checks if your condition is met and makes the edit if so. — giraffehere, Feb 25 '16 at 22:24
Not sure what it is you are really after since your specification does not match expected output. But, here is a line of code that finds any 'substrings' in df1$sample_1 column in corresponding rows in df2. `apply(df1, 1, function(x) grep(paste(unlist(strsplit(x[1], ';')), collapse = '|'), df2$subunits..UniProt.IDs.))` — Gopala, Feb 25 '16 at 22:40
@Gopala thanks but the below is the right answer. however, I am still waiting for some modification , if you have any idea , you can adjust his answer — nik, Feb 27 '16 at 08:22

Karsten W. · Accepted Answer · 2016-02-27T16:18:56.560

Maybe you should normalize your data first. For instance, you could do:

normalize <- function(x, delim) {
    x <- gsub(")", "", x, fixed=TRUE)
    x <- gsub("(", "", x, fixed=TRUE)
    idx <- rep(seq_len(length(x)), times=nchar(gsub(sprintf("[^%s]",delim), "", as.character(x)))+1)
    names <- unlist(strsplit(as.character(x), delim))
    return(setNames(idx, names))
}

This function can applied to each column of df1 as well as the lookup table df2:

s1 <- normalize(df1[,1], ";")
s2 <- normalize(df1[,2], ";")
lookup <- normalize(df2[,1], ",")

With this normalized data, it is easy to generate the output you are looking for:

process <- function(s) {
    lookup_try <- lookup[names(s)]
    found <- which(!is.na(lookup_try))
    pos <- lookup_try[names(s)[found]]
    return(paste(s[found], pos, sep="-"))
    #change the last line to "return(as.character(pos))" to get only the result as in the comment
}

process(s1)
# [1] "3-4" "4-1" "5-4"
process(s2)
# [1] "2-4"  "3-15" "7-16"

The output is not exactly the same as in the question, but this may be due to manual lookup errors.

In order to iterate over all columns of df1, you could use lapply:

res <- lapply(colnames(df1), function(x) process(normalize(df1[,x], ";")))
names(res) <- colnames(df1)

Now, res is a list indexed by the column names of df1:

res[["sample_1"]]
# [1] "4" "1" "4"

I don't know how to thank you. I liked your answer already. I have two requests. can I make 1- a command that generate s1 to s number of columns of df1 ? so I don't need to make them manually. 2- only give me the rows number of df2 . for example process(s1) # [1] "4" "1" "4" . — nik, Feb 26 '16 at 14:45

finding similar strings in each row of two different data frame

1 Answers1

Linked