Conditional for loop in R not recognizing conditional statement?

Question

Assume I have a data structure similar to the following, where doc_id is the document identifier, text_id is the unique text/version identifier, and text is a character string:

df <- cbind(doc_id=as.numeric(c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6)), 
                text_id=as.numeric(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)), 
                  text=as.character(c("string1", "str2ing", "3string", 
                                      "string6", "s7ring", "string8", 
                                      "string9", "string10")))

What I am attempting to do in the loop structure is do string edit-distance comparisons, but only for different versions of the same documents. In short, I want to find matching doc_ids and pair-wise compare only different versions (text_ids) of the same document.

#Results matrix
result <- matrix(ncol=10, nrow=10)

#Loop
i=1
for (j in 1:length(df[,2])) {
  for (i in 1:length(df[,2])) {
#Conditional Statements
    if(df[i,1]==df[j,1]){
      result[i,j]<-levenshteinDist(df[j,3], df[i,3])}
    else(result[i,j]<-"Not Compared")
  }
  print(result[i,j])
  flush.console()
}

Returns:

[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "0"

The levenshteinDist() function can be found in the RecordLinkage package, but a similar function, adist(), is also bundled in the utils package.
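For reference, here is a minimal sketch of what the adist() call looks like, assuming only base R. The variable names (a, b, texts, dist_matrix) are illustrative and not part of my actual code.

```r
# adist() ships with base R (utils package), so no extra package is needed.
a <- "string1"
b <- "str2ing"

# For two single strings, adist() returns a 1x1 matrix holding the
# generalized Levenshtein (edit) distance.
d <- adist(a, b)
print(d[1, 1])

# Passing a whole character vector returns the full pairwise distance matrix.
texts <- c("string1", "str2ing", "3string")
dist_matrix <- adist(texts)
print(dist_matrix)
```

Note that adist() always returns a matrix, so a single distance has to be pulled out with d[1, 1].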

My question is: why is my first conditional statement (if) being ignored, and only the else portion being returned?

Any further advice on coding or processing time efficiency gains will be greatly appreciated.

Have you checked the result matrix? It seems to be working; the output you see here is only for the last row (or column). Look at your i, j ordering. Also, you are comparing the same column with `if(df[i,1]==df[j,1])`. Did you mean to do `if(df[i,1]==df[j,2])`? — Ananta, Nov 01 '13 at 00:20
I only want to make comparisons (string distances) between different versions of the 'same' document (equivalent doc_ids but varying text_ids), so no. But you were correct in suggesting that I change the output structure, as per @Maiasaura's original recommendation. — DV Hughes, Nov 01 '13 at 14:23

score 0 · Answer 1 · answered Nov 01 '13 at 00:19

For starters, if I understand your objective, the if-statement should read if (df[i,1]==df[j,2]), so that you are making comparisons between the values of the two columns.

The problem here isn't that your conditional is being ignored; you are printing your results in the wrong place. result is a 10x10 matrix, but print(result[i,j]) sits outside the inner loop, so it runs only once per value of j, after i has already reached 10. You therefore see only the last row of the matrix. I think the code should look more like this:

for (i in 1:length(df[,2])) {
    for (j in 1:length(df[,2])) {

        if(df[i,1]==df[j,2]) {
            result[i,j]<-adist(df[j,3], df[i,3])
        } else {
            (result[i,j]<-"Not Compared")
        }
    }
}

This will build the matrix of results, and you can then view the results of all 100 comparisons as you desire.
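To see why the original loop printed only ten values, here is a small sketch on a toy 3x3 loop. The seen vector is a hypothetical helper used only to record what a print() placed after the inner loop would observe.

```r
# Illustrative sketch: a loop variable keeps its last value once the loop ends.
seen <- character(0)

for (j in 1:3) {
  for (i in 1:3) {
    # the comparison would go here
  }
  # The inner loop has finished by this point, so i is always 3.
  seen <- c(seen, sprintf("i=%d, j=%d", i, j))
}

print(seen)
```

Every recorded entry has i equal to 3, which mirrors the question: with ten rows, each print() showed result[10, j], i.e. only the last row.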

I only want to make comparisons (string distances) between different versions of the 'same' document (equivalent doc_ids but varying text_ids). — DV Hughes, Nov 01 '13 at 00:27

score 0 · Accepted Answer · answered Nov 01 '13 at 00:20

You're not outputting correctly. Run this version and see the comparisons happening in place. Comment out the message() once you are satisfied that everything is working correctly.

library(RecordLinkage)

df <- structure(c("1", "1", "2", "2", "3", "4", "4", "4", "5", "6", 
"1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "string1", 
"str2ing", "3string", "string6", "s7ring", "string8", "string9", 
"string10", "string1", "str2ing"), .Dim = c(10L, 3L), .Dimnames = list(
    NULL, c("doc_id", "text_id", "text")))

result <- matrix(ncol = 10, nrow = 10)
# nrow() and ncol() are more elegant ways of getting row/column counts.
for(j in 1:nrow(df)) {
    for(i in 1:nrow(df)) {
        message(sprintf("comparing i=%s (%s), j=%s (%s)", i, df[i, 1], j, df[j, 1]))
        if(identical(df[i, 1], df[j, 1])) {
            result[i, j] <- levenshteinDist(df[j, 3], df[i, 3])
        } else {
            result[i, j] <- "Not Compared"
        }
        # printing inside the inner for loop
        print(result[i, j])
    }

}
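On the efficiency question, a vectorized version may be worth trying. The sketch below is one possible approach, not a drop-in replacement: it assumes base R's adist() is an acceptable substitute for levenshteinDist(), stores the data in a data frame (docs, dist_all, same_doc and result_vec are illustrative names), and uses NA rather than "Not Compared". Mixing numbers with a string coerces the whole matrix to character, which is why the question's output shows "0" in quotes.

```r
# Same values as the structure above, rebuilt as a data frame so that
# doc_id and text_id stay numeric instead of being coerced to character.
docs <- data.frame(
  doc_id  = c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6),
  text_id = 1:10,
  text    = c("string1", "str2ing", "3string", "string6", "s7ring",
              "string8", "string9", "string10", "string1", "str2ing"),
  stringsAsFactors = FALSE
)

# All pairwise edit distances in a single call (utils::adist, base R).
dist_all <- adist(docs$text)

# TRUE where the row and the column belong to the same document.
same_doc <- outer(docs$doc_id, docs$doc_id, "==")

# Keep distances only within a document; NA marks "not compared".
result_vec <- dist_all
result_vec[!same_doc] <- NA
diag(result_vec) <- NA  # optional: drop each text's comparison with itself

print(result_vec[1:2, 1:2])
```

This computes every pairwise distance and then masks the unwanted ones, which is fine for ten rows. For a large corpus it would probably be better to split the texts by doc_id and call adist() within each group.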

Thank you for suggesting the print on each comparison. Also, the identical function makes much more sense in terms of generalizing what my end goal is. — DV Hughes, Nov 01 '13 at 00:28
