Suppose I have a data structure like the following, where doc_id is the document identifier, text_id is the unique text/version identifier, and text is a character string:
df <- cbind(doc_id = as.numeric(c(1, 1, 2, 2, 3, 4, 4, 4, 5, 6)),
            text_id = as.numeric(c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)),
            text = as.character(c("string1", "str2ing", "3string",
                                  "string6", "s7ring", "string8",
                                  "string9", "string10")))
In the loop below I am trying to compute string edit distances, but only between different versions of the same document. In short, I want to find matching doc_ids and compare, pairwise, only the different versions (text_ids) of the same document.
# Results matrix
result <- matrix(ncol = 10, nrow = 10)
# Loop
i = 1
for (j in 1:length(df[, 2])) {
  for (i in 1:length(df[, 2])) {
    # Conditional statements
    if (df[i, 1] == df[j, 1]) {
      result[i, j] <- levenshteinDist(df[j, 3], df[i, 3])
    } else {
      result[i, j] <- "Not Compared"
    }
  }
  print(result[i, j])
  flush.console()
}
This prints:
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "Not Compared"
[1] "0"
The levenshteinDist() function can be found in the RecordLinkage package, but a similar function, adist(), is also bundled in the utils package.
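For reference, here is a minimal sketch of how the same within-document comparison might look with adist(). The data frame docs and its strings are made up for illustration (they are not my real data), and grouping with split() is just one possible approach, not necessarily the best one:

```r
# Hypothetical toy data with the same three columns as in the question.
# A data.frame keeps the ids numeric, whereas cbind() coerces every column to character.
docs <- data.frame(
  doc_id  = c(1, 1, 2, 2, 3),
  text_id = 1:5,
  text    = c("string1", "str2ing", "3string", "string4", "s5ring"),
  stringsAsFactors = FALSE
)

# Group the texts by document, then let adist() (utils) build the full
# pairwise edit-distance matrix for each group.
by_doc <- split(docs$text, docs$doc_id)
dists  <- lapply(by_doc, function(txt) adist(txt))

# Distance matrix for the versions of document 1
dists[["1"]]
```

Each element of dists is a square matrix with one row and column per version of that document, so texts from different documents are never compared.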
My question is: why is my first conditional statement (if) being ignored, and only the else portion being returned?
Any further advice on coding style or on reducing processing time would be greatly appreciated.