Suppose I have something like the following vector:

text <- as.character(c("string1", "str2ing", "3string", "stringFOUR", "5tring", "string6", "s7ring", "string8", "string9", "string10"))

I want to run a loop that computes the pairwise edit distance for all possible combinations of these strings (e.g., string 1 to string 2, string 1 to string 3, and so forth). The output should be a matrix with one row and one column per string.

I have the following code below:

#Matrix of pair-wise combinations
m <- expand.grid(text,text)

#Define number of strings
n <- c(1:10)

#Begin loop; "method='osa'" in stringdist is default
for (i in 1:10) {
  n[i] <- stringdist(m[i,1], m[i,2], method="osa")
  write.csv(data.frame(distance=n[i]),file="/File/Path/output.csv",append=TRUE)
  print(n[i])
  flush.console()
}

The stringdist() function is from the stringdist package. The base utils package provides a similar function, adist(), which computes a generalized Levenshtein distance.

My question is: why is my loop not writing the results as a matrix, and how do I stop the loop from overwriting each individual distance calculation (i.e., save all results in matrix form)?

Steven Beaupré
DV Hughes

1 Answer

I would suggest using stringdistmatrix instead of stringdist (especially if you are using expand.grid), since it computes all pairwise distances in a single call:

 res <- stringdistmatrix(text, text)
 dimnames(res) <- list(text, text)  
 write.csv(res, "file.csv")

As for your concrete question, "why is my loop not writing the results as a matrix":
It is not clear why you would expect the output to be a matrix. The loop calculates one element at a time, saves it to a vector, and writes a single value from that vector to disk on each pass.
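To make the vector-versus-matrix point concrete, here is a minimal sketch of how the full vector of pairwise distances could be reshaped into a matrix afterwards. It assumes the stringdist package is installed; the name dist_mat is only illustrative, and stringsAsFactors = FALSE is added so the pairs stay as plain character strings.

```r
library(stringdist)

text <- c("string1", "str2ing", "3string", "stringFOUR", "5tring",
          "string6", "s7ring", "string8", "string9", "string10")

# All ordered pairs; the first column varies fastest
m <- expand.grid(text, text, stringsAsFactors = FALSE)

# stringdist() is vectorised, so one call covers every row of m
d <- stringdist(m[, 1], m[, 2], method = "osa")

# matrix() fills column by column, which matches the row order of expand.grid()
dist_mat <- matrix(d, nrow = length(text), dimnames = list(text, text))
```

With this layout, dist_mat["string1", "string6"] holds the distance between those two strings. It still keeps every distance in memory at once, so it may not suit very large inputs.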

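Also, you should be aware that write.csv is a deliberately inflexible wrapper around write.table: attempts to change append, col.names, sep, dec or qmethod are ignored with a warning. That is why append=TRUE has no effect in your loop and each pass overwrites the file. Use write.table instead.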

If you want to do this iteratively, I would do the following:

library(stringdist)

out_file <- "~/Desktop/output.csv"
n <- numeric(nrow(m))   # one slot per row of m, i.e. per pair of strings

# Column names, written only once; append=FALSE starts a fresh file
write.table(rbind(c("i", "distance")), file=out_file, append=FALSE,
            sep=",", col.names=FALSE, row.names=FALSE)

for (i in seq_len(nrow(m))) {   # loop over every pair, not just the first 10
  n[i] <- stringdist(m[i,1], m[i,2], method="osa")
  # Append one row per pair so earlier results are not overwritten
  write.table(data.frame(i=i, distance=n[i]), file=out_file,
              append=TRUE, sep=",", col.names=FALSE, row.names=FALSE)
  print(n[i])
  flush.console()
}
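If the goal is a file laid out as a matrix while still saving progress as the loop runs, one possible variation is to compute and append a whole row per iteration. This is only a sketch under assumptions: the stringdist package is installed, the strings contain no commas or quotes, and out_file is a hypothetical temporary path rather than the path used above.

```r
library(stringdist)

text <- c("string1", "str2ing", "3string", "stringFOUR", "5tring",
          "string6", "s7ring", "string8", "string9", "string10")
out_file <- tempfile(fileext = ".csv")  # hypothetical output path

# Header row: an empty corner cell followed by the strings as column names
write.table(rbind(c("", text)), file = out_file, sep = ",",
            col.names = FALSE, row.names = FALSE)

for (i in seq_along(text)) {
  # Distances from text[i] to every string: one full row of the matrix
  d <- stringdist(text[i], text, method = "osa")
  # Row label first, then the distances as separate numeric columns
  write.table(data.frame(text[i], t(d)), file = out_file, append = TRUE,
              sep = ",", col.names = FALSE, row.names = FALSE)
}

# Read the finished file back as a labelled numeric matrix
res <- as.matrix(read.csv(out_file, row.names = 1, check.names = FALSE))
```

Only one row of distances is held in memory at a time, and a partially written file still contains complete rows if the session aborts.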
Ricardo Saporta
  • R sessions abort with large strings (due to RAM/memory issues). That is why I am using the matrix notation, stringdist() as opposed to stringdistmatrix(), and periodically saving and printing results throughout the loop execution – DV Hughes Aug 05 '13 at 23:06
  • @DVHughes that makes sense. Try using `write.table` instead (see edit) – Ricardo Saporta Aug 05 '13 at 23:58