I have some sequence data to compare. The expected output is a distance matrix showing how dissimilar each sequence is from the others. Previously I used ngram.NGram.compare in Python, and now I want to switch to R. I found the ngram and biogram packages, but I could not find a function that generates the expected output.

Assume this is the data:

a <- c("ham","bam","comb")

The output should be like this (distance between each item):

#      ham    bam   comb
#ham    0     0.5   0.83
#bam   0.5     0     0.6
#comb  0.83   0.6     0

This is the equivalent Python code, which returns the distances for each pair (i < j):

import ngram

a = ["ham", "bam", "comb"]
# 1-gram distance for every pair in the upper triangle (i < j)
[1 - ngram.NGram.compare(a[i], a[j], N=1)
 for i in range(len(a))
 for j in range(i + 1, len(a))]
Hadij
  • Are the sequences of the same length? – missuse Mar 12 '18 at 12:45
  • @missuse The sequences I have are all the same length, although the example above is not. It would be better to support both equal and different lengths. I don't think 1-gram comparison is length-sensitive. – Hadij Mar 12 '18 at 13:37
  • Could you explain in detail how the comparison is done using 1-gram? – missuse Mar 12 '18 at 13:44
  • @missuse I made it clear here: https://stackoverflow.com/questions/49252396/ngram-representation-and-distance-matrix-in-r – Hadij Mar 13 '18 at 09:23

1 Answer

You could use stringdistmatrix from the stringdist package. See the stringdist-metrics documentation for the available metrics.

a <- c("ham","bam","comb")
# Jaccard distance on q-gram profiles; q defaults to 1, i.e. single characters
stringdist::stringdistmatrix(a, a, method = "jaccard")

          [,1] [,2]      [,3]
[1,] 0.0000000  0.5 0.8333333
[2,] 0.5000000  0.0 0.6000000
[3,] 0.8333333  0.6 0.0000000
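
As a cross-check against the Python side of the question, here is a minimal sketch of what these numbers correspond to, assuming the metric treats each string as a set of distinct characters (1-grams) and returns one minus the ratio of shared to total characters. The helper names `unigram_jaccard_distance` and `distance_matrix` are made up for illustration and are not part of stringdist or ngram.

```python
def unigram_jaccard_distance(s, t):
    """Jaccard distance between the sets of characters (1-grams) of s and t."""
    s_set, t_set = set(s), set(t)
    union = s_set | t_set
    if not union:
        # Assumption: two empty strings are treated as identical.
        return 0.0
    return 1 - len(s_set & t_set) / len(union)


def distance_matrix(items):
    """Full symmetric matrix of pairwise distances, as a list of rows."""
    return [[unigram_jaccard_distance(x, y) for y in items] for x in items]


a = ["ham", "bam", "comb"]
matrix = distance_matrix(a)

# Print one labelled row per input string.
for name, row in zip(a, matrix):
    print(name.ljust(5), "  ".join(f"{d:.4f}" for d in row))
```

For example, "ham" and "comb" share only "m" out of six distinct characters, so the distance is 1 - 1/6, which matches the 0.8333333 entry above. Repeated characters are collapsed by the set conversion, so this sketch may differ from an n-gram comparison that counts duplicates.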
phiver
  • Thank you very much. Perhaps my expected output was not a good example. I know the Jaccard method, but I want to use n-grams to compare two sequences. The output may differ from what I wrote above. – Hadij Mar 12 '18 at 12:35
  • In that case you will have to give a better example of exactly what you want to compare and what the result should look like. – phiver Mar 12 '18 at 12:44
  • I made it clear here. Now the question is different: https://stackoverflow.com/questions/49252396/ngram-representation-and-distance-matrix-in-r – Hadij Mar 13 '18 at 09:24