I'm implementing MinHash and LSH in Octave/Matlab. I need to build a set (a cell array or an array) of shingles of size k from a given document, and I don't know how to do it.

What I have right now is this simple code:

doc = fopen(document);
i = 1;
while (! feof(doc) )
  txt{i} = strread(fgetl(doc), '%s');
  i++;
endwhile
fclose(doc);

This creates a cell array in which each cell holds the words from one line of the document. It is meant to be the argument of the function I'm trying to write.

nkt09
  • And what is the problem or the question? – Andy Dec 13 '15 at 14:23
  • The problem is implementing MinHash and locality-sensitive hashing to find similar items (using Jaccard similarity). For that I need to create a set of shingles from a document, which is passed as an argument. I want to return a set of shingles of size k; for example, a shingle size of 5 means that each cell holds 5 words. – nkt09 Dec 13 '15 at 15:22
  • Please explain what you mean by the term shingles in this context. A specific input/output example would help. – Nick J Dec 14 '15 at 16:39

1 Answer

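This code should do the trick. It takes a cell array of words and builds shingles (word n-grams) of the specified size, with the words of each shingle separated by a single space.

function S = shingles(txt, shingle_size)
  % txt: cell array of words; shingle_size: number of words per shingle
  txt = txt(:).';                      % accept row or column input
  n = numel(txt) - shingle_size + 1;   % number of shingles
  S = cell(1, max(n, 0));              % empty if the document is too short
  for i = 1:n
    % strjoin keeps the separating space (strcat would strip trailing whitespace)
    S{i} = strjoin(txt(i:i + shingle_size - 1), ' ');
  end
end

You can test the code with the following example:

txt = {'a', 'b', 'c'};
S = shingles(txt, 2)
S =
{
  [1,1] = a b
  [1,2] = b c
}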
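
Since the question mentions Jaccard similarity, here is a minimal sketch of how shingle sets could be compared once they are built. It is written for Octave and makes several assumptions: the helper names word_shingles and jaccard are hypothetical, the sample word lists are invented for illustration, and the per-line cells are assumed to be column cell arrays like the ones the reading loop in the question produces.

```octave
1;  % Octave idiom: marks the file as a script so functions can be defined inline

function S = word_shingles(words, k)
  % Hypothetical helper: k-word shingles joined by single spaces.
  words = words(:).';
  n = numel(words) - k + 1;
  S = cell(1, max(n, 0));
  for i = 1:n
    S{i} = strjoin(words(i:i + k - 1), ' ');
  end
end

function s = jaccard(A, B)
  % Hypothetical helper: size of the intersection over size of the union.
  u = numel(union(A, B));
  if u == 0
    s = 1;   % convention assumed here: two empty sets count as identical
  else
    s = numel(intersect(A, B)) / u;
  end
end

% Invented sample data: one cell of words per line of each document.
lines1 = {{'the'; 'quick'; 'brown'}, {'fox'; 'jumps'}};
lines2 = {{'the'; 'quick'; 'brown'}, {'fox'; 'sleeps'}};

% Flatten the per-line cells into one column of words per document.
words1 = vertcat(lines1{:});
words2 = vertcat(lines2{:});

% unique turns each list of shingles into a set.
A = unique(word_shingles(words1, 2));
B = unique(word_shingles(words2, 2));
sim = jaccard(A, B);
```

With this sample data the two documents share three of five distinct 2-word shingles, so sim is 0.6. Flattening first lets shingles span line breaks; if shingles should stay within a line, the helper could be applied to each line's cell separately instead.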
mariolpantunes