I have implemented cosine similarity in MATLAB as shown below. My data is a two-dimensional 50-by-50 matrix. To obtain the cosine similarity, should I compare the rows with each other pairwise?

for j = 1:50
    x = dat(j,:);
    for i = j+1:50
        y = dat(i,:);
        c = dot(x,y);
        sim = c/(norm(x,2)*norm(y,2));
    end
end

Is this correct? Also, what is the time complexity (in big-O terms) of this approach?

sima412
  • "... in a line by line form." Do you mean row-wise or column-wise? – horchler Aug 13 '13 at 20:47
  • Sorry, I mean row-wise: between two rows. – sima412 Aug 13 '13 at 20:55
  • Is the problem (a) finding the complexity of your chosen algorithm (MATLAB-independent), (b) having an efficient algorithm for computing pairwise cosine similarity (again MATLAB-independent), or (c) having an efficient/fast MATLAB implementation? Please try to be concise about what is being asked. – gevang Aug 14 '13 at 21:02

2 Answers

Just a note on an efficient implementation of the same computation using vectorized, matrix-wise operations (which are optimized in MATLAB). This can yield huge time savings for large matrices:

dat = randn(50, 50);

OP (double-for) implementation:

sim = zeros(size(dat));
nRow = size(dat,1);
for j = 1:nRow
    x = dat(j, :);
    for i = j+1:nRow
        y = dat(i, :);
        c = dot(x, y);
        sim(j, i) = c/(norm(x,2)*norm(y,2));
    end
end

Vectorized implementation:

normDat = sqrt(sum(dat.^2, 2));           % L2 norm of each row 
datNorm = bsxfun(@rdivide, dat, normDat); % normalize each row 
dotProd = datNorm*datNorm';               % dot-product vectorized (redundant!) 
sim2 = triu(dotProd, 1);                  % keep unique upper triangular part 

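Comparison for a 1000 x 1000 matrix (MATLAB 2013a, x64, Intel Core i7 960 @ 3.20GHz). The first time is for the double-for loop, the second for the vectorized version, followed by the summed difference between the two results: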

Elapsed time is 34.103095 seconds.
Elapsed time is 0.075208 seconds.
sum(sum(sim-sim2))
ans =
    -1.224314766369880e-14
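
A signed sum of differences can hide errors that cancel each other out, so a stricter check is the largest absolute element-wise difference. The following is a minimal sketch on a small random matrix; the name `maxAbsDiff` and the 20 x 20 size are assumptions made for illustration, not part of the timing run above.

```matlab
dat = randn(20, 20);                 % small random test matrix (assumed size)
nRow = size(dat, 1);

% Double-for reference: cosine similarity of each row pair (j < i)
sim = zeros(nRow);
for j = 1:nRow-1
    for i = j+1:nRow
        sim(j, i) = dot(dat(j, :), dat(i, :)) / (norm(dat(j, :)) * norm(dat(i, :)));
    end
end

% Vectorized version: normalize rows, then one matrix product
normDat = sqrt(sum(dat.^2, 2));              % L2 norm of each row
datNorm = bsxfun(@rdivide, dat, normDat);    % unit-length rows
sim2 = triu(datNorm * datNorm', 1);          % strict upper triangle only

% Largest absolute difference; should be on the order of round-off
maxAbsDiff = max(abs(sim(:) - sim2(:)));
```

If `maxAbsDiff` stays near machine precision, the two implementations agree entry by entry rather than only on average.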
gevang
  • Thank you very much for the note. Does this also hold for Hamming distance and other similarity measures? – sima412 Aug 14 '13 at 20:51
  • @sima412 look into `pdist2` for implementations of different similarity measures (including `'hamming'` and `'cosine'`) defined for pairwise observations. – gevang Aug 14 '13 at 20:56
  • OK, thank you gevang. But I use a different form of the Hamming distance: I want to use 1 - Hamming. – sima412 Aug 15 '13 at 02:42
  • Could you guide me toward a vectorized implementation of (1 - Hamming)? – sima412 Aug 15 '13 at 02:46
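
One possible way to vectorize a "1 - Hamming" similarity, sketched here under the assumption that the data is binary (0/1) and that Hamming distance means the fraction of positions in which two rows differ. The names `B`, `agree` and `hamSim` are hypothetical, and this is not taken from the answer above; it only applies the same matrix-product idea.

```matlab
B = double(rand(6, 8) > 0.5);        % hypothetical binary data, one observation per row
d = size(B, 2);                      % number of positions per row

% For 0/1 data, B*B' counts positions where both rows are 1, and
% (1-B)*(1-B)' counts positions where both rows are 0.
agree = B * B' + (1 - B) * (1 - B)';

% Fraction of matching positions = 1 - (fraction of differing positions)
hamSim = triu(agree / d, 1);         % keep unique upper triangular part
```

For non-binary data this product trick does not apply directly; comparing rows with `==` in a loop, or the built-in functions mentioned in the comments, would be the safer route.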

The outer loop should end at 49, since row 50 has no later row to be compared with. Maybe you should also add an index to sim?

for j = 1:49
  x = dat(j,:);
  for i = j+1:50
      y = dat(i,:);
      c = dot(x,y);
      sim(j) = c/(norm(x,2)*norm(y,2));
  end
end

The complexity should be roughly O(n^2) in the number of rows n, shouldn't it? Maybe you should have a look at correlation functions. I don't understand exactly what you want to compute, but it looks like you want to do something similar, and there are built-in correlation functions in MATLAB.
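
To make the quadratic growth concrete, the inner loop body can simply be counted. This is a small sketch; `nPairs` and `approxOps` are hypothetical names, and the cost model (three length-d passes per pair for the dot product and the two norms) is a rough assumption rather than a measured figure.

```matlab
n = 50;                        % number of rows, as in the question
d = 50;                        % number of columns (length of each row vector)

nPairs = 0;
for j = 1:n-1
    for i = j+1:n
        nPairs = nPairs + 1;   % one dot product and two norms are computed here
    end
end

% nPairs equals n*(n-1)/2, which is 1225 for n = 50
approxOps = nPairs * 3 * d;    % assumed rough cost: three length-d passes per pair
```

Since each pair also touches all d columns, the total work grows like O(n^2 * d) when the row length is not treated as a constant.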

  • Thank you, Melanchtron. I want to obtain the complexity of cosine similarity. I also have another question: I want to compare cosine similarity with Hamming distance, using 1 - Hamming distance as the similarity between two vectors. Do you know the complexity of the Hamming distance in that form, and how to code it in MATLAB? – sima412 Aug 14 '13 at 14:04
  • See http://www.mathworks.de/de/help/stats/pdist.html for Hamming distance; you can use the built-in MATLAB command. If you want to program it on your own, please have a look at gevang's response. – Melanchtron Aug 14 '13 at 20:57
  • You need a double index into `sim`, i.e. `sim(j, i)`; otherwise every pass of the inner `i` loop overwrites the previous value, so `sim(j)` ends up holding only the result of the last iteration, `i=50`. Effectively, you need to store `n(n-1)/2` similarity values in a one- or two-dimensional array. The latter may be more convenient. – gevang Aug 19 '13 at 17:21
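
The one-dimensional storage option mentioned in the last comment can be sketched with a running counter. The names `simVec` and `k` and the 5 x 4 test matrix are assumptions made for illustration.

```matlab
dat = randn(5, 4);                   % small hypothetical data set
n = size(dat, 1);

simVec = zeros(1, n * (n - 1) / 2);  % one slot per unordered row pair
k = 0;                               % running position in simVec
for j = 1:n-1
    x = dat(j, :);
    for i = j+1:n
        y = dat(i, :);
        k = k + 1;
        simVec(k) = dot(x, y) / (norm(x) * norm(y));
    end
end
```

The pairs are stored in the order (1,2), (1,3), ..., (1,n), (2,3), and so on, so the two-dimensional `sim(j, i)` layout is usually easier to index, at the cost of leaving the lower triangle unused.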