
Introduction and source code

I am trying to compute the cosine similarity between two sparse vectors of dimension 169647. As input, each vector is represented as a string of the form index:value, where only the non-zero elements of the vector are given an index.

x = "1:0.1 43:0.4 100:0.43 10000:0.9"
y = "200:0.5 500:0.34 501:0.34"

First we convert each of x and y into a vector<float> by using the function splitVector. Then we compute the similarity by using the function cosine_similarity. Never mind the split function; I am including it just in case you wish to run the code.
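For reference, the cosine similarity being computed here is the dot product of the two vectors divided by the product of their Euclidean norms: cos(x, y) = sum_i(x_i * y_i) / (sqrt(sum_i x_i^2) * sqrt(sum_i y_i^2)).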

#include <iostream>
#include <string>
#include <vector> 
#include <algorithm>
#include <cmath>   // for sqrt
#include <cstdlib> // for atoi, atof

using namespace std;

void split(const string& s, char c, vector<string>& v) {
   string::size_type i = 0;
   string::size_type j = s.find(c);

   while (j != string::npos) {
      v.push_back(s.substr(i, j-i));
      i = ++j;
      j = s.find(c, j);
   }
   // push the last token (this also fixes the original bug where a string
   // without the delimiter produced no tokens at all)
   v.push_back(s.substr(i));
}

float cosine_similarity(const std::vector<float> & A,const std::vector<float> & B)
{
    float dot = 0.0, denom_a = 0.0, denom_b = 0.0 ;
    for(unsigned int i = 0; i < A.size(); ++i)
    {
        dot += A[i] * B[i] ;
        denom_a += A[i] * A[i] ;
        denom_b += B[i] * B[i] ;
    }
    return dot / (sqrt(denom_a) * sqrt(denom_b)) ;
}

void splitVector(const vector<string> & v, vector<float> & values)
{
    vector<string> tmpv;
    for(unsigned int i = 0; i < v.size(); i++)
    {
        split(v[i], ':', tmpv);        // tmpv[0] = index, tmpv[1] = value
        int idx = atoi(tmpv[0].c_str());
        float val = atof(tmpv[1].c_str());
        tmpv.clear();
        values[idx] = val;
    }
}

int main()
{
   //INPUT VECTORS.
   vector<string> x {"1:0.1","43:0.4","50:0.43","90:0.9"};
   vector<string> y {"20:0.5","40:0.34","50:0.34"};
   
   //STEP 1: Initialize vectors
   int dimension = 169647;
   vector<float> X;
   X.resize(dimension, 0.0);
   
   vector<float> Y;
   Y.resize(dimension, 0.0);
   
   //STEP 2: CREATE FLOAT VECTORS
   splitVector(x, X);
   splitVector(y, Y);
   
   //STEP 3: COMPUTE COSINE SIMILARITY
   cout << cosine_similarity(X,Y) << endl;
}

Problem and proposed solution

Initializing and filling the vector<float> is a problem: it really takes a lot of execution time. I was thinking of using the std::map<int,float> structure in C++, where X and Y would be represented by:

std::map<int,float> x_m{ make_pair(1,0.1), make_pair(43,0.4), make_pair(50,0.43), make_pair(90,0.9)};
std::map<int,float> y_m{ make_pair(20,0.5), make_pair(40,0.34), make_pair(50,0.34)};

For this purpose I used the following function:

float cosine_similarity(const std::map<int,float> & A, const std::map<int,float> & B)
{
    float dot = 0.0, denom_a = 0.0, denom_b = 0.0 ;
    for(auto &a:A)
    {
      denom_a += a.second * a.second ;
    }

    for(auto &b:B)
    {
      denom_b += b.second * b.second ;
    }

    for(auto &a:A)
    {
        auto it = B.find(a.first);   // look the key up only once
        if(it != B.end())
        {
          dot += a.second * it->second ;
        }
    }
    return dot / (sqrt(denom_a) * sqrt(denom_b)) ;
}
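For completeness, a minimal usage sketch of this map-based version (assuming the function above is in scope; in the sample data only index 50 is non-zero in both maps, so the dot product has a single term):

#include <cmath>
#include <iostream>
#include <map>

// cosine_similarity(const std::map<int,float>&, const std::map<int,float>&)
// is assumed to be defined above.

int main()
{
    std::map<int,float> x_m{ {1, 0.1f}, {43, 0.4f}, {50, 0.43f}, {90, 0.9f} };
    std::map<int,float> y_m{ {20, 0.5f}, {40, 0.34f}, {50, 0.34f} };

    // Only index 50 overlaps, so dot = 0.43f * 0.34f.
    std::cout << cosine_similarity(x_m, y_m) << std::endl;
}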

Question

  • Can you help me with the math of the complexity?
  • Will the second proposed function that uses maps reduce the complexity?
  • What do you think of the solution?
Hani Goc
  • Have you considered just using an existing library to compute this? – Jørgen Fogh Jan 20 '16 at 11:19
  • I think computing it using vectors may be faster than maps.. what is the problem with vectors exactly? – Humam Helfawi Jan 20 '16 at 11:20
  • No @JørgenFogh, I would really like to. I was using Python before, but as you can see I moved to C++. – Hani Goc Jan 20 '16 at 11:20
  • Better if you use MATLAB :) – dynamic Jan 20 '16 at 11:21
  • @ I have to fill the vectors. I will be creating sparse vectors of dimension **169647** and then computing the distance between them. – Hani Goc Jan 20 '16 at 11:21
  • Your problem is similar to merging two sorted arrays (look up mergesort to find an example of `merge`): in your case you'll have four arrays, two with the (sorted) indices and two with the values at those indices. Iterate over both index arrays concurrently, comparing values at equal index entries and treating all unmatched indices in one as a value of zero in the other. – BeyelerStudios Jan 20 '16 at 13:08
  • @BeyelerStudios If I understand correctly, isn't MAP doing exactly what you are talking about? – Hani Goc Jan 20 '16 at 13:09
  • @HaniGoc the `map` has an unnecessary `O(n log n)` construction time if your indices are already in order (which in most sparse vector representations they are), so it will dominate the distance comparison, which is `O(n)`. Then there's the overhead of iterating through two maps instead of incrementing two indices (or four pointers). – BeyelerStudios Jan 20 '16 at 13:12
  • @HaniGoc I guess the point is: if it's as easy or easier to implement and maintain the faster array version, why bother with a map? – BeyelerStudios Jan 20 '16 at 13:15
  • @BeyelerStudios Well yes, actually the map added some extra programming complexity to the code. You are right. – Hani Goc Jan 20 '16 at 13:17
  • Your `split` function is very inefficient because it produces lots of dynamically allocated strings. Try to use std::istringstream instead (see the sketch after these comments). – gudok Jan 20 '16 at 13:46
  • As for the cosine algorithm itself, the proper implementation depends on how many non-zero elements there are in typical vectors. If nearly all elements are non-zero, then simply use `std::vector` with fixed length 169647. If there are only a few non-zero elements (as in your example), use `std::vector<std::pair<int, float>>`, fill it with `split`, sort by dimension index, and then implement cosine similarity by scanning both vectors left-to-right simultaneously (similar to the merge stage of merge sort). – gudok Jan 20 '16 at 13:52
  • @gudok I am actually using "boost::split(parts, tfidf, boost::is_any_of(" "));" but put it here just to be able to run the code independently of the library. – Hani Goc Jan 20 '16 at 13:52
  • @gudok Hold on. Suppose I keep the vector of dimension **169647** filled with zeros and compute the cosine similarity. Will that be faster and less complex than using the map? Oh well, I'll just keep it. It's really complicated anyway. – Hani Goc Jan 20 '16 at 13:55
  • I thought that by using a map I'd reduce the complexity of the algorithm. – Hani Goc Jan 20 '16 at 13:55
  • I doubt that you would be able to reduce the complexity of the algorithm with a map. If your vectors are small, then it is all about memory access rather than about asymptotics... Anyway, if you want to try a map: 1) use `std::unordered_map` instead of `std::map` (ideally with a properly guessed initial capacity); 2) create only a single map per two vectors, e.g. `std::unordered_map<int, std::pair<float, float>>` -- then the cosine similarity algorithm will require only a single pass over this map. – gudok Jan 20 '16 at 14:07
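Picking up gudok's std::istringstream suggestion, here is a minimal sketch that parses the index:value string directly into a vector of (index, value) pairs, skipping the intermediate vector<string> entirely (the helper name parseSparse is hypothetical, not from the thread):

#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Parses "1:0.1 43:0.4 100:0.43" into (index, value) pairs.
// operator>> skips whitespace, so tokens may be separated by any blanks.
std::vector<std::pair<int, float>> parseSparse(const std::string& s)
{
    std::vector<std::pair<int, float>> out;
    std::istringstream in(s);
    int idx;
    char colon;
    float val;
    while (in >> idx >> colon >> val && colon == ':')
        out.push_back({idx, val});
    return out;
}

In most sparse formats the indices already arrive in ascending order; if not, a single std::sort by index makes the result ready for the merge-style scan described in the comments and implemented in the first answer below.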

2 Answers


The common representation of a sparse vector is a simple array of indices and one of values, or sometimes an array of pairs of indices and values, since you usually need to access the index together with the value (except when you don't, e.g. for vector length / normalisation or similar). Two other forms were suggested: using std::map and std::unordered_map.

Please find the conclusion at the end.

Benchmark

I implemented the vector operations length and inner product (dot product) for those four representations. In addition, I implemented the inner product in the very straightforward way suggested in the OP's question, and an improved cosine distance computation for the pair-of-vectors implementation.

Complete code

I've run a benchmark on those implementations. You can check out my code from this link, which is where I took the following numbers (though the ratios match up pretty neatly with the runs on my own machine, only with a way higher RunCount for a more even spread of random input vectors). Here are the results:

Results

Explanation of the output of the benchmark:
  pairs: implementation using (sorted) std::vector of pairs
  map'd: implementation using std::map
  hashm: implementation using std::unordered_map
  class: implementation using two separate std::vector for indices and values respectively
  specl dot (naive map): dot product using map.find instead of proper iteration
  specl cos (optimised): cosine distance iterating only once over both vectors

Columns are the percentage of non-zeros in the random sparse vector (on average).
Values are in terms of the vector of pairs implementation
(1: equal runtime, 2: took twice as long, 0.5: took half as long).

                    inner product (dot)
            5%          10%          15%          25%
map'd       3.3          3.5          3.7          4.0
hashm       3.6          4.0          4.8          5.2
class       1.1          1.1          1.1          1.1
special[1]  8.3          9.8         10.7         10.8

                    norm squared (len2)
            5%          10%          15%          25%
map'd       6.9          7.6          8.3         10.2
hashm       2.3          3.6          4.1          4.8
class       0.98         0.95         0.93         0.75

                    cosine distance (cos)
            5%          10%          15%          25%
map'd       4.0          4.3          4.6          5.0
hashm       3.2          3.9          4.6          5.0
class       1.1          1.1          1.1          1.1
special[2]  0.92         0.95         0.93         0.94

Implementation under test

Except for the special[2] case, I used the following cosine distance function:

template<class Vector>
inline float CosineDistance(const Vector& lhs, const Vector& rhs) {
    return Dot(lhs, rhs) / std::sqrt(LenSqr(lhs) * LenSqr(rhs));
}

Containers of Pair

Here is the implementation of Dot for both a sorted vector<pair<size_t,float>> and a map<size_t,float>:

template<class PairContainerSorted>
inline float DotPairsSorted(const PairContainerSorted& lhs, const PairContainerSorted& rhs) {
    float dot = 0;
    for(auto pLhs = lhs.begin(), pRhs = rhs.begin(), endLhs = lhs.end(), endRhs = rhs.end(); pRhs != endRhs;) {
        for(; pLhs != endLhs && pLhs->first < pRhs->first; ++pLhs);
        if(pLhs == endLhs)
            break;
        for(; pRhs != endRhs && pRhs->first < pLhs->first; ++pRhs);
        if(pRhs == endRhs)
            break;
        if(pLhs->first == pRhs->first) {
            dot += pLhs->second * pRhs->second;
            ++pLhs;
            ++pRhs;
        }
    }
    return dot;
}

This is the implementation of Dot for both the unordered map and for special[1] (equal to the OP's implementation):

template<class PairMap>
inline float DotPairsMapped(const PairMap& lhs, const PairMap& rhs) {
    float dot = 0;
    for(auto& pair : lhs) {
        auto pos = rhs.find(pair.first);
        if(pos != rhs.end())
            dot += pair.second * pos->second;
    }
    return dot;
}

The implementation of LenSqr:

template<class PairContainer>
inline float LenSqrPairs(const PairContainer& vec) {
    float dot = 0;
    for(auto& pair : vec)
        dot += pair.second * pair.second;
    return dot;
}
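The generic CosineDistance shown earlier finds these kernels through overloads named Dot and LenSqr; the exact glue is in the complete code, but it presumably amounts to something like this sketch (the alias name is my invention):

#include <cstddef>
#include <utility>
#include <vector>

using PairVector = std::vector<std::pair<std::size_t, float>>;

// Thin overloads so CosineDistance<PairVector> picks up the kernels above.
inline float Dot(const PairVector& lhs, const PairVector& rhs) {
    return DotPairsSorted(lhs, rhs);
}

inline float LenSqr(const PairVector& vec) {
    return LenSqrPairs(vec);
}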

Pair of Vector

Note that I packed the pair of vectors into the struct (or class) SparseVector (check the complete code for details):
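The struct itself is not reproduced here; judging from how it is used below, it is presumably little more than this sketch:

#include <cstddef>
#include <vector>

// Sparse vector as two parallel arrays: idx holds the sorted positions
// of the non-zero entries, val the corresponding values.
struct SparseVector {
    std::vector<std::size_t> idx;
    std::vector<float> val;
};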

inline float Dot(const SparseVector& lhs, const SparseVector& rhs) {
    float dot = 0;
    if(!lhs.idx.empty() && !rhs.idx.empty()) {
        const size_t *itIdxLhs = &lhs.idx[0], *endIdxLhs = &lhs.idx[0] + lhs.idx.size();
        const float *itValLhs = &lhs.val[0], *endValLhs = &lhs.val[0] + lhs.val.size();
        const size_t *itIdxRhs = &rhs.idx[0], *endIdxRhs = &rhs.idx[0] + rhs.idx.size();
        const float *itValRhs = &rhs.val[0], *endValRhs = &rhs.val[0] + rhs.val.size();
        while(itIdxRhs != endIdxRhs) {
            for(; itIdxLhs < endIdxLhs && *itIdxLhs < *itIdxRhs; ++itIdxLhs, ++itValLhs);
            if(itIdxLhs == endIdxLhs)
                break;
            for(; itIdxRhs < endIdxRhs && *itIdxRhs < *itIdxLhs; ++itIdxRhs, ++itValRhs);
            if(itIdxRhs == endIdxRhs)
                break;
            if(*itIdxLhs == *itIdxRhs) {
                dot += (*itValLhs) * (*itValRhs);
                ++itIdxLhs;
                ++itValLhs;
                ++itIdxRhs;
                ++itValRhs;
            }
        }
    }
    return dot;
}

inline float LenSqr(const SparseVector& vec) {
    float dot = 0;
    for(float v : vec.val)
        dot += v * v;
    return dot;
}

special[2] simply computes the squared norms of both vectors while iterating through them during the inner product (check the complete code for details). I've added this to prove a point: cache hits matter. I can beat the naive vector-of-pairs approach with the pair-of-vectors one if I just access my memory more efficiently (the same would be true if you optimised the other paths, of course).
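Since the benchmark code is only linked, here is a sketch of what such a fused pass might look like (my reconstruction, not the author's exact code). Note that the norms need every element of both vectors, so unmatched entries must be consumed and accumulated rather than skipped:

#include <cmath>

// Cosine distance over two SparseVectors in one merge-style pass: the dot
// product and both squared norms are accumulated together, touching each
// idx/val array exactly once for better cache behaviour.
inline float CosineDistanceFused(const SparseVector& lhs, const SparseVector& rhs) {
    float dot = 0, lenLhs = 0, lenRhs = 0;
    std::size_t i = 0, j = 0;
    while (i < lhs.idx.size() && j < rhs.idx.size()) {
        if (lhs.idx[i] < rhs.idx[j]) {
            lenLhs += lhs.val[i] * lhs.val[i]; ++i;   // entry only in lhs
        } else if (rhs.idx[j] < lhs.idx[i]) {
            lenRhs += rhs.val[j] * rhs.val[j]; ++j;   // entry only in rhs
        } else {                                      // indices match
            dot    += lhs.val[i] * rhs.val[j];
            lenLhs += lhs.val[i] * lhs.val[i];
            lenRhs += rhs.val[j] * rhs.val[j];
            ++i; ++j;
        }
    }
    for (; i < lhs.idx.size(); ++i) lenLhs += lhs.val[i] * lhs.val[i];  // lhs tail
    for (; j < rhs.idx.size(); ++j) lenRhs += rhs.val[j] * rhs.val[j];  // rhs tail
    return dot / std::sqrt(lenLhs * lenRhs);
}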

Conclusion

Note that all tested implementations (except for special[1], which has O(k log k) behaviour) exhibit a theoretical runtime of O(k), where k is the number of non-zeros in the sparse vector: this is trivial to see for map and vector, as the implementation of Dot is the same, and the unordered map achieves it by implementing find in O(1) amortised.

Why then is a map the wrong tool for a sparse vector? For std::map the answer is the overhead of iterating a tree structure; for std::unordered_map it is the random memory access pattern of find. Both result in huge overhead from cache misses.

To demystify the theoretical benefit of std::unordered_map over std::map, check the results for special[1]. That is the implementation which std::unordered_map is beating: not because it's better suited to the problem, but because the implementation using std::map was sub-optimal.

BeyelerStudios

Say that N = 169647, and that the numbers of non-zero elements in the two vectors are m and n, respectively.

Regarding your questions:

  • The original complexity is Θ(N): the dense vectors must be initialized and scanned in full, however few non-zero elements they contain.

  • The complexity of your proposed solution is O((m + n) log(max(m, n))), which will probably be far smaller; using std::unordered_map instead, you can reduce this to expected O(m + n) (see the sketch after this list).

  • Sounds good but, as always, YMMV. You should profile both this op in the context of your entire application (to see whether it's an issue at all) and the steps within the op.
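A sketch of the std::unordered_map variant from the second bullet: the body is the same as the OP's map-based function, only the container changes, and with it find becomes expected O(1), so the whole pass is expected O(m + n):

#include <cmath>
#include <unordered_map>

float cosine_similarity(const std::unordered_map<int, float>& A,
                        const std::unordered_map<int, float>& B)
{
    float dot = 0.0f, denom_a = 0.0f, denom_b = 0.0f;
    for (const auto& a : A) denom_a += a.second * a.second;  // O(m)
    for (const auto& b : B) denom_b += b.second * b.second;  // O(n)
    for (const auto& a : A) {                                // expected O(m)
        auto it = B.find(a.first);                           // expected O(1)
        if (it != B.end())
            dot += a.second * it->second;
    }
    return dot / (std::sqrt(denom_a) * std::sqrt(denom_b));
}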

Ami Tavory