Cosine Similarity in linear time using Lisp

Question

The cosine similarity of two lists can be calculated in linear time using a for-loop. I'm curious as to how one would achieve this using a Lisp-like language. Below is an example of my code in Python and Hy (Hylang).

Python:

def cos_sim(A,B):
    import math as _math
    n,da,db,d = 0,0,0,0

    for a,b in zip(A,B):
        n += a*b
        da += a*a
        db += b*b

    da = _math.sqrt(da)
    db = _math.sqrt(db)
    d = da*db

    return n / (d + 1e-32)

Hy (Lisp):

(import math)

(defn l2norm [a]
    (math.sqrt (reduce + (map (fn [s](* s s)) a))))

(defn dot [a b]
   (reduce + (map * a b)))

(defn cossim [a b]
   (/ (dot a b) (* (l2norm a) (l2norm b))))

score 2 · Answer 1 · answered Dec 09 '21 at 02:27

2

A simple option is to translate the Python version literally to Hy, like this:

(defn cos_sim [A B]
  (import math :as _math)
  (setv [n da db d] [0 0 0 0])

  (for [[a b] (zip A B)]
    (+= n (* a b))
    (+= da (* a a))
    (+= db (* b b)))

  (setv
    da (_math.sqrt da)
    db (_math.sqrt db)
    d (* da db))

  (/ n (+ d 1e-32)))

answered Dec 09 '21 at 02:27

Kodiologist

2,984
18
33

Thanks for this solution. I wonder if there might be a more Lisp-like way of doing this? Or perhaps this _is_ the most lisp-like way to do it? – jwnz Dec 09 '21 at 04:55

score 2 · Accepted Answer · answered Dec 09 '21 at 07:52

"I'm curious as to how one would achieve this using a Lisp-like language." This really depends on which Lisp you are using. In Scheme you might do something similar to the posted Hy solution:

(define (cos-sim-1 u v)
  (/ (dot-prod u v)
     (* (norm u) (norm v))))

(define (dot-prod u v)
  (fold-left + 0 (map * u v)))

(define (norm u)
  (sqrt (fold-left (lambda (acc x) (+ acc (* x x)))
                   0
                   u)))

This is linear in time complexity, but it could be improved by a constant factor by passing over the input only once. Scheme provides a named let construct that can be used to bind a name to a procedure; this is convenient here as it provides a simple mechanism for building the dot product and norms:

(define (cos-sim-2 u v)
  (let iter ((u u)
             (v v)
             (dot-product 0)
             (U^2 0)
             (V^2 0))
    (if (null? u)
        (/ dot-product (sqrt (* U^2 V^2)))
        (let ((x (car u))
              (y (car v)))
          (iter (cdr u)
                (cdr v)
                (+ dot-product (* x y))
                (+ U^2 (* x x))
                (+ V^2 (* y y)))))))

Both of these procedures assume that the input lists have the same length; it might be useful to add some validation code that checks this. Note that fold-left is standard in R6RS Scheme, but other standards rely on SRFIs for this, and some implementations may use different names, but the fold-left functionality is commonly available (perhaps as foldl or reduce).

It is possible to solve the problem in Common Lisp using either of the basic methods shown above, though in Common Lisp you would use labels instead of named let. But it would be typical to see a Common Lisp solution using the loop macro. The Common Lisp standard does not guarantee tail call elimination (though some implementations do support that), so explicit loops are seen much more often than in Scheme. The loop macro is pretty powerful, and one way that you could solve this problem while passing over the input lists only once is this:

(defun cos-sim (u v)
  (loop :for x :in u
        :for y :in v
        :sum (* x y) :into dot-product
        :sum (* x x) :into u2
        :sum (* y y) :into y2
        :finally (return (/ dot-product (sqrt (* u2 y2))))))

Here are some sample interactions:

Scheme (Chez Scheme):

> (cos-sim-1 '(1 0 0) '(1 0 0))
1
> (cos-sim-1 '(1 0 0) '(-1 0 0))
-1
> (cos-sim-1 '(1 0 0) '(0 1 0))
0
> (cos-sim-1 '(1 1 0) '(0 1 0))
0.7071067811865475

> (cos-sim-2 '(1 0 0) '(1 0 0))
1
> (cos-sim-2 '(1 0 0) '(-1 0 0))
-1
> (cos-sim-2 '(1 0 0) '(0 1 0))
0
> (cos-sim-2 '(1 1 0) '(0 1 0))
0.7071067811865475

Common Lisp:

CL-USER> (cos-sim '(1 0 0) '(1 0 0))
1.0
CL-USER> (cos-sim '(1 0 0) '(-1 0 0))
-1.0
CL-USER> (cos-sim '(1 0 0) '(0 1 0))
0.0
CL-USER> (cos-sim '(1 1 0) '(0 1 0))
0.70710677

I'm glad to see I created the same Loop; I've been meaning to adopt the keyword syntax I see people use here, as I think it is indeed more readable. — Halbert, Dec 09 '21 at 08:10

score 2 · Answer 3 · answered Dec 09 '21 at 08:05

I think your proposed solution is fairly 'lispy': build several short, easy to read functions that combine into your solution. EG:

(defun n (A B)
    (sqrt (reduce #'+ (map 'list #'* A B))))

(defun da (A)
    (sqrt (reduce #'+ (map 'list #'* A A))))

(defun db (B)
    (sqrt (reduce #'+ (map 'list #'* B B))))

(defun cos-sim (A B)
    (let ((n (n A B))
          (da (da A))
          (db (db B)))
      (/ (* n n) (+ (* da db) 1e-32)))

But, notice that n, da, and db look very similar. We can see if we can make those a single function, or macro. In this case, a function with an optional second list parameter is easy enough. (And note that I've defined n in a slightly weird way to emphasize this, but we might prefer not to take a square root and then square it for our final calculation. This would be easy to change by checking for passing the optional parameter (included as B-p below); I chose to move the square root inside the combined function) Anyway, this gives us:

 (defun d (A &optional (B A B-p))
    (reduce #'+ (map 'list #'* A B)))
 
 (defun cos-sim (A B)
    (let ((n (d A B))
          (da (sqrt (d A)))
          (db (sqrt (d B))))
      (/ n (+ (* da db) 1e-32))))

Alternately, using Loop is very Common Lisp-y, and is more directly similar to the python:

(defun cos-sim (A B)
    (loop for a in A
          for b in B
          sum (* a b) into n
          sum (* a a) into da
          sum (* b b) into db
          finally (return (/ n (+ (sqrt (* da db)) 1e-32)))))

score 2 · Answer 4 · answered Dec 09 '21 at 11:14

Here is a fairly natural (I think) approach in Racket. Essentially this is a process of folding a pair of sequences of numbers, so that's what we do. Note that this uses no explicit assignment, and also pulls the square root up a level (sqrt(a) * sqrt(b) = sqrt(a*b) as taking roots is likely expensive (this probably does not matter in practice). It also doesn't do the weird adding of a tiny float, which I presume was an attempt to coerce a value which might not be a float to a float? If so that's the wrong way to do that, and it's also not needed in a language like Racket (and most Lisps) which strive to do arithmetic correctly where possible.

(define (cos-sim a b)
  ;; a and b are sequences of numbers
  (let-values ([(a^2-sum b^2-sum ab-sum)
                (for/fold ([a^2-running 0]
                           [b^2-running 0]
                           [ab-running 0])
                          ([ai a] [bi b])
                  (values (+ (* ai ai) a^2-running)
                          (+ (* bi bi) b^2-running)
                          (+ (* ai bi) ab-running)))])
    (/ ab-sum (sqrt (* a^2-sum b^2-sum)))))

You can relatively easily turn this into typed Racket:

(define (cos-sim (a : (Sequenceof Number))
                 (b : (Sequenceof Number)))
  : Number
  (let-values ([(a^2-sum b^2-sum ab-sum)
                (for/fold ([a^2-running : Number 0]
                           [b^2-running : Number 0]
                           [ab-running : Number 0])
                          ([ai a] [bi b])
                  (values (+ (* ai ai) a^2-running)
                          (+ (* bi bi) b^2-running)
                          (+ (* ai bi) ab-running)))])
    (/ ab-sum (sqrt (* a^2-sum b^2-sum)))))

This probably is no faster, but it is fussier.

This might be faster though:

(define (cos-sim/flonum (a : (Sequenceof Flonum))
                        (b : (Sequenceof Flonum)))
  : Flonum
  (let-values ([(a^2-sum b^2-sum ab-sum)
                (for/fold ([a^2-running : Flonum 0.0]
                           [b^2-running : Flonum 0.0]
                           [ab-running : Flonum 0.0])
                          ([ai a] [bi b])
                  (values (+ (* ai ai) a^2-running)
                          (+ (* bi bi) b^2-running)
                          (+ (* ai bi) ab-running)))])
    (/ ab-sum (assert (sqrt (* a^2-sum b^2-sum)) flonum?))))

I have not checked it is however.

I thought about adding a Racket solution, but my answer was getting long enough as it was. With its many iteration forms, I'm glad that someone wrote an answer using Racket. — ad absurdum, Dec 09 '21 at 23:29
Although somewhat transgressing from the original question, I originally added the small float thing to suppress divide-by-zero errors. But now I see that that's a poor way to handle the exception haha. It also explains why some of my similarity values were exploding. — jwnz, Dec 10 '21 at 01:27

score 2 · Answer 5 · answered Dec 09 '21 at 15:33

Your Hy example is already linear time. None of the nested loops multiply their number of iterations based on the length of input. It could be simplified to make this easier to see

(import math)

(defn dot [a b]
   (sum (map * a b)))

(defn l2norm [a]
    (math.sqrt (dot a a)))

(defn cossim [a b]
   (/ (dot a b) (* (l2norm a) (l2norm b))))

I think this version is clearer than the Python version, because it's closer to the math notation.

Let's also inline the l2norm to make the number of loops easier to see.

(defn cossim [a b]
   (/ (dot a b)
      (* (math.sqrt (dot a a))
         (math.sqrt (dot b b)))))

Python's map() is lazy, so the sum() and map() together only loop once. You effectively have three loops, one for each dot, and none of them are nested. Your Python version had one loop, but it was doing more calculations each iteration. Theoretically, it doesn't matter if you calculate row-by-row or column-by-column: multiplication is commutative, either rows by columns or columns by rows are the same number of calculations.

However, in practice, Python does have significant overhead for function calls, so I would expect the Hy version using higher-order functions to be slower than the Python version that doesn't have any function calls in the loop body. This is a constant factor slowdown, so it's still linear time.

If you want fast loops for calculations in Python, put your data in a matrix and use Numpy.

Cosine Similarity in linear time using Lisp

5 Answers5