Efficient vector operations of linear algebra in Common Lisp, especially SBCL?

Question

The program below seems very inefficient. With SBCL 1.0.53 it spends 28.980 seconds in GC, compared with only 6.361 seconds of non-GC time.

(deftype vec3 () '(simple-array double-float (3)))

(declaim (inline make-vec3 vec3-zero
             vec3-x vec3-y vec3-z
             vec3-+))

(defun make-vec3 (x y z)
  (declare (optimize (speed 3) (safety 0)))
  (make-array 3 :element-type 'double-float
                :initial-contents (list x y z)))

(defun vec3-zero ()
  (make-vec3 0.0d0 0.0d0 0.0d0))

(defun vec3-x (x)
  (declare (optimize (speed 3) (safety 0)))
  (declare (type (simple-array double-float (3)) x))
  (aref x 0))

(defun vec3-y (x)
  (declare (optimize (speed 3) (safety 0)))
  (declare (type (simple-array double-float (3)) x))
  (aref x 1))

(defun vec3-z (x)
  (declare (optimize (speed 3) (safety 0)))
  (declare (type (simple-array double-float (3)) x))
  (aref x 2))

(defun vec3-+ (a b)
  (declare (optimize (speed 3) (safety 0)))
  (make-vec3 (+ (vec3-x a) (vec3-x b))
             (+ (vec3-y a) (vec3-y b))
             (+ (vec3-z a) (vec3-z b))))


;; main

(defun image (x y)
  (make-array (* x y) :element-type 'vec3 :initial-element (vec3-zero)))

(defun add (to from val)
  (declare (type (simple-array vec3 (*)) to from)
           (type vec3 val)
           (optimize (speed 3) (safety 0)))
  (let ((size (array-dimension to 0)))
    (dotimes (i size)
      (setf (aref to i) (vec3-+ (aref from i) val)))))

(defun main ()
  (let ((to (image 800 800))
        (x (make-vec3 1.0d0 1.0d0 1.0d0)))
    (time (dotimes (i 200)
            (add to to x)))
    (print (aref to 0))))

Timing output:

* (main)
Evaluation took:
  39.530 seconds of real time
  35.340237 seconds of total run time (25.945526 user, 9.394711 system)
  [ Run times consist of 28.980 seconds GC time, and 6.361 seconds non-GC time. ]
  89.40% CPU
  83,778,297,762 processor cycles
  46 page faults
  6,144,014,656 bytes consed


#(200.0d0 200.0d0 200.0d0) 
#(200.0d0 200.0d0 200.0d0)

Is there any way to compute this more efficiently while keeping the vec3 abstraction?

For example, implementing a worker/wrapper transformation with macros could eliminate the allocation (consing) of vec3 objects.
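A minimal sketch of the worker/wrapper idea, assuming the worker is an inline function that returns the three components as multiple values rather than a fresh vec3. This swaps in multiple values plus a small macro as the mechanism; whether SBCL keeps the values unboxed depends on inlining and type declarations. The names vec3-+/values, with-vec3-values and vec3-add-into are hypothetical.

```lisp
(deftype vec3 () '(simple-array double-float (3)))

;; Worker: returns the components of A + B as three values,
;; so no new vec3 is allocated by the worker itself.
(declaim (inline vec3-+/values))
(defun vec3-+/values (a b)
  (declare (type vec3 a b))
  (values (+ (aref a 0) (aref b 0))
          (+ (aref a 1) (aref b 1))
          (+ (aref a 2) (aref b 2))))

;; Hypothetical helper: bind the three values returned by FORM.
(defmacro with-vec3-values ((x y z) form &body body)
  `(multiple-value-bind (,x ,y ,z) ,form
     (declare (type double-float ,x ,y ,z))
     ,@body))

;; Wrapper: boxes the worker's result into a fresh vec3
;; for callers that really need a new object.
(defun vec3-+ (a b)
  (declare (type vec3 a b))
  (with-vec3-values (x y z) (vec3-+/values a b)
    (let ((v (make-array 3 :element-type 'double-float)))
      (setf (aref v 0) x (aref v 1) y (aref v 2) z)
      v)))

;; A caller that only needs the components writes them straight
;; into an existing destination, so nothing is allocated.
(defun vec3-add-into (dst a b)
  (declare (type vec3 dst a b))
  (with-vec3-values (x y z) (vec3-+/values a b)
    (setf (aref dst 0) x (aref dst 1) y (aref dst 2) z))
  dst)
```

In the question's add loop, calling vec3-add-into with the existing element as destination would avoid creating a new vec3 per pixel; vec3-+ remains available where a fresh object is wanted.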

Alternatively, keeping a pool of vec3 objects would reduce memory allocation.
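A pool could look roughly like the sketch below. It assumes an adjustable vector with a fill pointer used as a stack of free vec3s (a plain list would cons a new cell on every release). The names *vec3-pool*, alloc-vec3 and free-vec3 are hypothetical, and the pool is not thread-safe.

```lisp
(deftype vec3 () '(simple-array double-float (3)))

;; Hypothetical pool: an adjustable vector used as a stack of free vec3s.
;; VECTOR-POP / VECTOR-PUSH-EXTEND only allocate when the pool must grow.
(defvar *vec3-pool* (make-array 16 :adjustable t :fill-pointer 0))

(defun alloc-vec3 (x y z)
  "Return a vec3 holding X, Y, Z, reusing a pooled object when one is free."
  (declare (type double-float x y z))
  (let ((v (if (plusp (fill-pointer *vec3-pool*))
               (vector-pop *vec3-pool*)
               (make-array 3 :element-type 'double-float))))
    (declare (type vec3 v))
    (setf (aref v 0) x
          (aref v 1) y
          (aref v 2) z)
    v))

(defun free-vec3 (v)
  "Return V to the pool; the caller must not use V afterwards."
  (declare (type vec3 v))
  (vector-push-extend v *vec3-pool*)
  nil)
```

The catch is that every vec3 must be explicitly released exactly once, which is the same bookkeeping burden as manual memory management.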

Ideally, SBCL would support non-descriptor (unboxed) representations for small data structures such as vec3 when they are stored as array elements.

Maybe here you'll be able to find hints for improving performance: http://random-state.net/log/3530433886.html — Vsevolod Dyomkin, Dec 02 '11 at 20:54
Thanks. It's not about structured data like the vec3, but flat floating point operations. — masayuki takagi, Dec 05 '11 at 15:25

score 5 · Accepted Answer · answered Dec 02 '11 at 19:00

I think in these situations, making use of macros can be a good idea. Next, I would always hesitate to declare (safety 0): it brings only very slight performance gains and can result in strange behaviour if not only the code inside the defun, but also all code calling the defun, is not absolutely correct.

The important thing here I think is to not make a new list object in make-vec3. I attach a somewhat quick and dirty optimization of your code. On my machine the original code runs in

; cpu time (non-gc) 27.487818 sec user, 0.008999 sec system
; cpu time (gc)     17.334368 sec user, 0.001999 sec system
; cpu time (total)  44.822186 sec user, 0.010998 sec system
; real time  44.839858 sec
; space allocation:
;  0 cons cells, 45,056,000,000 other bytes, 0 static bytes

and my version runs in

; cpu time (non-gc) 4.075385 sec user, 0.001000 sec system
; cpu time (gc)     2.162666 sec user, 0.000000 sec system
; cpu time (total)  6.238051 sec user, 0.001000 sec system
; real time  6.240055 sec
; space allocation:
;  8 cons cells, 8,192,030,976 other bytes, 0 static bytes

This is using Allegro; YMMV on other Lisps. You mention pooling conses/memory for vec3 arrays, and I think reusing those objects, i.e. modifying them destructively, is a good idea when you have the opportunity to do so. On my Lisp a vec3 takes 64 bytes, which is quite a bit. Another useful thing is of course to invoke the profiler to see where time is spent. Also, in these math-heavy problems, it is important that array references and arithmetic are open coded (compiled inline as machine instructions rather than as calls into the runtime) as much as possible. Most Lisps provide (disassemble 'my-function), which gives a good idea of whether these operations were indeed open coded or whether the runtime is invoked.

(deftype vec3 () '(simple-array double-float (3)))

(declaim (optimize (speed 3) (debug 0) (safety 1)))

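(defmacro make-vec3 (x y z)
  ;; Fill a fresh array directly instead of building a list for
  ;; :initial-contents; the GENSYM avoids capturing a caller's variable.
  (let ((v (gensym "VEC3")))
    `(let ((,v (make-array 3 :element-type 'double-float
                             :initial-element 0.0d0)))
       (setf (aref ,v 0) ,x
             (aref ,v 1) ,y
             (aref ,v 2) ,z)
       ,v)))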


(defun vec3-zero ()
  (make-vec3 0.0d0 0.0d0 0.0d0))

(defmacro vec3-x (x)
  `(aref ,x 0))

(defmacro vec3-y (x)
  `(aref ,x 1))

(defmacro vec3-z (x)
  `(aref ,x 2))

(defun vec3-+ (a b)
  (declare (type vec3 a b))
  (make-vec3 (+ (vec3-x a) (vec3-x b))
             (+ (vec3-y a) (vec3-y b))
             (+ (vec3-z a) (vec3-z b))))

(defun image (x y)
  (make-array (* x y) :element-type 'vec3 :initial-element (vec3-zero)))

(defun add (to from val)
  (declare (type (simple-array vec3 (*)) to from)
           (type vec3 val))
  (let ((size (array-dimension to 0)))
    (dotimes (i size)
      (setf (aref to i) (vec3-+ (aref from i) val)))))

(defun main ()
  (let ((to (image 800 800))
        (x (make-vec3 1.0d0 1.0d0 1.0d0)))
    (time (dotimes (i 200)
            (add to to x)))
    (print (aref to 0))))

Thanks, but your code seems to give no performance improvement in my environment. Disassembling the add function shows that my code is already well open coded and doesn't invoke the runtime. I will try to make use of macros, as you say it can be a good idea. Anyway, I'm envious that you can use Allegro. — masayuki takagi, Dec 04 '11 at 14:18
I think SBCL is much more aggressive than Allegro when it comes to optimizing numeric code. Allegro can get the same performance, but needs many more type declarations. Another thing I have not tried, which may (or may not) improve performance somewhat, is making the vec3 objects conses instead of arrays. — , Dec 04 '11 at 15:48
As you say, SBCL works very efficiently when the x, y and z elements are computed directly as plain double-float operations, without the vec3 data abstraction. So I suspect that the performance difference between my environment and yours is caused by the memory allocation in make-array in the make-vec3 function/macro. I will try using conses instead of arrays. — masayuki takagi, Dec 05 '11 at 10:50
You could also consider making a destructive version of vec3-+, reusing one of its arguments. Make sure to change the :initial-element in image() as well! — , Dec 05 '11 at 23:12
OK! I now have a destructive version of vec3-+, with the :initial-element in image() fixed as you pointed out, and it works very efficiently. It runs about half as fast as similar C code compiled with gcc. The remaining difference between CL and C seems to be caused by slight differences in the machine code SBCL and gcc generate for the addition (the "addsd" instruction), which I checked with SBCL's disassemble function and gcc's -S option. — masayuki takagi, Dec 06 '11 at 05:30
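A sketch of what the destructive approach from these two comments might look like. It assumes the names vec3-+! and add! (both hypothetical), drops the separate from argument because main passes the same array twice, and builds image with one fresh vec3 per pixel, since a shared :initial-element would be incremented once per pixel.

```lisp
(deftype vec3 () '(simple-array double-float (3)))

(declaim (inline make-vec3 vec3-+!))
(defun make-vec3 (x y z)
  (declare (type double-float x y z))
  (let ((v (make-array 3 :element-type 'double-float)))
    (setf (aref v 0) x (aref v 1) y (aref v 2) z)
    v))

(defun vec3-+! (dst b)
  "Destructively add B into DST and return DST; nothing is allocated."
  (declare (type vec3 dst b))
  (incf (aref dst 0) (aref b 0))
  (incf (aref dst 1) (aref b 1))
  (incf (aref dst 2) (aref b 2))
  dst)

(defun image (x y)
  "Give every pixel its own vec3 instead of one shared :initial-element."
  (let ((img (make-array (* x y))))
    (dotimes (i (length img) img)
      (setf (aref img i) (make-vec3 0d0 0d0 0d0)))))

(defun add! (to val)
  "Destructively add VAL to every vec3 in TO."
  (declare (type simple-vector to) (type vec3 val)
           (optimize (speed 3)))
  (dotimes (i (length to) to)
    (vec3-+! (the vec3 (svref to i)) val)))
```

The per-pixel vec3s are allocated once in image; the 200 iterations of add! then only mutate them.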
The difference in performance between CL and C is not caused by the "addsd" instruction, but by the references into the vec3 arrays: my code needs two references to reach the double-float elements ("array -> vec3 -> double-float"). By using a flat double-float array with three times as many elements instead of an array of vec3s, with appropriate macro abstraction, I've gained a nice performance improvement, matching C. — masayuki takagi, Dec 07 '11 at 14:48
@JohanBenumEvensberget: I have seen the term 'open coded' in various places. However, I am not aware of a definition. Can you provide one, or a reference to one? Thanks. — Faheem Mitha, Jan 04 '13 at 05:27
@masayukitakagi I'm not clear what you mean by "Using double-float array, which has x3 elements, instead of vec3 array, with appropriate macro abstraction". Could you update your question, or perhaps answer the question yourself, with your updated, improved code? It would be interesting for me. Thanks. — Faheem Mitha, Jan 04 '13 at 08:03
@FaheemMitha I mean that if I want to use 100 vectors, each with x, y and z elements, I can use an array of 300 double-float elements and then build an abstraction to operate on, for example, the x element of the Nth vector. With this approach, the operations work directly on a simple double-float array and are very efficient. — masayuki takagi, Jan 06 '13 at 13:09
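A minimal sketch of the flat layout described in the last comment, assuming N vectors are stored as x0 y0 z0 x1 y1 z1 ... in one (simple-array double-float (*)). The type vec3-array, the constructor make-vec3-array, the accessor macros vx, vy, vz and the function add-flat are hypothetical names, not the commenter's actual code.

```lisp
(deftype vec3-array () '(simple-array double-float (*)))

(defun make-vec3-array (n)
  "Room for N vec3s, stored as x0 y0 z0 x1 y1 z1 ..."
  (make-array (* 3 n) :element-type 'double-float :initial-element 0d0))

;; Accessors for the components of the I-th vec3; usable with SETF
;; because they expand into AREF.
(defmacro vx (arr i) `(aref ,arr (* 3 ,i)))
(defmacro vy (arr i) `(aref ,arr (+ (* 3 ,i) 1)))
(defmacro vz (arr i) `(aref ,arr (+ (* 3 ,i) 2)))

(defun add-flat (to from x y z)
  "For every I, set the I-th vec3 of TO to the I-th vec3 of FROM plus (X Y Z)."
  (declare (type vec3-array to from)
           (type double-float x y z)
           (optimize (speed 3)))
  (dotimes (i (floor (length to) 3) to)
    (setf (vx to i) (+ (vx from i) x)
          (vy to i) (+ (vy from i) y)
          (vz to i) (+ (vz from i) z))))
```

Each component is now a single AREF into a specialized double-float array, which removes the extra "array -> vec3" indirection mentioned above while the macros keep the x/y/z vocabulary.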

