9

I'm curious why Julia's implementation of matrix addition appears to make a copy. Here's an example:

foo1=rand(1000,1000)
foo2=rand(1000,1000)
foo3=rand(1000,1000)

julia> @time foo1=foo2+foo3;
  0.001719 seconds (9 allocations: 7.630 MB)
julia> sizeof(foo1)/10^6
8.0

The amount of memory allocated is roughly the same as the memory required by a matrix of these dimensions.

It looks like, in order to evaluate foo2+foo3, memory is allocated to store the result, and then foo1 is bound to that new array by reference.

Does this mean that for most linear algebra operations we need to call BLAS and LAPACK functions directly to do things in place?

Lindon

3 Answers

15

To understand what is going on here, let's consider what foo1 = foo2 + foo3 actually does.

  • First it evaluates foo2 + foo3. To do this, it allocates a new temporary array to hold the output.
  • Then it binds the name foo1 to this new temporary array, undoing all the effort you put into pre-allocating the output array.

In short, you see that memory usage is about that of the resultant array because the routine is indeed allocating new memory for an array of that size.
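
You can see the rebinding directly with a quick identity check (illustrative only, reusing the foo1/foo2/foo3 arrays from the question):

julia> old = foo1;           # keep a reference to the pre-allocated array

julia> foo1 = foo2 + foo3;   # allocates a fresh array and rebinds the name

julia> foo1 === old          # foo1 no longer refers to the original array
false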

Here are some alternatives:

  • write a loop
  • use broadcast!
  • We could try to do copy!(foo1, foo2+foo3), and then the array you pre-allocated will be filled, but it will still allocate the temporary (see below)
  • The original version posted here

Here's some code for those 4 cases

julia> function with_loop!(foo1, foo2, foo3)
           for i in eachindex(foo2)
              foo1[i] = foo2[i] + foo3[i]
           end
       end

julia> function with_broadcast!(foo1, foo2, foo3)
           broadcast!(+, foo1, foo2, foo3)
       end

julia> function with_copy!(foo1, foo2, foo3)
           copy!(foo1, foo2+foo3)
       end

julia> function original(foo1, foo2, foo3)
           foo1 = foo2 + foo3
       end

Now let's time these functions

julia> for f in [:with_broadcast!, :with_loop!, :with_copy!, :original]
           @eval $f(foo1, foo2, foo3) # compile
           println("timing $f")
           @eval @time $f(foo1, foo2, foo3)
       end
timing with_broadcast!
  0.001787 seconds (5 allocations: 192 bytes)
timing with_loop!
  0.001783 seconds (4 allocations: 160 bytes)
timing with_copy!
  0.003604 seconds (9 allocations: 7.630 MB)
timing original
  0.002702 seconds (9 allocations: 7.630 MB, 97.91% gc time)

You can see that with_loop! and with_broadcast! perform about the same, and both are much faster and more memory-efficient than the others. with_copy! and original are both slower and use more memory.

In general, to do in-place operations I'd recommend starting out by writing a loop.
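
On newer Julia versions (0.6 and later), the fused dot syntax should give you the same in-place behavior as the explicit loop, without writing the loop out by hand:

julia> foo1 .= foo2 .+ foo3;   # fused broadcast: fills foo1 in place, no temporary array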

spencerlyon2
2

First, read @spencerlyon2's answer. Another approach is to use Dahua Lin's package Devectorize.jl. It defines the @devec macro, which automates the translation of vector (array) expressions into looped code.

In this example we will define with_devec!(foo1, foo2, foo3) as follows:

julia> using Devectorize      # install with Pkg.add("Devectorize")

julia> with_devec!(foo1,foo2,foo3) = @devec foo1[:]=foo2+foo3

Running the benchmark from the first answer gives the same 4-allocation result as the loop version.
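
As a rough check (exact numbers depend on the machine; this just reuses the arrays and the @time pattern from above):

julia> with_devec!(foo1, foo2, foo3);         # run once to compile

julia> @time with_devec!(foo1, foo2, foo3);   # should report only a few small allocations, like the loop version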

Dan Getz
1

You can use the axpy! function from the LinearAlgebra standard library.

julia> using LinearAlgebra

julia> @time BLAS.axpy!(1., foo2, foo3)
  0.002187 seconds (4 allocations: 160 bytes)
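
Note that axpy!(a, X, Y) updates Y in place (Y = a*X + Y), so the call above overwrites foo3 rather than foo1. A sketch of how you could fill the pre-allocated foo1 instead (copyto! was called copy! on older Julia versions):

julia> copyto!(foo1, foo2);           # foo1 now holds a copy of foo2

julia> BLAS.axpy!(1.0, foo3, foo1);   # in place: foo1 = 1.0*foo3 + foo1
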
Xward
    This is a bad answer as for almost all cases, `foo1 .= foo2 .+ foo3` is more appropriate – Oscar Smith May 28 '20 at 00:58
  • More appropriate in what aspect? If you benchmark, the median runtime of axpy! is 1.2ms whereas your suggestion is 1.8ms. – Xward May 31 '20 at 18:04