1

I have a sparse array: term_doc

  • its size is 622256x715 of Float64. It is very sparse:

    • Of its ~444,913,040 cells, only about 22,215 of them normally are nonempty.
    • Of the 622256 rows only 4,699 are occupied
    • though of the 715 columns all are occupied.

The operator I would like to perform can be described as returning the row normalized and column normalized versions this matrix.

The Naive nonsparse version, I wrote is:

function doUnsparseWay()
    gc() #Force Garbage collect before I start (and periodically during). This uses alot of memory
    term_doc

    N = term_doc./sum(term_doc,1)
    println("N done")  

    gc()
    P = term_doc./sum(term_doc,2)    
    println("P done")
    gc()

    N[isnan(N)] = 0.0
    P[isnan(P)] = 0.0

    N,P,term_doc
end

Running this:

> @time N,P,term_doc= doUnsparseWay()
outputs:
N done
P done
elapsed time: 30.97332475 seconds (14466 MB allocated, 5.15% gc time in 13 pauses with 3 full sweep)

It is fairly simple. It chews memory, and will crash if the garbage collection does not occur at the right times (Thus I call it manually). But it is fairly fast


I wanted to get it to work on the sparse matrix. So as not to chew my memory out, and because logically it is a faster operation -- less cells need operating on.

I followed suggestions from this post and from the performance page of the docs.

function doSparseWay()
    term_doc::SparseMatrixCSC{Float64,Int64}

    N= spzeros(size(term_doc)...)
    N::SparseMatrixCSC{Float64,Int64}

    for (doc,total_terms::Float64) in enumerate(sum(term_doc,1))
        if total_terms == 0
            continue
        end
        @fastmath @inbounds N[:,doc] = term_doc[:,doc]./total_terms
    end
    println("N done")  

    P = spzeros(size(term_doc)...)'
    P::SparseMatrixCSC{Float64,Int64}

    gfs = sum(term_doc,2)[:]
    gfs::Array{Float64,1} 


    nterms = size(term_doc,1)
    nterms::Int64 
    term_doc = term_doc'

    @inbounds @simd for term in 1:nterms
        @fastmath @inbounds P[:,term] = term_doc[:,term]/gfs[term]
    end
    println("P done")
    P=P'


    N[isnan(N)] = 0.0
    P[isnan(P)] = 0.0

    N,P,term_doc
end

It never completes. It gets up to outputting "N Done", but never outputs "P Done". I have left it running for several hours.

  • How can I optimize it so it can complete in reasonable time?
  • Or if this is not possible, explain why.
Community
  • 1
  • 1
Frames Catherine White
  • 27,368
  • 21
  • 87
  • 137
  • FYI - In the code I've seen, sparse matrices call a hashing function (not inlined) to convert the virtual coordinates into absolute. The difference in speed between accessing elements in sparse and normal matrices is quite large. If you need speed, figure out a way to avoid using sparse matrices. – BitBank Feb 23 '15 at 16:23

1 Answers1

1

First, you're making term_doc a global variable, which is a big problem for performance. Pass it as an argument, doSparseWay(term_doc::SparseMatrixCSC). (The type annotation at the beginning of your function does not do anything useful.)

You want to use an approach similar to the answer by walnuss:

function doSparseWay(term_doc::SparseMatrixCSC)
    I, J, V = findnz(term_doc)
    normI = sum(term_doc, 1)
    normJ = sum(term_doc, 2)
    NV = similar(V)
    PV = similar(V)
    for idx = 1:length(V)
        NV[idx] = V[idx]/normI[J[idx]]
        PV[idx] = V[idx]/normJ[I[idx]]
    end
    m, n = size(term_doc)
    sparse(I, J, NV, m, n), sparse(I, J, PV, m, n), term_doc
end

This is a general pattern: when you want to optimize something for sparse matrices, extract the I, J, V and perform all your computations on V.

Community
  • 1
  • 1
tholy
  • 11,882
  • 1
  • 29
  • 42