
I have two functions that compute the same metric. One uses a list comprehension to loop over the calculation; the other uses only NumPy tensor operations. Both functions take an (N, 3) array, where N is the number of points in 3D space. When N < ~3000 the tensor function is faster; when N > ~3000 the list comprehension is faster. Both appear to have linear time complexity in N, i.e. the two runtime-vs-N lines cross at N ≈ 3000.

import numpy as np

def approximate_area_loop(section, num_area_divisions):
    n_a_d = num_area_divisions
    interp_vectors = get_section_interp_(section)  # helper defined elsewhere (not shown)

    a1 = section[:-1]
    b1 = section[1:]
    a2 = interp_vectors[:-1]
    b2 = interp_vectors[1:]

    c = lambda u: (1 - u) * a1 + u * a2           # interpolate a1 -> a2
    d = lambda u: (1 - u) * b1 + u * b2           # interpolate b1 -> b2
    x = lambda u, v: (1 - v) * c(u) + v * d(u)    # bilinear point on each patch

    area = np.sum([np.linalg.norm(np.cross(x((i + 1)/n_a_d, j/n_a_d) - x(i/n_a_d, j/n_a_d),
                                           x(i/n_a_d, (j + 1)/n_a_d) - x(i/n_a_d, j/n_a_d)), axis=1)
                   for i in range(n_a_d) for j in range(n_a_d)])

    Dt = section[-1, 0] - section[0, 0]
    return area, Dt

def approximate_area_tensor(section, num_area_divisions):
    divisors = np.linspace(0, 1, num_area_divisions + 1)
    interp_vectors = get_section_interp_(section)
    a1 = section[:-1]
    b1 = section[1:]
    a2 = interp_vectors[:-1]
    b2 = interp_vectors[1:]
    c = np.multiply.outer(a1, (1 - divisors)) + np.multiply.outer(a2, divisors) # c_areas_vecs_divs
    d = np.multiply.outer(b1, (1 - divisors)) + np.multiply.outer(b2, divisors) # d_areas_vecs_divs
    x = np.multiply.outer(c, (1 - divisors)) + np.multiply.outer(d, divisors) # x_areas_vecs_Divs_divs
    u = x[:, :, 1:, :-1] - x[:, :, :-1, :-1] # u_areas_vecs_Divs_divs
    v = x[:, :, :-1, 1:] - x[:, :, :-1, :-1] # v_areas_vecs_Divs_divs
    sub_area_norm_vecs = np.cross(u, v, axis = 1) # areas_crosses_Divs_divs
    sub_areas = np.linalg.norm(sub_area_norm_vecs, axis = 1) # areas_Divs_divs (values are now sub areas)
    area = np.sum(sub_areas)
    Dt = section[-1, 0] - section[0, 0]
    return area, Dt
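
For context, here's roughly how the crossover can be reproduced (a minimal timing sketch; `get_section_interp_` is stubbed out since its definition isn't shown, and `num_area_divisions = 10` is just an assumed value):

import timeit
import numpy as np

def get_section_interp_(section):
    # stand-in stub: any (N, 3) array of matching shape works for timing purposes
    return section + np.array([0.0, 0.1, 0.0])

for N in (1000, 3000, 10000):
    section = np.random.rand(N, 3)
    t_loop = timeit.timeit(lambda: approximate_area_loop(section, 10), number=5)
    t_tens = timeit.timeit(lambda: approximate_area_tensor(section, 10), number=5)
    print(f"N = {N}: loop {t_loop:.3f}s, tensor {t_tens:.3f}s")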

Why does the list comprehension version run faster at large N? Surely the tensor version should be faster? I wonder whether the intermediate arrays in the tensor version become too large to fit in cache. Please ask if I haven't included enough information; I'd really like to get to the bottom of this.
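
For scale, a rough back-of-envelope with float64 and, say, num_area_divisions = 10 puts the intermediate x well past typical L2 cache sizes at N = 3000:

N, D = 3000, 10                            # D = num_area_divisions (assumed value)
x_bytes = (N - 1) * 3 * (D + 1) ** 2 * 8   # x has shape (N-1, 3, D+1, D+1), 8 bytes per float64
print(x_bytes / 1e6)                       # ~8.7 MB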

Eden Trainor
  • I'm not sure what algorithm you are using or what you mean by approximating the area tensor, but the outer products used generously in the NumPy function create 3D arrays which, for large N, can get really big and time-consuming to build. Your list comprehension does not seem to generate them. Probably there's an easier way to calculate those areas, or someone has already done it. – anishtain4 Mar 21 '19 at 15:11
  • What's the shape of `u`? Your code is complex enough that it's hard to tell at a glance. A number of SO questions have observed that with very large arrays, there's a trade-off between memory management complexity and iteration time. I wonder how a partial loop case would perform, where you loop only on `i`. – hpaulj Mar 21 '19 at 16:19
  • Overall your code is quite complex, using complex functions like `cross` and `norm`. So it's hard to isolate where the large-memory case slows down. You might have to do some more focused timings. – hpaulj Mar 21 '19 at 16:28

1 Answer


The bottleneck in the fully vectorized function was indeed np.linalg.norm, as @hpaulj's comment suggested. norm was only being used to get the magnitudes of the vectors along axis 1. A much simpler and faster method was to just:

sub_areas = np.sqrt((sub_area_norm_vecs*sub_area_norm_vecs).sum(axis = 1))

This gives exactly the same results and made the vectorized code up to 25 times faster than the loop implementation (even when the loop doesn't use linalg.norm either).
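
To see where the time goes, the norm step can be benchmarked in isolation; a minimal sketch, assuming an intermediate shaped like `sub_area_norm_vecs` with N = 3000 and `num_area_divisions = 10`:

import timeit
import numpy as np

vecs = np.random.rand(2999, 3, 10, 10)  # stand-in for sub_area_norm_vecs
t_norm = timeit.timeit(lambda: np.linalg.norm(vecs, axis=1), number=20)
t_sqrt = timeit.timeit(lambda: np.sqrt((vecs * vecs).sum(axis=1)), number=20)
print(f"linalg.norm: {t_norm:.3f}s, sqrt-sum: {t_sqrt:.3f}s")

np.einsum('ijkl,ijkl->ikl', vecs, vecs) followed by np.sqrt is an equivalent alternative that also avoids the explicit vecs * vecs temporary.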

Eden Trainor