
So, continuing from the discussion @TheBlackCat and I were having in this answer, I would like to know the best way to pass arguments to a Numpy vectorized function. The function in question is defined thus:

vect_dist_funct = np.vectorize(lambda p1, p2: vincenty(p1, p2).meters)

where, vincenty comes from the Geopy package.

I currently call vect_dist_funct in this manner:

def pointer(point, centroid, tree_idx):
    intersect = list(tree_idx.intersection(point))
    if len(intersect) > 0:
        points = pd.Series([point]*len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        return pd.Series(dist, index=intersect, name='Dist').sort_values()
    else:
        return pd.Series(np.nan, index=[0], name='Dist')

points['geometry'].apply(lambda x: pointer(point=x.coords[0], centroid=line['centroid'], tree_idx=tree_idx))

(Please refer to the question here: Labelled datatypes Python)

My question pertains to what happens inside the function pointer. The reason I am converting points to a pandas.Series and then getting the values (in the 4th line, just under the if statement) is to get it into the same shape as polygons. If I merely call points either as points = [point]*len(intersect) or as points = itertools.repeat(point, len(intersect)), Numpy complains that it "cannot broadcast arrays of size (n,2) and size (n,) together" (n is the length of intersect).
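The shape mismatch can be reproduced without geopy or the spatial index. This is a minimal sketch (with a made-up coordinate and n=3) showing why the pandas detour works: a plain list of tuples becomes a 2-D float array, while going through a Series keeps each tuple as one object.

```python
import numpy as np
import pandas as pd

point = (-104.950752, 39.854744)
n = 3

# A plain list of tuples becomes an (n, 2) float array under np.asarray,
# which np.vectorize cannot broadcast against a 1-D (n,) array of centroids.
plain = np.asarray([point] * n)
print(plain.shape)        # (3, 2)

# Routing through a pandas Series keeps each tuple as a single object,
# yielding a 1-D object array that broadcasts element by element.
via_series = pd.Series([point] * n).values
print(via_series.shape)   # (3,)
print(via_series.dtype)   # object
```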

If I call vect_dist_funct like so: dist = vect_dist_funct(itertools.repeat(point, len(intersect)), polygons), vincenty complains that I have passed it too many arguments. I am at a complete loss to understand the difference between the two.

Note that these are coordinates, and therefore will always come in pairs. Here are examples of what point and polygons look like:

point = (-104.950752, 39.854744) # Passed directly to the function like this.
polygons = array([(-104.21750802451864, 37.84052458697633),
                  (-105.01017084789603, 39.82012158954065),
                  (-105.03965315742742, 40.669867471420886),
                  (-104.90353460825702, 39.837631505433706),
                  (-104.8650601872832, 39.870796282334744)], dtype=object)
           # As returned by statement centroid.loc[intersect].values

What is the best way to call vect_dist_funct in this circumstance, such that I can have a vectorized call, and neither Numpy nor vincenty will complain that I am passing wrong arguments? Techniques that minimize memory consumption and maximize speed are also welcome. The goal is to compute the distance between the point and each polygon centroid.

  • After a quick glance I find your question unclear. Anyway, as for "*cannot broadcast arrays of size (n,2) and size (n,) together*": you can only broadcast `(n,2)` with shape `(n,1)`, so you have to introduce a trailing singleton with `[:,None]` into your 1d array to make something like this work. – Andras Deak -- Слава Україні Jun 28 '16 at 17:54
  • Also, watch out with things like `[point]*len(intersect)`: you might end up with a list containing *the reference to the same variable*. If you mutate this list later, you could end up mistakenly mutating all the elements of the list at once. It's usually safer to do something like `[p for _ in range(len(intersect))]`. – Andras Deak -- Слава Україні Jun 28 '16 at 17:58
  • 1
    @AndrasDeak, what part of the question is not clear? I will edit it accordingly. Basically what I am trying to do is compute distances between one point and a set of points, in the most efficient manner possible. I thought Numpy vectorize will speed things up for me. – Kartik Jun 28 '16 at 18:30
  • OK, I see. A loop is probably indeed your best choice. – Andras Deak -- Слава Україні Jun 28 '16 at 18:34

1 Answer


np.vectorize doesn't really help you here. As per the documentation:

The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.

In fact, vectorize actively hurts you, since it converts the inputs into numpy arrays, doing an unnecessary and expensive type conversion and producing the errors you are seeing. You are much better off using a function with a for loop.

It is also better to use a named function rather than a lambda for a top-level function, since it lets you attach a docstring.

So this is how I would implement what you are doing:

def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`.  Returns
    a list where each element is the result of the `vincenty` function
    call for the corresponding element of `p2`.
    """
    return [vincenty(p1, p2i).meters for p2i in p2]
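For illustration, here is how the loop version would be called. The geodesic distance is replaced by a planar placeholder (`fake_meters`, a hypothetical stand-in) so the sketch runs without geopy; the real code would call `vincenty(p1, p2i).meters` instead.

```python
import math

# Hypothetical stand-in for geopy's vincenty(p1, p2).meters, so the
# sketch is self-contained. The call shape is the same as the real thing.
def fake_meters(p1, p2):
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def vect_dist_funct(p1, p2):
    """Return the distance from `p1` to each element of `p2` as a list."""
    return [fake_meters(p1, p2i) for p2i in p2]

point = (0.0, 0.0)
centroids = [(3.0, 4.0), (6.0, 8.0)]
print(vect_dist_funct(point, centroids))  # [5.0, 10.0]
```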

If you really want to use vectorize, you can use the excluded argument to not vectorize the p1 argument, or better yet set up a lambda that wraps vincenty and only vectorizes the second argument:

def vect_dist_funct(p1, p2):
    """Apply `vincenty` to `p1` and each element of `p2`.

    Iterate over `p2`, returning `vincenty` with the first argument
    as `p1` and the second as the current element of `p2`.  Returns
    a numpy array where each value is the result of the `vincenty` function
    call for the corresponding element of `p2`.
    """
    vinc_p = lambda x: vincenty(p1, x).meters
    return np.vectorize(vinc_p)(p2)
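The `excluded` route mentioned above would look roughly like this, again with a placeholder distance function standing in for `vincenty(...).meters`:

```python
import numpy as np

# Placeholder for vincenty(p1, p2).meters so the sketch runs without geopy.
def dist_meters(p1, p2):
    return ((p1[0] - p2[0])**2 + (p1[1] - p2[1])**2) ** 0.5

# excluded={0} tells np.vectorize to pass the first positional argument
# through unchanged instead of trying to broadcast over it.
vect = np.vectorize(dist_meters, excluded={0})

point = (0.0, 0.0)
polygons = np.empty(2, dtype=object)
polygons[:] = [(3.0, 4.0), (6.0, 8.0)]
print(vect(point, polygons))  # distances 5.0 and 10.0
```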
  • I didn't quite understand what the question is, but your answer sounds very reasonable:) – Andras Deak -- Слава Україні Jun 28 '16 at 18:19
  • Thank you. So `for` loop is the best implementation. I wish there was something parallel. I am on a Windows server, so executing things in parallel using the multiprocessing library consumes a lot of memory. – Kartik Jun 28 '16 at 18:38
  • 1
    The `vincenty` algorithm is inherently iterative, so it is difficult if not downright impossible to effectively vectorize. – TheBlackCat Jun 28 '16 at 18:43
  • Actually this one is turning out to be about 3 minutes slower than my implementation in the question for about 10% of the state of Colorado. I wonder why that might be; do you have any ideas? I am computing the spatial index based intersection in an apply, and then iterating over rows and computing distances. – Kartik Jun 29 '16 at 19:29
  • Like this: `points['intersect'] = points['coords'].apply(lambda x: list(tree_idx.intersection(x)))`, then: `for i, row in points.iterrows(): points.loc[i, 'dist'] = vect_dist_funct(row['coords'], lines.loc[row['intersect'], 'centroid'])` – Kartik Jun 29 '16 at 19:34
  • Actually, with your help I did manage to write a code that is faster, thank you for that. I would like to explain that by vectorize I mean execute computations on rows as simultaneously as possible. Since each row is independent of the others, they can be computed in parallel. – Kartik Jul 01 '16 at 18:40
  • That isn't vectorization (which is a different principle), it is parallelization. That can be done using the Python `multiprocessing` toolbox, although I don't know how much your code would benefit from it. – TheBlackCat Jul 01 '16 at 20:10
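Since each row is independent, the parallelization discussed in the comments can be sketched with the standard `multiprocessing` module. This uses a placeholder distance function (`point_distance`, an assumption standing in for `vincenty(...).meters`); note that on Windows each worker process duplicates memory, which is the cost mentioned above.

```python
import math
from functools import partial
from multiprocessing import Pool

def point_distance(p1, p2):
    # Placeholder for vincenty(p1, p2).meters
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def distances_parallel(point, centroids, processes=4):
    """Compute distances from `point` to each centroid in parallel."""
    with Pool(processes) as pool:
        # partial fixes the first argument; each worker gets one centroid.
        return pool.map(partial(point_distance, point), centroids)

if __name__ == "__main__":
    print(distances_parallel((0.0, 0.0), [(3.0, 4.0), (6.0, 8.0)]))
```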