
I am computing geodesic distances between a point and multiple line segments. Each line segment has a unique identifying number. I want my distance function to return the distances and the labels intrinsically tied together. I also want to retain functionality: sort the distances, index them by either label or position, and get back both the distance and the label. Something like a Pandas Series with an index would fit, but I cannot use a Series because the data is returned into a Pandas DataFrame, which then expands the Series and makes a mess. Here is an example:

In [1]: '''Note that all this happens inside an apply function of a Pandas Series'''
        labels = [25622, 25621, 25620, 25619, 25618]
        dist = vect_dist_funct(pt, labels) #vect_dist_funct does the computations, and returns distances in meters
        dist
Out[1]: array([296780.2217658355, 296572.4476883276, 296364.21166884096,
               296156.4366241771, 295948.6610171968], dtype=object)

What I want, however, is something like this dict, where the labels and distances are inherently tied to each other:

{25622 : 296780.2217658355,
 25621 : 296572.4476883276,
 25620 : 296364.21166884096,
 25619 : 296156.4366241771,
 25618 : 295948.6610171968}

But then I lose the functionality of the values: I cannot easily sort them, compare them, or anything. I looked at NumPy structured arrays, and they seem workable, but if I am not able to sort the distances and get the index of the closest segment, they will not be of much use to me. Is there any other datatype that I can use?
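For reference, a minimal sketch of the structured-array option (sorting by the dist field does keep each label tied to its distance; the values are the ones from the example above):

import numpy as np

# One structured array holding (label, distance) pairs together.
pairs = np.array([(25622, 296780.2217658355), (25621, 296572.4476883276),
                  (25620, 296364.21166884096), (25619, 296156.4366241771),
                  (25618, 295948.6610171968)],
                 dtype=[('label', 'i8'), ('dist', 'f8')])

pairs.sort(order='dist')           # labels move together with their distances
closest_label = pairs['label'][0]  # 25618, the nearest segment
closest_dist = pairs['dist'][0]    # 295948.6610171968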

Long Story and Background

I am trying to do a spatial join. I get the indexes of the segments a point is most likely close to by searching an RTree (example). Those are the indexes in labels. Then I look through the line geometry table for those selected labels and compute the distance from the point to each of those line segments.

Next steps involve sanity-checking the spatial join. The nearest segment is not the best join candidate in some cases, and the join needs to be evaluated on other parameters. Therefore, my plan is to work from the closest segment outward: sort on the distances, get the index of the closest segment, look up the segment table with that index, and extract other properties of the line for inspection. If a match can be confirmed, that segment is accepted; otherwise it is rejected, and the algorithm moves to the next closest segment.

What I am looking for is a data type that does all this without breaking the link between each distance and the segment from which it was computed.
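To make the plan concrete, here is a minimal sketch of the closest-outward loop, assuming a hypothetical confirm_match(label) that applies the other join criteria:

import numpy as np

def best_join(labels, dists, confirm_match):
    # Walk the candidates nearest-first; confirm_match is a hypothetical
    # stand-in for the other-parameter checks described above.
    for i in np.argsort(dists):
        if confirm_match(labels[i]):
            return labels[i], dists[i]   # accept this segment
    return None                          # every candidate was rejected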

Problem With Using Pandas

So this is how the function is actually being called:

joined = points['geometry'].apply(pointer, centroid=line['centroid'], tree_idx=tree_idx)

Then inside pointer, this happens:

def pointer(point, centroid, tree_idx):
    # Candidate line ids whose bounding boxes intersect the point's bounds.
    intersect = list(tree_idx.intersection(point.bounds))
    if len(intersect) > 0:
        points = pd.Series([point.coords[0]]*len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        return pd.Series(dist, index=intersect, name='Dist').sort_values()
    else:
        return pd.Series(np.nan, index=[0], name='Dist')

And then joined looks like a wide table with one column per line id, mostly NaN (there is no point in copying the table here).

This is because distances between all points (the rows are points) and all lines (the columns are lines) are not computed; that would be cost-prohibitive (4M points and 180k lines per state, with 50 states in the whole dataset). Also, the DataFrame merge operation that produces joined increases the run time 7-fold compared to when I return two NumPy arrays. The problem with returning two NumPy arrays is that it is not easy to keep the distances and the line IDs aligned at all times.
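For completeness, the two NumPy arrays can be kept aligned by routing every reordering through a single argsort permutation (names as in pointer above), though this discipline is easy to break elsewhere in the pipeline:

import numpy as np

labels = np.asarray(intersect)    # segment ids from the RTree query
dist = vect_dist_funct(points, polygons)

order = np.argsort(dist)          # one permutation reorders both arrays
labels, dist = labels[order], dist[order]
# labels[0] is now the id of the closest segment, dist[0] its distance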

Examples of Points, Lines, tree_idx

Note that this is a truncated dataset in both columns and rows. I am only including the columns of relevance, not the rest of the data:

points:

                        geometry
id      
88400001394219  0.00    POINT (-105.2363291 39.6988139)
                0.25    POINT (-105.2372017334178 39.69899060448157)
                0.50    POINT (-105.2380177896182 39.69933953105642)
                0.75    POINT (-105.2387202141595 39.69988447162143)
                1.00    POINT (-105.2393222 39.7005405)
88400002400701  0.00    POINT (-104.7102833 39.8318348)
                0.25    POINT (-104.7102827 39.831966625)
                0.50    POINT (-104.7102821 39.83209845)
                0.75    POINT (-104.7102815 39.832230275)
                1.00    POINT (-104.7102809 39.8323621)

So these are basically points interpolated along lines. The line id is the first level of the index, and the second level is the percentage at which the point was interpolated. This forms the first dataset, the one to which I want to bring attributes from the second dataset.

line:

        geometry                                            centroid
id      
71345   POLYGON ((-103.2077992965318 40.58026765162965...   (-103.20073265160862, 40.576450381964975)
71346   POLYGON ((-103.2069505830457 40.58155121711739...   (-103.19987394433825, 40.57774903464972)
71347   POLYGON ((-103.2061017677045 40.58283487609803...   (-103.19901204453959, 40.57905245493993)
71348   POLYGON ((-103.2052000154291 40.58419853220472...   (-103.19815200508097, 40.58035300329024)
71349   POLYGON ((-103.2043512639656 40.58548197865339...   (-103.19729445792181, 40.58164972491414)
71350   POLYGON ((-103.2035025651746 40.5867652936463,...   (-103.1964362470977, 40.5829473948391)
71351   POLYGON ((-103.2026535431035 40.58804903349249...   (-103.19557847342394, 40.58424434094705)
71352   POLYGON ((-103.201804801526 40.58933229190573,...   (-103.19472966696722, 40.58552767098465)
71353   POLYGON ((-103.2009557884142 40.59061590473365...   (-103.19388484652855, 40.58680427447224)
71354   POLYGON ((-103.2001001699726 40.59190793446012...   (-103.19303392095904, 40.5880882237994)

This is part of the second dataset (the labels mentioned at the beginning of this question are the index of this dataset). The goal is to transfer attributes from this dataset to the points dataset in an intelligent manner. The first step is to find the nearest line to each of the points. Then I will compare some attributes from the points dataset with the lines dataset, and confirm or reject a join, as I mentioned.

tree_idx:

tree_idx is created using the following code:

import rtree
lines_bounds = lines['geometry'].apply(lambda x: x.bounds)
tree_idx = rtree.index.Index()
for i in lines_bounds.index:
    tree_idx.insert(i, lines_bounds.loc[i])
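As a side note, rtree can also rank entries by bounding-box proximity directly; a minimal sketch (this is only a coarse pre-filter, not a geodesic distance, so the distance computation above is still needed):

# Ids of the 5 lines whose bounding boxes are nearest to the point's bounds.
nearest_ids = list(tree_idx.nearest(point.bounds, 5))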
  • Could you elaborate on the functionality you would like to have but can not achieve with a dictionary? – michael_j_ward Jun 24 '16 at 17:24
  • I am trying to do a spatial join. And in doing so, the closest is not always the correct join partner. There are other parameters that I will need to look at before confirming a join. So my plan is to move from the closest outward, checking and confirming or rejecting as I go. Dictionaries are not structured, and cannot be stored sorted. Also, dictionaries are not indexable by position, and I cannot know the key for all segments beforehand. – Kartik Jun 24 '16 at 17:35
  • Can you explain why using pandas Series/DataFrame does not work in your case? As pandas is exactly meant for the situation you explain (keeping the link between values and labels, but still being able to use the values, sort them, ...) – joris Jun 24 '16 at 18:00
  • Further, maybe `geopandas` would be interesting to look at: http://geopandas.readthedocs.io/en/latest/index.html – joris Jun 24 '16 at 18:00
  • @joris: The problem with using Pandas Series for this, though it fits the use case perfectly, is that I am calling the distance function from within an apply on another GeoPandas DataFrame. The apply is then expanding the returned Series and the resulting DataFrame has about 180k columns (mostly NaNs, because the distance to each point is not computed with each line). Also I have about 4M points, so it is just a huge table of mostly NaNs at the end. Not to mention the compute time increases 7 times for the DataFrame combine to happen. – Kartik Jun 24 '16 at 18:06
  • I see. Maybe you could try to change the applied method, so it does not return the huge expanded dataframe? (eg doing the selection within the apply, so you do not return a series but a single value from the apply) But a bit difficult to say without a concrete small example dataset :-) – joris Jun 24 '16 at 18:09
  • @Kartik: It sounds like you are doing your calculation in an inefficient manner. I think we would need to see the whole code to help. I think this is an example of the XY problem, where you are asking for data structure help while your problem is probably in your algorithm instead. – TheBlackCat Jun 24 '16 at 18:09
  • @TheBlackCat: Does the edit under Problem with Pandas in the question help with the algorithm I am using? Everything before that is just housekeeping: reading the data, identifying types, setting buffers, creating the RTree index, etc. The workhorse of this algorithm is the `pointer` function in the question. – Kartik Jun 24 '16 at 18:23
  • @Kartik: Can you provide small examples of `points`, `line`, and `tree_idx`? I think I am getting a better idea, but it is still hard without having some idea how the data is structured. – TheBlackCat Jun 24 '16 at 18:29
  • There, @TheBlackCat, done. I hope this has not caused more confusion (I have a feeling it might). – Kartik Jun 24 '16 at 18:43
  • Classes like `np.ma.MaskedArray` and `scipy.sparse` define array like objects that use 'real' arrays as attributes. Then they just define a bunch of methods that coordinate the use and creation of those arrays. Such a class does not have to have full array compatibility, just the methods that are important to your task. – hpaulj Jun 24 '16 at 19:06

3 Answers


So I think your overall problem is that you are creating a DataFrame where the column labels are the intersect values. What I think you want instead is a DataFrame where one column contains the intersect values and another contains the distances. I will try to give you code that I think will help, but it is hard to be certain without having your original data, so you may need to modify it somewhat to get it to work perfectly.

First, I would modify vect_dist_funct so that if the first argument is a scalar it creates a list of the correct length, and if the second is empty it returns NaN.
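Something like this sketch, where _vect_dist_funct stands in for the existing implementation (which is not shown in the question):

import numpy as np

def vect_dist_funct(points, polygons):
    # No candidate polygons: nothing to measure.
    if len(polygons) == 0:
        return np.nan
    # A single (x, y) tuple: expand it to a correct-length list.
    if isinstance(points, tuple):
        points = [points] * len(polygons)
    return _vect_dist_funct(points, polygons)  # the original implementation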

Next I would add all the useful values as columns to the DataFrame:

points['intersect'] = points['geometry'].apply(lambda x: np.array(list(tree_idx.intersection(x.bounds))))
points['polygons'] = points['intersect'].apply(lambda x: centroid.loc[x].values)
points['coords0'] = points['geometry'].apply(lambda x: x.coords[0])
points['dist'] = points.apply(lambda x: vect_dist_funct(x.coords0, x.polygons), axis=1)

This will give you a column with all the distances in it. If you really want the intersect values to be accessible, you can then create a DataFrame with just the intersects and distances, and then move the intersects into another MultiIndex level to avoid too many NaN values:

pairs = points.apply(lambda x: pd.DataFrame([x['intersect'], x['dist']], index=['intersect', 'dist']).T.stack(), axis=1)
pairs = pairs.stack(level=0).set_index('intersect', append=True)
pairs.index = pairs.index.droplevel(level=2)

This should give you a Series where the first index level is the id, the second is the percent, the third is the intersect, and the value is the distance.

  • Kudos! Thank you much for this! It caused a few "ah-ha", and resulted in a couple of forehead slaps. I was so caught up in my way, that I got stuck there. – Kartik Jun 24 '16 at 20:49
  • And as I implemented your answer, it turns out there are quite a bit of problems: Firstly, `tree_idx.intersection` gives a generator. But that was easily fixed by using `centroid.loc[list(x)]`. Then, since this would return a list in some places, distance calculation is not that straightforward. I am still working to figure out how to do that... – Kartik Jun 24 '16 at 21:59
  • Any ideas @TheBlackCat? The problem is that `points['coords0']` has a tuple of x and y coordinates. However `points['polygons']` is a list of tuples containing x and y coordinates. The length of this list can be anywhere from 0 to 151 (perhaps more in a really dense area). So basically, 'coords0' needs to be repeated to be as long as 'polygons', and distance computed for each element of this list, for all rows in the `points` DataFrame... – Kartik Jun 24 '16 at 22:43
  • @Kartik: That calculation is done in `vect_dist_funct`, right? Is there a reason you can't just do the calculation with `coords0` each time, rather than matching corresponding elements of `coords0` and `polygons`? If that is a real problem, you can do the duplication inside of `vect_dist_funct`, but I am having trouble thinking of a scenario where that sort of duplication is really necessary. – TheBlackCat Jun 27 '16 at 17:34
  • So `vect_dist_funct` calls Vincenty distance function available in Geopy. And numpy needs both arguments to be of the same shape when passing to vectorised function (hence the awkward conversion to Series inside `pointer`). Also I need the distances between points and the polygons. So I don't see how I can do what you are suggesting. – Kartik Jun 27 '16 at 18:05
  • @Kartik: Currently, you create many copies of `coords[0]` in `pointer`. What I am saying is that you can make those copies inside `vect_dist_funct` instead. You probably don't even need to make an explicit copy: looking at the geopy source code, `itertools.repeat` should do the trick, avoiding having to store a huge number of duplicate values in memory (see the sketch after this thread). – TheBlackCat Jun 27 '16 at 18:21
  • It doesn't work. `itertools.repeat()` does not repeat the tuple of `(x, y)` coordinates. It instead repeats `x` and `y` separately, prompting vincenty to complain that it got too many arguments. I don't know what I am missing. I had this same problem even with simple `list*number`. Therefore, I use the awkward conversion to pandas Series and then get values from that. – Kartik Jun 28 '16 at 00:20
  • @Kartik: How are you calling `itertools.repeat`? I tried it with `repeat((5, 6))` and `repeat((5, 6), 10)` and both returns a repeating iterable of `(5, 6)` tuples. `list*number` won't work unless you wrap the tuple in a list, such as `[(5, 6)]*10`. But it may be hard to help you more without seeing `vect_dist_funct`. – TheBlackCat Jun 28 '16 at 14:30
  • That is exactly how I call repeat as well. Perhaps it would be prudent if I moved this over to another question about `vect_dist_funct`, and the best way to pass values to it such that memory consumption is minimum. I'll post the link to that question here. – Kartik Jun 28 '16 at 17:06
  • Here: http://stackoverflow.com/questions/38082936/best-way-to-pass-repeated-parameter-to-a-numpy-vectorized-function. Thanks for your help. I really appreciate it. – Kartik Jun 28 '16 at 17:46
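For reference, a minimal sketch of the itertools.repeat approach from the thread above, assuming vect_dist_funct ultimately feeds coordinate pairs to geopy's vincenty (the real implementation is not shown here, and the (lon, lat) to (lat, lon) reversal is an assumption based on the data above):

from itertools import repeat
from geopy.distance import vincenty

def distances_to(pt, centroids):
    # pt is a single (lon, lat) tuple; repeat(pt) yields it once per centroid
    # without materialising len(centroids) copies in memory.
    # vincenty expects (lat, lon) ordering, hence the tuple reversal.
    return [vincenty(p[::-1], c[::-1]).meters
            for p, c in zip(repeat(pt), centroids)]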

So I think a DataFrame whose index is the labels is probably the simplest:

distances = {25622 : 296780.2217658355,
 25621 : 296572.4476883276,
 25620 : 296364.21166884096,
 25619 : 296156.4366241771,
 25618 : 295948.6610171968}

df = pd.DataFrame(list(distances.items()), columns=["label", "dist"]).sort_values('dist').set_index('label')
df

Outputs:

    dist
label   
25618   295948.661017
25619   296156.436624
25620   296364.211669
25621   296572.447688
25622   296780.221766

Then, if you want to access a distance by its label:

df.loc[25620]
Out:
dist    296364.211669
Name: 25620, dtype: float64

And then if you want to find labels 'near' that point, you can get the row-number with

row_num = df.index.get_loc(25620)
print(row_num)
Out: 2

And then you can access "near" points with df.iloc[row_num]:

df.iloc[3]
Out: 
dist    296572.447688
Name: 25621, dtype: float64

Does that cover everything you need?

  • No, because if I return a Series/DataFrame, it expands the whole thing into a lot of NaN values. The calling function is an apply on a GeoDataFrame, so when the return type is another Pandas Series, it is combined into the output making the whole process take too long. – Kartik Jun 24 '16 at 18:08
  • Wait, I assumed from your question that you can get your data into `{key: value}` pairs, is that not so? – michael_j_ward Jun 24 '16 at 18:09
  • This solution is both easily sortable and easily comparable and maintains the relationship between a label and its distance while maintaining the ability to easily find 'nearby' labels. How is that not exactly what you asked for? – michael_j_ward Jun 24 '16 at 18:14
  • I can get them into key : value pairs, Michael, but see the edit in the question, please. I have tried Pandas Series already. – Kartik Jun 24 '16 at 18:20
  • It sounds to me like you are too married to `df.apply`. The logic of what you are trying to do is more "for each column, compute the distances, check the shortest distance, perform the sanity check, and repeat until we find the correct label; continue for the next column". Maybe you should split up your actions to more properly conform to the logic of your process – michael_j_ward Jun 24 '16 at 18:30
  • Honestly, I have tried 10 different things to not use `df.apply`. But it turns out that `df.apply` is the best way to do what I am trying to do. – Kartik Jun 24 '16 at 18:46
  • That does not make any sense. `df.apply(func)` is syntactic sugar for `for column in df.columns: func(df[column])` with the additional functionality of shaping the return result, and it is this shaping that is giving you problems. So, if you use the latter, then you avoid this forced reshaping and can more directly proceed with your process logic. How in the world could `df.apply` be the only way? – michael_j_ward Jun 24 '16 at 18:55
  • Because, `df.apply` is faster than the `for` loop. I tried. I am not lying about it. The thing is that `df.apply` can do stuff, that a normal for loop cannot. For example, this: http://pandas.pydata.org/pandas-docs/version/0.18.1/groupby.html#flexible-apply (scroll down to read the warning). – Kartik Jun 24 '16 at 19:15

After everything, and after trying to make TheBlackCat's answer work for about 3 hours, I have decided to use xarray. So now the pointer function looks like this:

import numpy as np
import pandas as pd
import xarray as xr

def pointer(point, centroid, tree_idx):
    intersect = list(tree_idx.intersection(point.bounds))
    if len(intersect) > 0:
        points = pd.Series([point.coords[0]]*len(intersect)).values
        polygons = centroid.loc[intersect].values
        dist = vect_dist_funct(points, polygons)
        sorter = np.argsort(dist)
        # Distances and segment ids stay tied together, sorted nearest-first.
        return xr.DataArray(dist[sorter], [('dim0', np.asarray(intersect)[sorter])])
    else:
        return xr.DataArray(np.nan)

Done. This works for my needs. I have the distances and the segment IDs from which they were computed tied together, so that transformations on one affect the other. The distances are still operable, and xarray also gives me advanced functionality in terms of grouping, merging, etc.
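A short usage note, in case it helps anyone else (25620 is just an example label):

da = pointer(point, centroid, tree_idx)

da.values[0]          # smallest distance (the array is sorted nearest-first)
int(da['dim0'][0])    # id of the closest segment
da.sel(dim0=25620)    # distance for a specific segment id, if present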

Also, this takes about a minute to run on 0.1% of the data for a state, and 10 minutes for 10% of the data. Therefore, I expect 100% of the data to take about 100 minutes. But honestly, even if it took 3 hours per state, I could still finish all 50 states within a day (using multithreading on a 16-core server). So I am happy with this for the time being. Thanks for all the suggestions, especially from @TheBlackCat, @michael_j_ward, and @hpaulj.
