
Data

I have a dataframe that contains 5 columns:

  • Latitude and longitude of origin (origin_lat, origin_lng)
  • Latitude and longitude of destination (dest_lat, dest_lng)
  • A score which was computed from the other fields

I have a matrix M that contains pairs of origin and destination latitude/longitude. Some of these pairs exist in the dataframe, others do not.

Goal

My goal is two-fold:

  1. Select all pairs from M that are not present in the first four columns of the dataframe, apply a function func to them (to calculate the score column), and append the results to the existing dataframe. Note: we should not recalculate the score for already existing rows.
  2. After adding the missing rows, select all the rows defined by the selection matrix M in a new dataframe dfs.

Example code

import numpy as np
import pandas as pd

# STEP 1: Generate example data
ctr_lat = 40.676762
ctr_lng = -73.926420
N = 12
N2 = 3

data = np.array([ctr_lat+np.random.random((N))/10,
                 ctr_lng+np.random.random((N))/10,
                 ctr_lat+np.random.random((N))/10,
                 ctr_lng+np.random.random((N))/10]).transpose()

# Example function - does not matter what it does
def func(x):
    return np.random.random()

# Create dataframe
geocols = ['origin_lat','origin_lng','dest_lat','dest_lng']
df = pd.DataFrame(data,columns=geocols)
df['score'] = df.apply(func,axis=1)

Which gives me a dataframe df like this:

    origin_lat  origin_lng   dest_lat   dest_lng     score
0    40.684887  -73.924921  40.758641 -73.847438  0.820080
1    40.703129  -73.885330  40.774341 -73.881671  0.104320
2    40.761998  -73.898955  40.767681 -73.865001  0.564296
3    40.736863  -73.859832  40.681693 -73.907879  0.605974
4    40.761298  -73.853480  40.696195 -73.846205  0.779520
5    40.712225  -73.892623  40.722372 -73.868877  0.628447
6    40.683086  -73.846077  40.730014 -73.900831  0.320041
7    40.726003  -73.909059  40.760083 -73.829180  0.903317
8    40.748258  -73.839682  40.713100 -73.834253  0.457138
9    40.761590  -73.923624  40.746552 -73.870352  0.867617
10   40.748064  -73.913599  40.746997 -73.894851  0.836674
11   40.771164  -73.855319  40.703426 -73.829990  0.010908

I can then artificially create the selection matrix M, which contains 3 rows that exist in the dataframe and 3 rows that do not.

# STEP 2: Generate data to select
# As an example, I select 3 rows that are part of the dataframe, and 3 that are not
data2 = np.array([ctr_lat+np.random.random((N2))/10,
                  ctr_lng+np.random.random((N2))/10,
                  ctr_lat+np.random.random((N2))/10,
                  ctr_lng+np.random.random((N2))/10]).transpose()

M = np.concatenate((data[4:7,:],data2))

The matrix M looks like this:

array([[ 40.7612977 , -73.85348031,  40.69619549, -73.84620489],
       [ 40.71222463, -73.8926234 ,  40.72237185, -73.86887696],
       [ 40.68308567, -73.84607722,  40.73001434, -73.90083107],
       [ 40.7588412 , -73.87128079,  40.76750639, -73.91945371],
       [ 40.74686156, -73.84804047,  40.72378653, -73.92207075],
       [ 40.6922673 , -73.88275402,  40.69708748, -73.87905543]])

From here, I do not know how to determine which rows of M are not present in df and how to add them. Nor do I know how to select all the rows of df that are in M.

Ideas

My idea was to identify the missing rows, append them to df with a NaN score, and recompute the score for the NaN rows only. However, I do not know how to select these rows efficiently without looping over each element of the matrix M.

Any suggestion? Thanks a lot for your help!

nbeuchat

5 Answers


Is there any reason not to use merge?

df2 = pd.DataFrame(M, columns=geocols) 
df = df.merge(df2, how='outer')
ix = df.score.isnull()
df.loc[ix, 'score'] = df.loc[ix].apply(func, axis=1)

It does exactly what you proposed: it adds the missing rows to df with a NaN score, identifies the NaNs, and calculates the scores for those rows only.

igrinis
  • Would you use a `inner` join at the end to select all the rows from matrix M? – nbeuchat Sep 20 '17 at 13:06
  • No. The inner join will give you the values that are present in both tables, which would make `df` a subset of what it was prior to the operation. If you need to know which lines are present in `df` and absent in `M`, you can convert those 4 columns to an `Index` and use the [difference()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.difference.html#pandas.Index.difference) operator – igrinis Sep 21 '17 at 15:07
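For completeness, both goals can be covered with merges. The following is a minimal runnable sketch on toy data; the row sum stands in for the asker's real func, which is an assumption for illustration only:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']

# Toy data: 4 known pairs; M repeats 2 of them and adds 2 new ones
data = np.random.random((4, 4))
df = pd.DataFrame(data, columns=geocols)
df['score'] = df.sum(axis=1)                 # stand-in for func
M = np.vstack([data[1:3], np.random.random((2, 4))])

# Goal 1: outer merge on the coordinate columns adds the unseen pairs with NaN scores
df2 = pd.DataFrame(M, columns=geocols)
df = df.merge(df2, how='outer')
ix = df['score'].isnull()
df.loc[ix, 'score'] = df.loc[ix, geocols].sum(axis=1)   # score only the new rows

# Goal 2: the rows defined by M are the inner merge of the updated df with df2
dfs = df.merge(df2, how='inner')
```

After this, df holds every pair exactly once with a score, and dfs contains one row per row of M.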

This solution does loop over each row of M, but not over each element. The steps are:

  1. Go through each row of M and check whether it is in df. If it is, save the index; if it is not, calculate the score and save the row.
  2. Create the M dataframe by taking the new M rows from above and appending the rows found in df.
  3. Create the new version of the dataframe by just appending the new rows of M.

Hopefully this helps - I realise it still has a loop in it, but I have not figured out how to get rid of it. Your question only states that df could be big and that you wanted to avoid looping over the elements of M, which this at least avoids by looping over rows only.


M_in_df = []
M_not_in_df = []

for m in M:
    df_index = (df.iloc[:,:4].values == m).all(axis=1)
    if df_index.any():
        M_in_df.append(np.argmax(df_index))
    else:
        M_not_in_df.append(np.append(m, func(m)))    

M_df = pd.concat([pd.DataFrame(M_not_in_df, columns=df.columns), df.iloc[M_in_df]],
                 ignore_index=True)

new_df = pd.concat([df, pd.DataFrame(M_not_in_df, columns=df.columns)],
                   ignore_index=True)
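If the loop over the rows of M ever becomes a bottleneck, the membership test itself can be vectorised with NumPy broadcasting. A small sketch on toy data, under the same exact-float-equality assumption as the answer above:

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']
rng = np.random.default_rng(2)

data = rng.random((5, 4))
df = pd.DataFrame(data, columns=geocols)
df['score'] = 1.0
M = np.vstack([data[1:3], rng.random((2, 4))])   # 2 known rows, 2 new ones

# eq[i, j] is True when M[i] matches df row j on all four columns
eq = (M[:, None, :] == df[geocols].to_numpy()[None, :, :]).all(axis=2)
in_df = eq.any(axis=1)        # which rows of M already exist in df
M_new = M[~in_df]             # the rows that still need scoring
```

Note that this builds a `(len(M), len(df), 4)` boolean array, so it trades memory for speed; that is fine for moderate sizes but not for very large frames.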
Ken Syme

Convert M to a DataFrame, concat with df:

df2 = pd.DataFrame(M, columns=geocols)
df3 = pd.concat([df, df2], ignore_index=True)

Drop the duplicate rows based only on the columns in geocols:

df3 = df3.drop_duplicates(subset=geocols)

Get a mask of rows with NaN for score:

m = df3.score.isnull()

Apply the score to the masked rows, and store in df3:

df3.loc[m, 'score'] = df3[m].apply(func, axis=1)

You will get a SettingWithCopyWarning, but it works.
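Putting the steps above together, here is a runnable sketch on toy data; the mean of the coordinates stands in for func (an assumption for illustration, not the asker's real scoring):

```python
import numpy as np
import pandas as pd

geocols = ['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng']
rng = np.random.default_rng(1)

data = rng.random((4, 4))
df = pd.DataFrame(data, columns=geocols)
df['score'] = df[geocols].mean(axis=1)         # stand-in for func
M = np.vstack([data[:2], rng.random((2, 4))])  # 2 duplicates, 2 new rows

df2 = pd.DataFrame(M, columns=geocols)
# df's rows come first, so drop_duplicates keeps their existing scores
df3 = pd.concat([df, df2], ignore_index=True).drop_duplicates(subset=geocols)
m = df3['score'].isnull()
df3.loc[m, 'score'] = df3.loc[m, geocols].mean(axis=1)
```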

Kyle

You are doing geospatial analysis, so I think it is very important to incorporate some standard approaches. Each of your rows/entries is identified by a pair of coordinates, so converting them into WKT makes a lot of sense.

With WKT, all you need to check is whether the WKT of the new data is already found in the old data:

from shapely.wkt import dumps
import shapely.geometry as sg

In [27]: M = np.array([[ 40.761998, -73.898955, 40.767681, -73.865001],
    ...:               [ 40.736863, -73.859832, 40.681693, -73.907879],
    ...:               [ 40.761298, -73.853480, 40.696195, -73.846205],
    ...:               [ 40.7588412 , -73.87128079,  40.76750639, -73.91945371],
    ...:               [ 40.74686156, -73.84804047,  40.72378653, -73.92207075],
    ...:               [ 40.6922673 , -73.88275402,  40.69708748, -73.87905543]])
In [28]: df
Out[28]: 
    origin_lat  origin_lng   dest_lat   dest_lng     score  
0    40.684887  -73.924921  40.758641 -73.847438  0.820080   
1    40.703129  -73.885330  40.774341 -73.881671  0.104320   
2    40.761998  -73.898955  40.767681 -73.865001  0.564296   
3    40.736863  -73.859832  40.681693 -73.907879  0.605974   
4    40.761298  -73.853480  40.696195 -73.846205  0.779520   
5    40.712225  -73.892623  40.722372 -73.868877  0.628447   
6    40.683086  -73.846077  40.730014 -73.900831  0.320041   
7    40.726003  -73.909059  40.760083 -73.829180  0.903317   
8    40.748258  -73.839682  40.713100 -73.834253  0.457138   
9    40.761590  -73.923624  40.746552 -73.870352  0.867617   
10   40.748064  -73.913599  40.746997 -73.894851  0.836674   
11   40.771164  -73.855319  40.703426 -73.829990  0.010908   

# Generate WKT for the original dataframe
In [29]: df['wkt'] = df.apply(lambda x: dumps(sg.LineString([x[:2], x[2:4]]),
                                              rounding_precision=6),
                              axis=1)

In [29]: df
Out[29]: 
    origin_lat  origin_lng   dest_lat   dest_lng     score                                                 wkt
0    40.684887  -73.924921  40.758641 -73.847438  0.820080   LINESTRING (40.684887 -73.924921, 40.758641 -7...
1    40.703129  -73.885330  40.774341 -73.881671  0.104320   LINESTRING (40.703129 -73.885330, 40.774341 -7...
2    40.761998  -73.898955  40.767681 -73.865001  0.564296   LINESTRING (40.761998 -73.898955, 40.767681 -7...
3    40.736863  -73.859832  40.681693 -73.907879  0.605974   LINESTRING (40.736863 -73.859832, 40.681693 -7...
4    40.761298  -73.853480  40.696195 -73.846205  0.779520   LINESTRING (40.761298 -73.853480, 40.696195 -7...
5    40.712225  -73.892623  40.722372 -73.868877  0.628447   LINESTRING (40.712225 -73.892623, 40.722372 -7...
6    40.683086  -73.846077  40.730014 -73.900831  0.320041   LINESTRING (40.683086 -73.846077, 40.730014 -7...
7    40.726003  -73.909059  40.760083 -73.829180  0.903317   LINESTRING (40.726003 -73.909059, 40.760083 -7...
8    40.748258  -73.839682  40.713100 -73.834253  0.457138   LINESTRING (40.748258 -73.839682, 40.713100 -7...
9    40.761590  -73.923624  40.746552 -73.870352  0.867617   LINESTRING (40.761590 -73.923624, 40.746552 -7...
10   40.748064  -73.913599  40.746997 -73.894851  0.836674   LINESTRING (40.748064 -73.913599, 40.746997 -7...
11   40.771164  -73.855319  40.703426 -73.829990  0.010908   LINESTRING (40.771164 -73.855319, 40.703426 -7...

# Generate WKT for the new data
In [30]: new_wkt = [dumps(sg.LineString(r.reshape(2,2)), 
                          rounding_precision=6)
                    for r in M]
In [30]: np.isin(new_wkt, df.wkt)
Out[30]: array([ True,  True,  True, False, False, False], dtype=bool)

# Only put the pairs whose WKT is not found in the original dataframe into a new dataframe
In [31]: df2 = pd.DataFrame(M[~np.isin(new_wkt, df.wkt)], columns=['origin_lat', 'origin_lng', 'dest_lat', 'dest_lng'])
In [32]: df2['wkt'] = np.array(new_wkt)[~np.isin(new_wkt, df.wkt)]

# Only do calculation for the new entries
In [33]: df2['score'] = 0  # or do whatever score calculation needed

# Combine the new to the old
In [34]: df.append(df2)
Out[34]: 
     dest_lat   dest_lng  origin_lat  origin_lng     score                                                wkt
0   40.758641 -73.847438   40.684887  -73.924921  0.820080  LINESTRING (40.684887 -73.924921, 40.758641 -7...
1   40.774341 -73.881671   40.703129  -73.885330  0.104320  LINESTRING (40.703129 -73.885330, 40.774341 -7...
2   40.767681 -73.865001   40.761998  -73.898955  0.564296  LINESTRING (40.761998 -73.898955, 40.767681 -7...
3   40.681693 -73.907879   40.736863  -73.859832  0.605974  LINESTRING (40.736863 -73.859832, 40.681693 -7...
4   40.696195 -73.846205   40.761298  -73.853480  0.779520  LINESTRING (40.761298 -73.853480, 40.696195 -7...
5   40.722372 -73.868877   40.712225  -73.892623  0.628447  LINESTRING (40.712225 -73.892623, 40.722372 -7...
6   40.730014 -73.900831   40.683086  -73.846077  0.320041  LINESTRING (40.683086 -73.846077, 40.730014 -7...
7   40.760083 -73.829180   40.726003  -73.909059  0.903317  LINESTRING (40.726003 -73.909059, 40.760083 -7...
8   40.713100 -73.834253   40.748258  -73.839682  0.457138  LINESTRING (40.748258 -73.839682, 40.713100 -7...
9   40.746552 -73.870352   40.761590  -73.923624  0.867617  LINESTRING (40.761590 -73.923624, 40.746552 -7...
10  40.746997 -73.894851   40.748064  -73.913599  0.836674  LINESTRING (40.748064 -73.913599, 40.746997 -7...
11  40.703426 -73.829990   40.771164  -73.855319  0.010908  LINESTRING (40.771164 -73.855319, 40.703426 -7...
0   40.767506 -73.919454   40.758841  -73.871281  0.000000  LINESTRING (40.758841 -73.871281, 40.767506 -7...
1   40.723787 -73.922071   40.746862  -73.848040  0.000000  LINESTRING (40.746862 -73.848040, 40.723787 -7...
2   40.697087 -73.879055   40.692267  -73.882754  0.000000  LINESTRING (40.692267 -73.882754, 40.697087 -7...

Additional comments:

  1. With geospatial information encoded in WKT/WKB, one can easily use available geospatial libraries to calculate the score column (if it involves such calculations)
  2. Setting the correct precision for WKT is usually a necessary consideration for geospatial data (here I set the precision to 6)
  3. Performance. The dimensions of df (or the df.append(df2) part) change every time you add new rows from a new dataframe (here df2). Inherently, this means performance will suffer if such updates happen a lot.
  4. If the analysis is built around geospatial data, geopandas might be worth looking into.
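On point 3 above: the usual remedy for repeated appends is to collect the new batches in a list and concatenate once at the end, e.g.:

```python
import pandas as pd

# Each append/concat in a loop copies the whole frame; collecting the
# batches first and concatenating once avoids the quadratic copying
batches = [pd.DataFrame({'x': [i, i + 1]}) for i in range(3)]
combined = pd.concat(batches, ignore_index=True)
```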
CT Zhu

Let's first shape M as a dataframe named df_temp:

In [1]: df_temp=pd.DataFrame(M,columns=('origin_lat','origin_lng','dest_lat','dest_lng'))

In [2]: df_temp
Out[2]:
   origin_lat  origin_lng   dest_lat   dest_lng
0   40.724872  -73.843830  40.768628 -73.875295
1   40.744625  -73.858908  40.770675 -73.915897
2   40.683664  -73.916877  40.700891 -73.904609
3   40.774582  -73.871768  40.703176 -73.833921
4   40.680940  -73.839505  40.752041 -73.882552
5   40.677105  -73.897702  40.743859 -73.883683

Using merge, we can now easily keep track of the elements in df that were in M :

In [3]: dfs = df.merge(df_temp, on=['origin_lat','origin_lng','dest_lat','dest_lng'],
   ...:                right_index=True)

In [4]: dfs
Out[4]:
   origin_lat  origin_lng   dest_lat   dest_lng     score
4   40.724872  -73.843830  40.768628 -73.875295  0.705182
5   40.744625  -73.858908  40.770675 -73.915897  0.724282
6   40.683664  -73.916877  40.700891 -73.904609  0.645395

Note: the right_index argument allows us to keep the indexing of df, so that we know which lines of df were also in M.

Finally, we can add the lines of df_temp that were not in df:

# Compute the scores of df_temp
df_temp['score'] = [func(df_temp.iloc[i]) for i in range(len(df_temp))]
# Append the elements of df_temp to df (concat returns a new frame, so assign it back)
df = pd.concat([df, df_temp], ignore_index=True)
# Erase duplicates, keeping the original rows (and scores) from df
df = df.drop_duplicates(subset=['origin_lat','origin_lng','dest_lat','dest_lng'])

Note: the subset argument in drop_duplicates is only needed because your score function is non-deterministic.

plalanne