
I have a large dataframe containing location data from various ships around the world. `imoNo` is the ship identifier. Below is a sample of the dataframe, and here is the code to reproduce it:

import pandas as pd

# Initialise the data as lists
ships_data = {'imoNo':[9321483, 9321483, 9321483, 9321483, 9321483], 
                    'Timestamp':['2020-02-22 00:00:00', '2020-02-22 00:10:00', '2020-02-22 00:20:00', '2020-02-22 00:30:00', '2020-02-22 00:40:00'],
                    'Position Longitude':[127.814598, 127.805634, 127.805519, 127.808548, 127.812370],
                    'Position Latitude':[33.800232, 33.801899, 33.798885, 33.795799, 33.792931]} 

# Create DataFrame 
ships_df = pd.DataFrame(ships_data) 

What I need to do is add a column at the end of the dataframe, `sea_name`, which identifies the sea where the vessel is sailing.

In order to get there, I've found a dataset in .shp format at this link (IHO Sea Areas v3).

So, the way I do it is to go through each (long, lat) pair of the ships dataset, check which polygon that pair falls within, and return the name of the sea for the matching polygon. This is my code:

### Load libraries
import numpy as np
import pandas as pd
import geopandas as gp
import shapely.speedups
from shapely.geometry import Point, Polygon
shapely.speedups.enable()

### Check and map lon lat pair with sea name
def get_seaname(long, lat):
    # iho_df is the GeoDataFrame loaded from the IHO Sea Areas shapefile
    pnt = Point(long, lat)
    for i,j in enumerate(iho_df.geometry):
        if pnt.within(j):
            return iho_df.NAME.iloc[i]


### Apply the above function to the dataframe
ships_df['sea_name'] = ships_df.apply(lambda x: get_seaname(x['Position Longitude'], x['Position Latitude']), axis=1) 

However, this is a very time-demanding process. I tested it locally on my Mac with the first 1,000 rows of ships_df and it took around 1 minute to run. If it scales linearly, I will need around 14 days for the whole dataset :-D.

Any idea to optimize the function above would be appreciated.

Thank you!

oikonang
    Interesting question, but can you please post your sample data as text rather than image? Makes it **much** easier to work on. – Josh Friedlander Mar 26 '20 at 17:26
    Interesting.. I can surely apply vectorization to speed up the calculations, but I am wondering how I can speed up the conversion of lat and long into a namedtuple (a prerequisite to my vectorization idea) without making it stupidly slow... :( – dumbPy Mar 26 '20 at 17:35

2 Answers


Finally, I have something faster than the initial question.

Firstly, I created a polygon describing each sea's bounding box, using the min_X/min_Y/max_X/max_Y columns from the IHO Sea Areas dataset:

# Create a bbox polygon per sea from its min/max coordinates
iho_df['bbox'] = iho_df.apply(
    lambda x: Polygon([(x['min_X'], x['min_Y']),
                       (x['min_X'], x['max_Y']),
                       (x['max_X'], x['max_Y']),
                       (x['max_X'], x['min_Y'])]),
    axis=1)

Then, I changed the function to check the bbox first, which is much faster than checking the full geometry, since a bbox is just a rectangle. Only when a point falls within multiple boxes (bordering seas) does it test the actual polygons to find the correct sea name, and only the polygons whose boxes matched, rather than all of them.

# Function that checks and maps lon lat pair with sea name
def get_seaname(long,lat):
    pnt = Point(long,lat)
    names = []
    # Check within each bbox first to note the polygons to look at
    for i,j in enumerate(iho_df.bbox):
        if pnt.within(j):
            names.append(iho_df.NAME.iloc[i])
    # Return NaN when no bbox matched
    if len(names) == 0:
        return np.nan
    # A single match: return that sea's name directly
    elif len(names) == 1:
        return names[0]
    # Multiple matches: run pnt.within() only against those seas' polygons
    else:
        limited_df = iho_df[iho_df['NAME'].isin(names)].reset_index(drop=True)
        for k, n in enumerate(limited_df.geometry):
            if pnt.within(n):
                return limited_df.NAME.iloc[k]

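The gain comes from the rectangle test being trivially cheap compared with a full point-in-polygon test. As a minimal pure-Python sketch (toy coordinates, not the real IHO data), the bbox check amounts to four comparisons:

```python
def in_bbox(lon, lat, min_x, min_y, max_x, max_y):
    # Rectangle containment: four float comparisons, no geometry objects needed
    return min_x <= lon <= max_x and min_y <= lat <= max_y

# Toy bounding box roughly around the sample ship positions
print(in_bbox(127.81, 33.80, 120.0, 30.0, 135.0, 40.0))  # True
print(in_bbox(0.0, 0.0, 120.0, 30.0, 135.0, 40.0))       # False
```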
This minimized the runtime significantly. To boost it even further, I used multiprocessing to parallelize the work across the CPU cores. The idea was taken from another StackOverflow post I don't recall right now, but here is the code.

import multiprocessing as mp

# Function that parallelizes the apply function among the cores of the CPU
def parallelize_dataframe(df, func, n_cores):
    df_split = np.array_split(df, n_cores)
    pool = mp.Pool(n_cores)
    df = pd.concat(pool.map(func, df_split))
    pool.close()
    pool.join()
    return df
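A quick aside on the chunking step: np.array_split produces n roughly equal chunks even when the length is not evenly divisible, and concatenating them restores the original order (toy example, independent of the ship data):

```python
import numpy as np

# 10 rows into 4 chunks -> sizes 3, 3, 2, 2; pool.map processes each chunk
chunks = np.array_split(np.arange(10), 4)
print([len(c) for c in chunks])                          # [3, 3, 2, 2]
print(np.concatenate(chunks).tolist() == list(range(10)))  # True
```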

# Function that adds a sea_name column in the main dataframe
def add_features(df):
    # Apply the function
    df['sea_name'] = df.apply(lambda x: get_seaname(x['Position Longitude'], x['Position Latitude']), axis=1)
    return df

Finally, instead of applying get_seaname() to the dataframe directly, I pass add_features() to parallelize_dataframe() so it runs on all available CPU cores:

### Apply the above function to the dataframe
ships_df = parallelize_dataframe(ships_df, add_features, n_cores=mp.cpu_count())

I hope my solution helps other people too!

oikonang

Try this:

Make a Point out of each lat/long pair with apply (I can't think of a faster method; suggestions welcome):

import numpy as np
import pandas as pd
import geopandas as gp
import shapely.speedups
from shapely.geometry import Point, Polygon
shapely.speedups.enable()

# I am still uncomfortable with this. More ideas on speeding up this part are welcome
ships_df['point'] = ships_df.apply(lambda x: Point(x['Position Longitude'], x['Position Latitude']), axis=1)
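One possibly faster option, assuming a reasonably recent geopandas is available, is the vectorized `gp.points_from_xy` helper; this is a sketch on the sample coordinates, not benchmarked against the full dataset:

```python
import pandas as pd
import geopandas as gp

# Small stand-in for the ships dataframe from the question
ships_df = pd.DataFrame({'Position Longitude': [127.814598, 127.805634],
                         'Position Latitude': [33.800232, 33.801899]})

# Vectorized Point construction, avoiding a Python-level apply per row
ships_df['point'] = gp.points_from_xy(ships_df['Position Longitude'],
                                      ships_df['Position Latitude'])
print(ships_df['point'].iloc[0].x)  # 127.814598
```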

Now refactor your function to take a Point:

def get_seaname(pnt:Point):
    for i,j in enumerate(iho_df.geometry):
        if pnt.within(j):
            return iho_df.NAME.iloc[i]

Since the method works on a single point, convert the point column into an array of Point objects and vectorize the method with np.vectorize:

get_seaname = np.vectorize(get_seaname)

ships_df['sea_name'] = pd.Series(get_seaname(ships_df['point'].values))
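Worth noting: np.vectorize is essentially a convenience loop rather than true vectorization, so the gain mainly comes from avoiding pandas apply overhead. Its behavior on a toy scalar function (hypothetical labels, not the real sea names) looks like this; when the wrapped function returns None for unmatched points, np.vectorize may also produce the kind of RuntimeWarning mentioned in the comments:

```python
import numpy as np

def classify(lon):
    # Toy stand-in for get_seaname: returns one label per scalar input
    return 'east' if lon > 0 else 'west'

vec_classify = np.vectorize(classify)
print(list(vec_classify(np.array([127.8, -45.0]))))  # ['east', 'west']
```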
dumbPy
  • Thank you for the quick response! I tried your suggestion on the 1000-row sample and it still takes a lot of time (around 1 minute), and before it finishes I get an error: `/Users/oikonang/miniconda3/envs/geo/lib/python3.7/site-packages/numpy/lib/function_base.py:2167: RuntimeWarning: invalid value encountered in ? (vectorized) outputs = ufunc(*inputs)` – oikonang Mar 26 '20 at 18:04