
I could use some help.

The main problem is to calculate the distance between two points given their latitude and longitude. We have divided Brazil into roughly 33k hexagons, listed in the dataframe below:

[image: dataframe of hexagons with latitude (`lat`) and longitude (`lon`) columns]

I've been trying to merge this dataframe with a copy of itself so I would have a roughly 1-billion-row dataframe with all combinations of those hexagons (33k × 33k), and then calculate the distances between them with this function:

import numpy as np

def get_distance(lat_1, lng_1, lat_2, lng_2):
    # Haversine formula; coordinates are expected in radians.
    d_lat = lat_2 - lat_1
    d_lng = lng_2 - lng_1

    temp = (
        np.sin(d_lat / 2) ** 2
        + np.cos(lat_1) * np.cos(lat_2) * np.sin(d_lng / 2) ** 2
    )

    # Great-circle distance with Earth radius 6373 km, multiplied by 1.4
    # and converted from kilometres to metres.
    return 6373.0 * (2 * np.arctan2(np.sqrt(temp), np.sqrt(1 - temp))) * 1.4 * 1000
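Since the function is written with NumPy it already works element-wise on whole arrays, not just scalars; for example (made-up coordinates in degrees, converted to radians, since the haversine formula expects radians):

import numpy as np

# one reference point compared against an array of points
lat_1 = np.radians(-23.55)
lng_1 = np.radians(-46.63)
lat_2 = np.radians(np.array([-22.91, -15.78, -3.12]))
lng_2 = np.radians(np.array([-43.17, -47.93, -60.02]))

print(get_distance(lat_1, lng_1, lat_2, lng_2))  # array of distances in metres (scaled by 1.4)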

I tried merging them with Pandas and got a memory error (it needed about 8 GB), so I used the Vaex library to convert the data to HDF5 files. However, when I try to merge those with the code below, I get the same error.

# dvv and dvv2 are the Vaex dataframes loaded from the HDF5 files
with h5py.File('mergedfs', 'w') as hdf:
    hdf.create_dataset('datasetmerge',
                       data=dvv.join(dvv2, left_on='key', right_on='key2', how='left', allow_duplication=True))
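For context, the Pandas version of the merge was essentially a cross join of the dataframe with itself, roughly like the sketch below (the constant join key and the column names are assumptions on my part):

import numpy as np
import pandas as pd

# tiny stand-in for the real 33k-row hexagon dataframe
df = pd.DataFrame({'lat': np.radians([-23.55, -22.91, -15.78]),
                   'lon': np.radians([-46.63, -43.17, -47.93])})

df['key'] = 1                                          # constant key -> every row pairs with every row
pairs = df.merge(df, on='key', suffixes=('_1', '_2'))  # n * n rows; with 33k rows this is where
                                                       # the MemoryError shows up
pairs['dist_m'] = get_distance(pairs['lat_1'], pairs['lon_1'],
                               pairs['lat_2'], pairs['lon_2'])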

Has anyone been through something like this before? I appreciate the help in advance.

Also, if you have any alternative solutions, I'd be glad to hear them!

  • You need a parallelization or out-of-core framework. I won't suggest packages here, but my engineers have used these at work with frames > 250GB. – TomServo Jan 21 '21 at 21:50
  • Another alternative when you have limited memory is to read and operate on slices of the data, then write those slices to the HDF5 file. Do this in a loop until all slices have been processed. I am not familiar with Pandas, so don't know if it can read slices. You can certainly do this with h5py or PyTables. I just wrote an example that does an element-wise sum of 2 matrices and writes to another file. See this Answer: https://stackoverflow.com/a/65812444/10462884 (a minimal sketch of this slice-wise pattern follows these comments). – kcw78 Jan 21 '21 at 22:06
  • @kcw78 thank you for your answer! The thing is, I've been trying to calculate the distances over all pairs of points, but your code calculates the element-wise sum of two equal-sized datasets, returning a third dataset with the same shape. I'm looking for a way to slice the problem and maybe get separate output datasets so I could append them and end up with the 1-billion-row frame (33k x 33k). How could I combine the lat and long columns of both datasets to get all combinations of hexagons, i.e. a 4-column dataset with both pairs of coordinates, and then calculate the distance for each pair? – Victor Ferreira Jan 22 '21 at 13:39
  • Yes my example is a simple sum. I shared it to show how you can read a slice of data and do math operations on it. I need more details about the schema of the files (dtype in h5py/numpy terms). It's not clear why you need 2 files (if file 2 is a copy of file 1). If I understand, `lat_1` and `lng_1` come from one row and `lat_2` and `lng_2` come from a different row. You could do this with 2 loops, where the outer loop `i=1 to 33k` and the inner loop `j=i to 33k`. – kcw78 Jan 22 '21 at 15:16
  • Do you really need all distances? – Willem Hendriks Jan 22 '21 at 19:51
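For reference, the slice-read / slice-write pattern kcw78 describes (and links to) in the comments above looks roughly like this; the file names, dataset names and chunk size are placeholders, not taken from the question:

import h5py

CHUNK = 1000  # rows processed per slice (placeholder; tune to available memory)

with h5py.File('input.h5', 'r') as f_in, h5py.File('output.h5', 'w') as f_out:
    a = f_in['dataset_a']          # two equal-sized input datasets
    b = f_in['dataset_b']
    out = f_out.create_dataset('a_plus_b', shape=a.shape, dtype=a.dtype)

    n = a.shape[0]
    for start in range(0, n, CHUNK):
        stop = min(start + CHUNK, n)
        # read one slice of each input, operate in memory, write the result slice
        out[start:stop] = a[start:stop] + b[start:stop]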

1 Answer


Without the dtype for your dataset, I can't write an exact solution. However, looking at your code and file image, I can take a guess at what the pseudo-code should look like. See below. I used assumed names for your HDF5 file and the dataset, and the column/field names shown in your image for the latitude and longitude data. This simply loops over the hexagon pairs, calls get_distance() to calculate each distance and assigns the value to dist_i_j. Note: I call your function get_distance() as-is. Hopefully this will get you started. Or, if you can provide more info about the dtype (or some example data), I can fine-tune the code. Good luck!

import h5py

# 'not_mergedfs' and 'ds_name' are placeholder names; open the source file read-only.
with h5py.File('not_mergedfs', 'r') as hdf:
    data = hdf['ds_name']

    # Loop over the unique pairs (i, j) with j > i; 32882 is the number of hexagons.
    for i in range(32882):
        lat_1 = data[i]['lat']
        lng_1 = data[i]['lon']

        for j in range(i + 1, 32882):
            lat_2 = data[j]['lat']
            lng_2 = data[j]['lon']

            dist_i_j = get_distance(lat_1, lng_1, lat_2, lng_2)
            # ...store dist_i_j (e.g. write it to an output dataset) here...
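If the scalar double loop turns out to be too slow, the same idea can be done block-wise and vectorized: compute one block of rows of the 33k × 33k distance matrix at a time with NumPy broadcasting and write each block straight to an output HDF5 dataset, so the full billion-element result never has to sit in memory. A rough sketch, reusing the placeholder file/dataset names above and assuming `lat`/`lon` are already in radians (the output file and dataset names are also made up):

import h5py
import numpy as np

N = 32882     # number of hexagons
BLOCK = 500   # source rows per iteration (placeholder; tune to available memory)

with h5py.File('not_mergedfs', 'r') as f_in, h5py.File('distances.h5', 'w') as f_out:
    data = f_in['ds_name'][:]      # 33k rows is small enough to read fully into memory
    lat = data['lat']
    lon = data['lon']

    # The full N x N result lives only on disk and is filled block by block.
    dist = f_out.create_dataset('dist_m', shape=(N, N), dtype='float32')

    for start in range(0, N, BLOCK):
        stop = min(start + BLOCK, N)
        # Broadcasting (block, 1) against (1, N) gives a (block, N) slab of distances.
        dist[start:stop, :] = get_distance(lat[start:stop, None], lon[start:stop, None],
                                           lat[None, :], lon[None, :])

Even stored as float32, the full matrix is about 32882 x 32882 x 4 bytes ≈ 4.3 GB on disk, which is a good reason to think about Willem Hendriks' comment on whether you really need all the distances.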
kcw78