I am new to Python, so please forgive my inexperience. I have two datasets, one with 440k rows (file A) and the other with 10k rows (file B). Each file has a pair of latitude and longitude columns. I am trying to find the haversine distance between each coordinate in file A and each coordinate in file B, and then save the results to an output file with the columns lat1, long1, lat2, long2, distance. I looked at the existing questions on for loops, but I didn't quite understand the solutions for avoiding a nested for loop. So I used the following code:

from csv import writer

##### Open a new CSV file and write the header row #####
with open("distance.csv", "w+") as file:
    csv_writer = writer(file)
    row = ['lat1', 'long1', 'lat2', 'long2', 'distance']
    csv_writer.writerow(row)

#### Iterate through each pair of rows and calculate the haversine distance ####
for i in range(len(df1)):
    for j in range(len(df2)):
        distance = haversine(df1.loc[i, "long1"], df1.loc[i, "lat1"], df2.loc[j, "long2"], df2.loc[j, "lat2"])
        with open("distance.csv", "a+") as file:
            csv_writer = writer(file)
            # Write the values in the same order as the header row
            row = [df1.loc[i, "lat1"], df1.loc[i, "long1"], df2.loc[j, "lat2"], df2.loc[j, "long2"], distance]
            csv_writer.writerow(row)

This approach is very time-consuming. Is there a better approach?

James Z
amor
  • You can start by opening the file only once; there is no need to open and close it on every iteration. Open the file first, create your csv_writer, then run the loops. – Copperfield Apr 02 '21 at 02:48
  • Thanks, that reduced the runtime to some extent. However, it is still taking a long time. Is there any other approach? Also, how long should the execution take? – amor Apr 02 '21 at 07:13
  • Do you really need every distance, or just the points within a radius or the closest K? – Willem Hendriks Apr 02 '21 at 07:28
  • I need the coordinates within a radius of 5 km. – amor Apr 02 '21 at 07:30
  • In that case you want to use the sklearn BallTree. See https://stackoverflow.com/questions/63121268/how-can-i-introduce-the-radio-in-query-radius-balltree-sklearn-radians-km/63132760#63132760 for an example. Let me know if you get stuck or have questions. – Willem Hendriks Apr 02 '21 at 09:23
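
For readers following the BallTree suggestion in the comments, here is a minimal sketch of a 5 km radius search. It is not taken from the linked answer. The function name `pairs_within_radius`, the `[lat, lon]` input layout in degrees, and the Earth radius of 6371 km are assumptions made for illustration. scikit-learn's haversine metric works on `[lat, lon]` in radians and returns distances on the unit sphere, so the radius is divided by the Earth radius on the way in and the distances are multiplied by it on the way out.

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def pairs_within_radius(coords_a, coords_b, radius_km=5.0):
    """Return (index_a, index_b, distance_km) for every pair within radius_km.

    coords_a and coords_b are array-likes of shape (n, 2) holding
    [lat, lon] in degrees (hypothetical layout; adapt to your columns).
    """
    a = np.radians(np.asarray(coords_a, dtype=float))
    b = np.radians(np.asarray(coords_b, dtype=float))

    # Build the tree on one dataset (here file B) and query with the other.
    tree = BallTree(b, metric="haversine")

    # The radius must be expressed in radians on the unit sphere.
    idx, dist = tree.query_radius(
        a, r=radius_km / EARTH_RADIUS_KM, return_distance=True
    )

    # idx[i] and dist[i] hold the neighbours of point i of coords_a.
    return [
        (i, int(j), float(d) * EARTH_RADIUS_KM)
        for i, (js, ds) in enumerate(zip(idx, dist))
        for j, d in zip(js, ds)
    ]
```

With the question's data frames this could be called as `pairs_within_radius(df1[["lat1", "long1"]].to_numpy(), df2[["lat2", "long2"]].to_numpy())`, assuming those column names. Only pairs inside the radius are returned, so the full 440k × 10k table is never materialised.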

1 Answer

As already mentioned, the output file only needs to be opened once. The algorithm is O(N×M), so there is a limit to what you can gain, but a few tricks can make it run faster. The simple solution is to create all the needed combinations with itertools.product, split them into chunks, and process the chunks with similar code using multiprocessing. The hard solution is to find another approach to the problem: if some approximation is acceptable, you can try to reduce the dataset by creating clusters of similar coordinates.
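
The "open once, then work through chunks of itertools.product" idea can be sketched as follows. This is an illustration under assumptions, not the code the answer originally pointed to: the helper names `chunks` and `write_distances`, the `(lat, lon)` tuple layout, and the `haversine` implementation (mean Earth radius of 6371 km) are all hypothetical. The sketch stays single-process; each chunk is an independent unit of work that could be handed to a multiprocessing pool instead.

```python
import csv
import math
from itertools import islice, product


def haversine(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two points given in degrees."""
    lon1, lat1, lon2, lat2 = map(math.radians, (lon1, lat1, lon2, lat2))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def write_distances(points_a, points_b, path, chunk_size=100_000):
    """Write lat1, long1, lat2, long2, distance for every pair of points.

    points_a and points_b are sequences of (lat, lon) tuples in degrees.
    """
    # Open the output file once, for the whole run.
    with open(path, "w", newline="") as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow(["lat1", "long1", "lat2", "long2", "distance"])
        # product() yields pairs lazily; only one chunk is held in memory.
        for chunk in chunks(product(points_a, points_b), chunk_size):
            csv_writer.writerows(
                (lat1, lon1, lat2, lon2, haversine(lon1, lat1, lon2, lat2))
                for (lat1, lon1), (lat2, lon2) in chunk
            )
```

Note that 440k × 10k is 4.4 billion pairs, so even with these changes the full output is very large; chunking keeps memory bounded but does not reduce the amount of work.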

Glauco
  • The OP has specified the problem further in the comments and is interested in locations within a radius. For that there are algorithms that avoid the loops. – Willem Hendriks Apr 04 '21 at 19:00