Say I have a database table representing users with potentially millions of records (wishful thinking). This table contains a range of information about each user, including their location:
- City
- County/state, etc.
- Country
- Latitude
- Longitude
- Geohash based on the latitude/longitude values.
I would like to implement a feature whereby a logged-in user can search for other users who are nearby.
Ideally, I would like to grab, say, the 20 users who are geographically closest to the user, followed by the next 20, then the next 20, and so on. So essentially, I want to be able to order my users table by distance from a given point.
Approach 1
I have some previous experience with the haversine formula, which I used to calculate the distance between one point and a few hundred others. This approach would be ideal on a relatively small record set, but I fear it would become incredibly slow with such a large one, since the distance would have to be calculated for every row before the results could be sorted.
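For reference, here is a minimal sketch of what Approach 1 amounts to when done by brute force. The function names (`haversine_km`, `nearest_page`), the dict-based user records, and the mean Earth radius of 6371 km are assumptions made for illustration; a real implementation would more likely run inside the database.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius; an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_page(origin, users, page=0, page_size=20):
    """Brute force: compute every distance, sort, then slice out one page.

    `users` is a hypothetical list of dicts with "lat" and "lon" keys.
    """
    lat0, lon0 = origin
    ranked = sorted(
        users,
        key=lambda u: haversine_km(lat0, lon0, u["lat"], u["lon"]),
    )
    start = page * page_size
    return ranked[start:start + page_size]
```

Every call touches every record and sorts the full set, which is exactly the cost that worries me at millions of rows.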
Approach 2
I've also done some research into geohashing. I understand how the hash is calculated, how it represents a location, and how precision is lost with shorter hashes. I could of course grab the users located near the user's geographical area by selecting users whose geohash starts with the same prefix (based on a precision I specify, and potentially looking in the neighbouring regions), but that doesn't solve the problem of needing to sort by distance. This approach is also not great for edge cases where two users may be very close to one another but lie on either side of the boundary between two regions represented by the geohash.
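To make the boundary problem concrete, here is a small sketch of the standard geohash encoding (alternating longitude and latitude bits, 5 bits per base-32 character), written out by hand rather than taken from any particular library. The names `geohash_encode` and `shared_prefix_len` are hypothetical helpers for this illustration only.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Encode a lat/lon pair (degrees) as a geohash of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def shared_prefix_len(a, b):
    """Number of leading characters two geohashes have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two points a few metres apart, but on opposite sides of the prime meridian.
west = geohash_encode(0.0001, -0.0001)
east = geohash_encode(0.0001, 0.0001)
print(west, east, shared_prefix_len(west, east))
```

The two points are practically neighbours, yet their hashes differ from the very first character, so a pure prefix match would never pair them. That is why the neighbouring cells would have to be searched as well.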
Any ideas or suggestions on the approach would be greatly appreciated. I'm not looking for code in particular, but links to good examples and resources would be helpful.
Thanks, Jonathon
Edit
Approach 3
After some thought, I've come up with another potential solution to consider. Upon receiving each user's location information, I would store information about the location (town/city, area, country, latitude, longitude, geohash maybe) in a separate table (say, locations). I would then connect the user to the location by a foreign key. This would give me a much smaller dataset to work with. To find nearby users, I could then simply find other locations that are close to the user's location and use their IDs to find other users. Perhaps some sort of caching could then be implemented by storing a list of the nearby location IDs for each location.
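A rough sketch of the table layout I have in mind for Approach 3, using Python's built-in `sqlite3` module purely for illustration. The column names, the index, and the `users_in_locations` helper are assumptions; how the list of nearby location IDs is produced (or cached) is left open.

```python
import sqlite3

SCHEMA = """
CREATE TABLE locations (
    id        INTEGER PRIMARY KEY,
    city      TEXT,
    region    TEXT,
    country   TEXT,
    latitude  REAL NOT NULL,
    longitude REAL NOT NULL,
    geohash   TEXT
);
CREATE TABLE users (
    id          INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    location_id INTEGER NOT NULL REFERENCES locations(id)
);
-- Lets the lookup by location avoid scanning the whole users table.
CREATE INDEX idx_users_location ON users(location_id);
"""


def open_db():
    """Create an in-memory database with the hypothetical schema above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def users_in_locations(conn, location_ids):
    """Return (id, name) rows for users attached to any of the given locations."""
    ids = list(location_ids)
    if not ids:
        return []
    placeholders = ",".join("?" * len(ids))
    query = (
        "SELECT id, name FROM users "
        f"WHERE location_id IN ({placeholders}) ORDER BY id"
    )
    return conn.execute(query, ids).fetchall()
```

The distance work would then only have to be done against the (much smaller) locations table, with the resulting IDs passed to something like `users_in_locations`.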