Trying to create a function to join two datasets by closest GPS coordinate

Question

I am trying to merge two datasets that contain GPS coordinates so that I end up with one dataset containing variables from both, and I am trying to write a function to do this. The problem is that the GPS coordinates in the two datasets do not match exactly, so each observation in one dataset has to be matched to the observation in the other dataset with the closest GPS coordinates.

I have had some success with the fuzzyjoin package, but was only able to get partial matching (~75%). With the function below, I was hoping to get a higher degree of matching. One dataset is shorter than the other, so the idea was to use two nested for loops, one over each dataset.

An "anchor" (the distance between the first observations of both datasets) is established; whenever the distance between two points is less than the anchor, that shorter distance becomes the new anchor. The loop continues until the shortest distance is found, and the variables from both datasets are then written to a new dataset, called pairedData here. I should be left with a dataset as long as the shorter input (6314 rows), with data taken from both datasets.

I think the function should work, but rbind() is super slow, and I have been having trouble implementing rbindlist(). Any ideas on how I might achieve this?

combineGPS <- function(harvest, planting) {
  require(sp)
  require(data.table)
  longH <- harvest$long
  latH <- harvest$lat
  longP <- planting$long
  latP <- planting$lat
  rowsH <- nrow(harvest)
  rowsP <- nrow(planting)
  harvestCoords <- cbind(longH, latH)
  harvestPoints <- SpatialPoints(harvestCoords)
  plantingCoords <- cbind(longP, latP)
  plantingPoints <- SpatialPoints(plantingCoords)

  # planting data is shorter than harvest data
  # need to take each row of planting data (6314) and find closest harvest data point (16626), then attach

  anchor <- spDistsN1(plantingPoints[1, ], harvestPoints[1, ], longlat = FALSE)
  pairedData <- data.frame(long = numeric(),
                           lat = numeric(),
                           variety = factor(),
                           seedling_rate = numeric(),
                           seed_spacing = numeric(),
                           speed = numeric(),
                           yield = numeric(),
                           stringsAsFactors = FALSE)

  for (p in 1:rowsP) {
    for (h in 1:rowsH) {
      if (spDistsN1(plantingPoints[p, ], harvestPoints[h, ], longlat = FALSE) <= anchor) {
        anchor <- spDistsN1(plantingPoints[p, ], harvestPoints[h, ], longlat = FALSE)
        pairedData[p, ] <- c(planting[p, ]$long, planting[p, ]$lat, planting[p, ]$variety, planting[p, ]$seedling_rate, planting[p, ]$seed_spacing, planting[p, ]$speed, harvest[h, ]$yield)
      }
    }
  }
  return(pairedData)
}
doesItWork <- combineGPS(harvest, planting)
doesItWork

score 0 · Answer 1 · answered May 17 '16 at 00:58

If I understand your question correctly, you do not need the for loop over the harvest data. The function spDistsN1 returns a numeric vector of the distances from every point in pts to the single point pt. Use your harvest data as pts and one planting point at a time as pt, then find the index of the shortest distance for each planting point. Looping over the planting data only will save a lot of time. Also, do not specify longlat in spDistsN1: your data are SpatialPoints, and the documentation says not to specify it for these objects.

Example Loop:

for (p in 1:rowsP) {
  # Get the distances from the pth planting point to all of the harvest points
  Dists <- spDistsN1(pts = harvestPoints, pt = plantingPoints[p, ])

  # Find the index of the nearest harvest point to p.
  # which.min() returns only the first minimum, so ties cannot produce several rows
  NearestHarvest <- which.min(Dists)

  # Add information to the paired data.
  # Note: c() coerces all values to one common type, so check the variety column afterwards
  pairedData[p, ] <- c(planting[p, ]$long, planting[p, ]$lat, planting[p, ]$variety, planting[p, ]$seedling_rate, planting[p, ]$seed_spacing, planting[p, ]$speed, harvest[NearestHarvest, ]$yield)
}

Let me know if this is what you're looking for.

Also, you can initialize the pairedData data frame with the planting data, and in the for loop only add in the harvest yield data to the pairedData data frame. This will also save you some time in the loop.
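
A minimal base-R sketch of that last suggestion, assuming projected coordinates so that plain Euclidean distance is meaningful. The toy data frames are hypothetical; only the column names long, lat, variety and yield come from the question. pairedData starts as a copy of planting, and only the yield column is filled in from the nearest harvest point.

```r
# Hypothetical toy data; only the column names follow the question.
planting <- data.frame(long = c(0, 10), lat = c(0, 10),
                       variety = factor(c("A", "B")))
harvest <- data.frame(long = c(9, 1, 0.5), lat = c(9, 1, 0.2),
                      yield = c(50, 60, 70))

# For each planting row, the index of the closest harvest row.
nearest_idx <- vapply(seq_len(nrow(planting)), function(p) {
  # Euclidean distance from planting point p to every harvest point
  d <- sqrt((harvest$long - planting$long[p])^2 +
            (harvest$lat - planting$lat[p])^2)
  which.min(d)  # first minimum if there are ties
}, integer(1))

# Start from the planting data and attach only the harvest yield.
pairedData <- planting
pairedData$yield <- harvest$yield[nearest_idx]
```

Because the planting columns are copied as a whole, the factor column keeps its levels, and no rows are appended inside a loop.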

score 0 · Answer 2 · answered Feb 15 '17 at 03:06

You need to map each row in the harvest file (16626 rows) to a row in the planting file (6314 rows), and not the other way around. The image below is the plot of harvest and planting GPS coordinates on an xy plane. The red dots are the harvester points.

The precision farm machine is a multi-row planter and harvester. The GPS device is housed inside the machine, i.e. each GPS point refers to many rows of crop. In this case the planter covers 2X as many rows per trip as the harvester. This explains why the harvest file has ~2X+ data points.

The basic approach would be a brute-force search, as the GPS coordinates don't overlap between the files. I've solved this in R and Python by segmenting the entire field into smaller uniform grid cells and restricting the search to the neighbouring cells. In terms of efficiency, it takes ~3-4 min to solve, and the average distance between matched planting and harvesting points is ~3 metres, which is reasonable.
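
A rough base-R sketch of the grid idea, not the code from the repository. It assumes projected coordinates in columns long and lat; nearest_by_grid and cell_size are hypothetical names. Each query point is compared only with reference points in its own cell and the eight surrounding cells, so the result is exact only when the true nearest neighbour lies within cell_size of the query point.

```r
# query, ref: data frames with columns long and lat (assumed projected).
# Returns, for each query row, the index of a nearby row in ref.
nearest_by_grid <- function(query, ref, cell_size) {
  x0 <- min(ref$long, query$long)
  y0 <- min(ref$lat, query$lat)
  # Integer grid cell of every reference point
  ref_cx <- as.integer(floor((ref$long - x0) / cell_size))
  ref_cy <- as.integer(floor((ref$lat - y0) / cell_size))
  # Look-up table: cell key -> row indices of ref in that cell
  ref_cells <- split(seq_len(nrow(ref)), paste(ref_cx, ref_cy))

  vapply(seq_len(nrow(query)), function(i) {
    cx <- as.integer(floor((query$long[i] - x0) / cell_size))
    cy <- as.integer(floor((query$lat[i] - y0) / cell_size))
    # Keys of the 3 x 3 block of cells around the query point
    keys <- as.vector(outer(cx + -1:1, cy + -1:1, paste))
    cand <- unlist(ref_cells[keys], use.names = FALSE)
    # Empty neighbourhood: fall back to searching every reference point
    if (length(cand) == 0L) cand <- seq_len(nrow(ref))
    d2 <- (ref$long[cand] - query$long[i])^2 +
          (ref$lat[cand] - query$lat[i])^2
    cand[which.min(d2)]
  }, integer(1))
}

# Hypothetical toy data: map each harvest row to a planting row.
planting <- data.frame(long = c(0, 10, 50), lat = c(0, 10, 50))
harvest <- data.frame(long = c(1, 9, 49), lat = c(1, 8, 52))
idx <- nearest_by_grid(harvest, planting, cell_size = 20)
```

Here idx[i] is the planting row matched to harvest row i, so planting[idx, ] can be bound column-wise to harvest. A smaller cell_size means fewer candidates per point but more fallbacks to the full search.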

You can find the code on my GitHub.
