
I have two data sets: the fire data set is huge, and the global temperature data set is quite a bit smaller.

I would like to match the two data sets by DISCOVERY_DATE = date, LATITUDE = Latitude, and LONGITUDE = Longitude. I know most of them will not be an exact match; I am just looking for as close a match as possible. I think fuzzyjoin would be a good way to go about this, but how would one match all three with it?

I think the issue is that I can't seem to find a good function to do this matching:

    tempFire <- fuzzy_join(fires, Temps,
                           multi_by = c("DISCOVERY_DATE" = "date",
                                        "LONGITUDE" = "Longitude",
                                        "LATITUDE" = "Latitude"),
                           multi_match_fun = D,   # D is the function I can't work out
                           mode = "full")
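For reference, here is a rough sketch of the kind of multi_match_fun I imagine (the D above). The 30-day and 50 km thresholds are placeholders, and the dates are converted to numbers first since fuzzy_join hands the join columns to the function as matrices:

    library(fuzzyjoin)
    library(geosphere)

    # Dates must be numeric so the join columns coerce to a numeric matrix
    fires$DISCOVERY_NUM <- as.numeric(fires$DISCOVERY_DATE)
    Temps$date_num      <- as.numeric(Temps$date)

    D <- function(x, y) {
      # x, y: matrices with columns (date, longitude, latitude)
      close_in_time  <- abs(x[, 1] - y[, 1]) <= 30                  # within 30 days
      close_in_space <- distHaversine(x[, 2:3], y[, 2:3]) <= 50000  # within 50 km
      close_in_time & close_in_space
    }

    tempFire <- fuzzy_join(fires, Temps,
                           multi_by = c("DISCOVERY_NUM" = "date_num",
                                        "LONGITUDE"     = "Longitude",
                                        "LATITUDE"      = "Latitude"),
                           multi_match_fun = D,
                           mode = "full")

Is something like this the right shape for multi_match_fun?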

Data

> head(z, n = 10)
   fires.LATITUDE fires.LONGITUDE fires.DISCOVERY_DATE
1        40.03694       -121.0058           1970-01-29
2        38.93306       -120.4044           1970-01-29
3        38.98417       -120.7356           1970-01-29
4        38.55917       -119.9133           1970-01-29
5        38.55917       -119.9331           1970-01-29
6        38.63528       -120.1036           1970-01-29
7        38.68833       -120.1533           1970-01-29
8        40.96806       -122.4339           1970-01-29
9        41.23361       -122.2833           1970-01-29
10       38.54833       -120.1492           1970-01-29
> head(b, n = 10)
   Temps.Latitude Temps.Longitude Temps.date
1           32.95         -100.53 1992-01-01
2           32.95         -100.53 1992-02-01
3           32.95         -100.53 1992-03-01
4           32.95         -100.53 1992-04-01
5           32.95         -100.53 1992-05-01
6           32.95         -100.53 1992-06-01
7           32.95         -100.53 1992-07-01
8           32.95         -100.53 1992-08-01
9           32.95         -100.53 1992-09-01
10          32.95         -100.53 1992-10-01
Clinton Woods
  • Have you looked at [fuzzyjoin](https://github.com/dgrtwo/fuzzyjoin)? I don't think it does "multi-match" of different types in one join, but maybe you could do _e.g._ an interval join first for dates, then a geo join for coordinates (a sketch of this follows these comments). – neilfws Oct 23 '17 at 05:22
  • Yeah, I'm just not sure what function could do all this work. – Clinton Woods Oct 23 '17 at 05:24
  • @ClintonWoods I have no experience with geo data; can you please explain how to transform `32.95N` (from `b`) to a numeric format like `40.03694` in `z`? – pogibas Oct 23 '17 at 08:01
  • Sorry, but I can't tell from your post: are you working with data frames here? If so, maybe `data.table`'s support for non-equi joins is what you're looking for. If your data are already in `sp` objects, you might consider working with the `data` slot alone, then rebuilding the object after merging. – coletl Oct 23 '17 at 15:06
  • @PoGibas I changed the lat and lon to the same format. – Clinton Woods Oct 23 '17 at 17:19
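
A minimal sketch of the two-step approach neilfws suggests, assuming fuzzyjoin's geo_inner_join for the coordinates followed by a filter on the date difference; the 50 km and 30-day cut-offs are placeholder assumptions:

    library(fuzzyjoin)
    library(dplyr)

    matched <- fires %>%
      geo_inner_join(Temps,
                     by = c("LATITUDE" = "Latitude", "LONGITUDE" = "Longitude"),
                     method = "haversine",
                     unit = "km",
                     max_dist = 50,                # within 50 km
                     distance_col = "km_apart") %>%
      filter(abs(as.numeric(DISCOVERY_DATE - date)) <= 30)  # within 30 days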

1 Answer


I would recommend that you come up with an appropriate distance metric based on a weighted combination of temporal distance (i.e. subtracting the dates) and spatial distance (based on lat & long). Determine the weights based on the relative importance of spatial and temporal proximity for your application. Then compute a matrix containing the distance from every point in the first data set to every point in the second data set using this distance metric. Finally, find the minimum distance in each row and/or column to select data points in one dataset that are closest to the points in the other data set. You will probably want to discard any pairs with a distance greater than some threshold.
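
A minimal sketch of this approach; the weights and the cut-off are arbitrary placeholders that would need tuning for your application:

    library(geosphere)

    # Relative importance of a day of separation vs. a km of separation
    w_time  <- 1
    w_space <- 0.1

    # Distance from every fire to every temperature record
    time_dist  <- abs(outer(as.numeric(fires$DISCOVERY_DATE), as.numeric(Temps$date), "-"))
    space_dist <- distm(as.matrix(fires[, c("LONGITUDE", "LATITUDE")]),
                        as.matrix(Temps[, c("Longitude", "Latitude")]),
                        fun = distHaversine) / 1000    # metres -> km

    combined <- w_time * time_dist + w_space * space_dist

    # For each fire, the closest temperature record under the combined metric
    best      <- apply(combined, 1, which.min)
    best_dist <- combined[cbind(seq_len(nrow(combined)), best)]

    # Discard pairs beyond some threshold (placeholder value)
    keep    <- best_dist <= 100
    matches <- data.frame(fire = which(keep), temp = best[keep], dist = best_dist[keep])

Be aware that the combined matrix has one entry per fire/temperature pair, so for very large inputs it may not fit in memory (see the comments below).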

Ryan C. Thompson
  • https://stackoverflow.com/questions/20590119/fuzzy-matching-of-coordinates is similar to what you are saying, but when I tried it just for lat and lon it failed because it could not allocate a 500 GB vector. Maybe there is a more compact way? – Clinton Woods Oct 23 '17 at 21:11
  • It sounds like these two data sets are quite large. Depending on what you're trying to do, you may only need one row or column of the distance matrix at a time. You could write a loop that goes through each of the rows in one data set one by one and compares it with the entire other data set. That will get around the memory problem, but it will obviously take longer to run. – Ryan C. Thompson Oct 24 '17 at 03:52
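
For illustration, a sketch of the row-at-a-time variant described in this last comment, so the full distance matrix never has to exist in memory (same placeholder weights as in the answer's sketch):

    library(geosphere)

    w_time  <- 1      # placeholder weights, as above
    w_space <- 0.1

    temp_coords <- as.matrix(Temps[, c("Longitude", "Latitude")])
    temp_days   <- as.numeric(Temps$date)
    fire_days   <- as.numeric(fires$DISCOVERY_DATE)

    nearest      <- integer(nrow(fires))
    nearest_dist <- numeric(nrow(fires))

    for (i in seq_len(nrow(fires))) {
      # One row of the distance matrix: fire i against every temperature record
      d_space <- distHaversine(c(fires$LONGITUDE[i], fires$LATITUDE[i]), temp_coords) / 1000
      d_time  <- abs(fire_days[i] - temp_days)
      d       <- w_time * d_time + w_space * d_space
      nearest[i]      <- which.min(d)
      nearest_dist[i] <- min(d)
    }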