
I have a fairly large dataset of georeferenced data, about 49 million records. Using the data.table package I've been able to subset it (it was originally 100 million records) and make some simple calculations, such as the mean center of the geographic coordinates, in degrees, for each user.

There are 214,600 unique users, and for each user I need to calculate the great-circle distance (see my sample code) from the coordinates of every one of that user's records to the mean center of those coordinates. That means I need columns V6 and V7 (longitude and latitude, respectively) for the great-circle calculation. V4 is the userID, V3 is the userImageID, and V8 (the column containing 16) is the accuracy of the coordinates. V5 is the time field, which I have already sorted in ascending order (with order).
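
To make the target calculation concrete, here is a minimal base-R sketch of the haversine form of the great-circle distance, applied to the four sample rows shown further down. The function name `haversine_km` is hypothetical, and taking the mean center as the plain arithmetic mean of the degree coordinates is an assumption carried over from the description above.

```r
# Haversine great-circle distance in km between (lon1, lat1) and (lon2, lat2),
# all given in decimal degrees. Vectorised over its arguments.
# `haversine_km` is an illustrative name, not a function from any package.
haversine_km <- function(lon1, lat1, lon2, lat2, R = 6371.0087714) {
  rad  <- pi / 180
  dlat <- (lat2 - lat1) * rad
  dlon <- (lon2 - lon1) * rad
  h <- sin(dlat / 2)^2 + cos(lat1 * rad) * cos(lat2 * rad) * sin(dlon / 2)^2
  2 * R * atan2(sqrt(h), sqrt(1 - h))
}

# Coordinates of the four sample records of the first user
lon <- c(-2.224868, -2.224825, -2.224890, -2.224868)
lat <- c(52.20397, 52.20399, 52.20391, 52.20397)

# Distance (km) from every record to the arithmetic mean of the coordinates
d <- haversine_km(lon, lat, mean(lon), mean(lat))
print(d)
```

Because the function is vectorised, one call handles all of a user's records at once; the remaining question is only how to restrict each call to a single user's rows.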

My problem is that I cannot make the code iterate only over the coordinates that belong to each user, so I end up running out of memory: the first record is run against the other 48 million records, and so on.

I have tried my own formula for the great-circle distance and also functions from the fossil and geosphere packages, with no success.

This is more or less what my table looks like (I'm showing only the columns I'm using). Here is the data for the first user, who has 4 geotagged images.

> subtest
   V1      V2         V3            V4                    V5        V6       V7 V8
1:  1  155229 9468411072 100004812@N06 2006-03-19 13:11:37.0 -2.224868 52.20397 16
2:  2  862398 9468409452 100004812@N06 2006-03-19 13:11:49.0 -2.224825 52.20399 16
3:  3 7931625 9465604241 100004812@N06 2006-03-19 15:12:23.0 -2.224890 52.20391 16
4:  4 7924096 9465627119 100004812@N06 2006-03-19 15:12:49.0 -2.224868 52.20397 16

And my code:

library(data.table)
library(fossil)
library(geosphere)
setwd("E:/MassiveDatasets/LargeDataset")
yahoo2 <- fread("LD.csv", sep = ",", header = FALSE, colClasses="numeric")

a <-yahoo2
mlong <- a[, lapply(.SD, mean), by=V4, .SDcols = 6]
mlat <- a[, lapply(.SD, mean), by=V4, .SDcols = 7]
rad <- pi/180
b1 <- (mlat[,V7] * rad)
b2 <- (mlong[,V6] * rad)
Dist <- function(v) {
  for (i in unique(a[, V3])) {
    a1 <- a[, V7] * rad
    a2 <- a[, V6] * rad
    dlon <- b2 - a2
    dlat <- b1 - a1
    GC <- (sin(dlat/2))^2 + cos(a1) * cos(b1) * (sin(dlon/2))^2
    c <- 2 * atan2(sqrt(GC), sqrt(1 - GC))
    R <- 6371.0087714  # WGS84 mean radius
    d <- R * c
    return(d)
  }
}

rgyr <- a[, lapply(.SD, Dist), by=V4]
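
For comparison, one way the per-user restriction might be expressed is to compute the mean center and the distances inside a single grouped data.table call, so that each record is only ever compared with its own user's center. This is a sketch on a toy table, not a tested solution for the full 49 million rows; the helper name `dist_to_center` and the second user `userB` are hypothetical.

```r
library(data.table)

# Toy table: the four sample rows plus two rows for a made-up second user
dt <- data.table(
  V4 = c(rep("100004812@N06", 4), rep("userB", 2)),
  V6 = c(-2.224868, -2.224825, -2.224890, -2.224868, 10.0, 10.2),
  V7 = c(52.20397, 52.20399, 52.20391, 52.20397, 45.0, 45.0)
)

rad <- pi / 180
R <- 6371.0087714  # mean Earth radius in km, as in the code above

# Distance (km) from each point of ONE group to that group's mean centre.
# `dist_to_center` is an illustrative helper name.
dist_to_center <- function(lon, lat) {
  a1 <- lat * rad
  a2 <- lon * rad
  b1 <- mean(lat) * rad
  b2 <- mean(lon) * rad
  GC <- sin((b1 - a1) / 2)^2 + cos(a1) * cos(b1) * sin((b2 - a2) / 2)^2
  R * 2 * atan2(sqrt(GC), sqrt(1 - GC))
}

# by = V4 hands the function only the rows of one user at a time;
# := adds the result as a new column by reference, without copying the table
dt[, d := dist_to_center(V6, V7), by = V4]
print(dt)
```

The function takes the coordinate vectors as arguments instead of reading the global table, which is what keeps every intermediate vector as short as a single user's record count.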

Thank you very much in advance for your answers!

  • The distance from each user to what? – Frank Nov 21 '14 at 23:14
  • does `Dist` work on its own? the parameter v is not used inside the function. also, does it return the expected number of outputs? – rawr Nov 21 '14 at 23:14
  • Also, you know that you can name your columns, right? See the `setnames` function. – Frank Nov 21 '14 at 23:14
  • You have a `return` statement inside your `for` loop. That means it will only do the first iteration of the loop, and then return and exit. – Frank Nov 21 '14 at 23:15
  • Hey Frank and rawr, thanks for the comments. I haven't named the columns because I understood that data.table doesn't like that. The distance is from every geographic coordinate of each user to that user's mean center, computed over all the geographic coordinates they have recorded. Thank you for the suggestions! – user4280345 Nov 23 '14 at 16:41
  • FYI data.table actually encourages naming columns, and referring to them by names not by their order. – Merik Sep 19 '15 at 03:10
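
The `setnames` suggestion from the comments could look roughly like the sketch below. The new column names (`user_id`, `lon`, `lat`) and the toy values are hypothetical and chosen only for illustration.

```r
library(data.table)

# Toy table with the default V-style names used in the question
dt <- data.table(V4 = c("u1", "u1", "u2"),
                 V6 = c(-2.22, -2.23, 10.0),
                 V7 = c(52.20, 52.21, 45.0))

# Rename in place (by reference); the new names are illustrative only
setnames(dt, old = c("V4", "V6", "V7"), new = c("user_id", "lon", "lat"))

# Per-user mean centre, now referring to columns by name
centers <- dt[, .(mlon = mean(lon), mlat = mean(lat)), by = user_id]
print(centers)
```

Named columns also remove the need for positional arguments such as `.SDcols = 6`, which silently break if the column order changes.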
