
I am trying to follow the steps in http://flowingdata.com/2011/05/11/how-to-map-connections-with-great-circles/ using data.table, in particular step 8 listed there. Below are my steps and the problem I'm running into:

library(data.table)
library(maps)
library(geosphere)
airports <- as.data.table(read.csv("http://datasets.flowingdata.com/tuts/maparcs/airports.csv", header=TRUE))
flights <- as.data.table(read.csv("http://datasets.flowingdata.com/tuts/maparcs/flights.csv", header=TRUE, as.is=TRUE))

setnames(airports,c("airport1",names(airports)[2:7]))
setkey(flights,airport1)
setkey(airports,airport1)
ap <- merge(flights,airports)
setkey(ap,airport2)
setnames(airports,c("airport2",names(airports)[2:7]))
setkey(airports,airport2)
setkey(ap,airport2)
ap2 <- merge(ap,airports)
ap3 <- ap2[,.(airport1,airport2,airline,cnt,lat.x,long.x,lat.y,long.y)]
## ap3[,inter:=gcIntermediate(c(long.x,lat.x),c(long.y,lat.y),n=100,addStartEnd=TRUE),]  ## Error in .pointsToMatrix(p1) : Wrong length for a vector, should be 2
## 
## Tried some more stuff but no luck!
## fn <- function(lonx,latx,lony,laty) gcIntermediate(c(lonx,latx),c(lony,laty),n=100,addStartEnd=TRUE)
## ap3[,do.call(fn,.SD),.SDcols=5:8] ## Error in (function (lonx, latx, lony, laty)  : unused arguments (lat.x = c(35.21401111, 35.2140 ... snip ...

So I searched Stack Overflow and tried the steps listed in [1] and [2], but couldn't get them to work. I remember reading somewhere (I cannot find it now) that data.table can store lists, but I cannot figure out how. Also, is there some way to debug functions in j apart from what's listed in Section 2.9 of the FAQ?

[1] efficient row-wise operations on a data.table

[2] Applying a function to each row of a data.table

Vijay
  • It's nice that this is reproducible, but do you really need us to install those packages? Seems like a lot of complexity for a fairly simple question (how to use list columns). – Frank Jun 26 '15 at 19:18
  • No. But then I don't know how to express the issue I'm running into, sorry. If I could just find out how to capture a list/matrix of differing lengths/rows (returned from a function...not creating one manually) into a data.table column that will work. – Vijay Jun 26 '15 at 20:08

2 Answers


Suppose you have a function that returns a matrix of unknown size. You can assign the result in a data.table with a list column:

# example data
set.seed(42)
DT <- data.table(id=1:3)[,.(v=sample(letters,sample(5,1))),by=id]

# example function
myfun = function(x) matrix(x, ncol= if(length(x)%%2) 1 else 2 )

# usage 
res <- DT[,.(vlist = list(myfun(v))),by=id]
#    id     vlist
# 1:  1 y,h,t,o,l
# 2:  2   d,q,y,k
# 3:  3   y,g,l,v

This may not look like a column of matrices, but you can see that it is:

str(res$vlist)
# List of 3
#  $ : chr [1:5, 1] "y" "h" "t" "o" ...
#  $ : chr [1:2, 1:2] "d" "q" "y" "k"
#  $ : chr [1:2, 1:2] "y" "g" "l" "v"

res$vlist[[2]]
#      [,1] [,2]
# [1,] "d"  "y" 
# [2,] "q"  "k" 

(I'm not sure if this is what you're after, as I didn't go through the linked blog post.)
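
As a minimal sketch of how this could carry over to a per-row function like the one in the question: the table `routes` and the helper `interp` below are hypothetical stand-ins (plain linear interpolation, not `gcIntermediate`), chosen so the example runs without the mapping packages. It assigns the list column by reference with `:=`, wrapping the matrix in `list(list(...))` so that each group receives a single list element.

```r
library(data.table)

# hypothetical toy data: one route per row, with a different point count per route
routes <- data.table(
  id   = 1:3,
  lon1 = c(0, 10, 20), lat1 = c(0, 5, 10),
  lon2 = c(10, 20, 30), lat2 = c(5, 10, 15),
  n    = c(1L, 2L, 3L)
)

# hypothetical stand-in for gcIntermediate(..., addStartEnd=TRUE):
# returns an (n + 2) x 2 matrix of linearly interpolated points
interp <- function(lon1, lat1, lon2, lat2, n) {
  frac <- seq(0, 1, length.out = n + 2)
  cbind(lon = lon1 + frac * (lon2 - lon1),
        lat = lat1 + frac * (lat2 - lat1))
}

# the outer list() is the list of columns, the inner list() is the cell value
routes[, path := list(list(interp(lon1, lat1, lon2, lat2, n))), by = id]

# each cell of `path` now holds a matrix with its own number of rows
path_rows <- sapply(routes$path, nrow)
```

Here `routes$path[[2]]` would give the matrix for the second route, in the same way as `res$vlist[[2]]` above.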

Frank
  • Thanks, Frank. This is exactly what I was looking for. I wasn't aware that `.()` is needed to make this work but apparently it is. I was trying this without the `.()` and that's where I kept getting strange errors. – Vijay Jun 27 '15 at 00:20

This should really be a comment, but it doesn't fit there. For each pair of points p1 = c(long.x,lat.x) and p2 = c(long.y,lat.y), gcIntermediate returns a matrix (or a list; I focus on the matrix case here), and the dimensions of that matrix depend on the values of n and addStartEnd. For example, with n=1 and addStartEnd=FALSE it returns a 1 by 2 matrix, and with n=1 and addStartEnd=TRUE it returns a 3 by 2 matrix. With a data.table operation like yours, you can't simply append those values as a column. I am not a data.table expert, but I think the right way is to do a row-wise operation and then use rbindlist, e.g.:

apt<-setDT(ap3)

tt <- rbindlist(lapply(1:nrow(apt), function(i) {
  # great-circle points for row i, bound to that row's original columns
  cbind(apt[i, ],
        gcIntermediate(apt[i, c("long.x","lat.x")],
                       apt[i, c("long.y","lat.y")],
                       n=100, addStartEnd=TRUE))
}))

> tt
        airport1 airport2 airline cnt    lat.x     long.x    lat.y    long.y        lon      lat
     1:      CLT      ABE     all  56 35.21401  -80.94313 40.65236  -75.4404  -80.94313 35.21401
     2:      CLT      ABE     all  56 35.21401  -80.94313 40.65236  -75.4404  -80.89245 35.26904
     3:      CLT      ABE     all  56 35.21401  -80.94313 40.65236  -75.4404  -80.84171 35.32405
     4:      CLT      ABE     all  56 35.21401  -80.94313 40.65236  -75.4404  -80.79090 35.37904
     5:      CLT      ABE     all  56 35.21401  -80.94313 40.65236  -75.4404  -80.74002 35.43401
    ---                                                                                         
510710:      PHX      YUM      YV 328 33.43417 -112.00806 32.65658 -114.6060 -114.50396 32.68840
510711:      PHX      YUM      YV 328 33.43417 -112.00806 32.65658 -114.6060 -114.52947 32.68045
510712:      PHX      YUM      YV 328 33.43417 -112.00806 32.65658 -114.6060 -114.55498 32.67250
510713:      PHX      YUM      YV 328 33.43417 -112.00806 32.65658 -114.6060 -114.58048 32.66454
510714:      PHX      YUM      YV 328 33.43417 -112.00806 32.65658 -114.6060 -114.60597 32.65658
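
Why a row-wise step is needed at all can be seen on a toy table (the table `toy` and its columns are hypothetical, not from the flights data). Without `by`, the column names in `j` refer to whole columns, so `c(x, y)` has one entry per row for each column rather than the length-2 point that `gcIntermediate` expects. This is a sketch of the likely cause of the "Wrong length for a vector, should be 2" error in the question.

```r
library(data.table)

# hypothetical toy table with three rows
toy <- data.table(x = c(1, 2, 3), y = c(10, 20, 30))

# without by: x and y are entire columns, so c(x, y) has length 6
len_whole <- toy[, length(c(x, y))]

# with by = 1:nrow(toy): j is evaluated once per row, so c(x, y) has length 2
len_by_row <- toy[, .(len = length(c(x, y))), by = 1:nrow(toy)]$len
```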

Following @Frank's suggestion, you can do this using only data.table operations (where 102 = 100 (n) + 2 (addStartEnd=TRUE)):

ap3[,gcIntermediate(c(long.x,lat.x),c(long.y,lat.y),n=100,addStartEnd=TRUE),by=1:nrow(ap3)][,list(lon=head(V1,102),lat=tail(V1,102)),by=nrow]
        nrow        lon      lat
     1:    1  -80.94313 35.21401
     2:    1  -80.89245 35.26904
     3:    1  -80.84171 35.32405
     4:    1  -80.79090 35.37904
     5:    1  -80.74002 35.43401
    ---                         
510710: 5007 -114.50396 32.68840
510711: 5007 -114.52947 32.68045
510712: 5007 -114.55498 32.67250
510713: 5007 -114.58048 32.66454
510714: 5007 -114.60597 32.65658
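
A possible alternative to the `head`/`tail` step, not taken from this thread, is to convert the matrix to a data.table inside `j`, so that its two columns are kept instead of being collapsed into a single `V1`. The sketch below uses a hypothetical table `routes` and a hypothetical linear-interpolation helper `interp` in place of `gcIntermediate`, so that it runs without geosphere. Whether it is faster or slower on the full flights data is untested.

```r
library(data.table)

# hypothetical toy data: one route per row
routes <- data.table(
  id   = 1:3,
  lon1 = c(0, 10, 20), lat1 = c(0, 5, 10),
  lon2 = c(10, 20, 30), lat2 = c(5, 10, 15),
  n    = c(1L, 2L, 3L)
)

# hypothetical stand-in for gcIntermediate(..., addStartEnd=TRUE):
# returns an (n + 2) x 2 matrix with columns lon and lat
interp <- function(lon1, lat1, lon2, lat2, n) {
  frac <- seq(0, 1, length.out = n + 2)
  cbind(lon = lon1 + frac * (lon2 - lon1),
        lat = lat1 + frac * (lat2 - lat1))
}

# as.data.table() turns each matrix into a two-column table;
# the per-group tables are then stacked, one block of rows per id
long <- routes[, as.data.table(interp(lon1, lat1, lon2, lat2, n)), by = id]
```

With this form the number of points per route does not have to be hard-coded (the `102` above), since each group contributes however many rows its matrix has.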
user227710
  • Thanks @Frank. Corrected now. – user227710 Jun 26 '15 at 20:30
  • The code looks correct, but you might want to note that `setDT` works by reference, so you've actually modified `ap3` (and so don't need to assign the result to a new object). Also, it looks like your code could take the form `apt[,some_thing,by=1:nrow(apt)]`, which would be more idiomatic for `data.table` than the `rbindlist`+`lapply` approach. – Frank Jun 26 '15 at 20:42
  • @Frank: It creates a column with 1021428 rows. Do you have any idea how to split it into two columns? – user227710 Jun 26 '15 at 20:59
  • Maybe `as.list(myfun(...))`? I'm not sure. – Frank Jun 26 '15 at 21:05
  • @Frank: I have incorporated your comment in the data.table solution. Hope that is what you are referring to. – user227710 Jun 26 '15 at 21:37
  • Ok cool. Not sure where `by=nrow` comes from. Isn't `nrow` a function? – Frank Jun 26 '15 at 21:45
  • `nrow` is a new column generated as a result of the first operation. Then in the second operation we select the first 102 rows of column V1 for each value of nrow (1, 2, ...) and assign them to the lon column, and the last 102 rows of V1 for each nrow go to the lat column. The second operation becomes clear once you omit it and look at the output of the first operation alone. – user227710 Jun 26 '15 at 22:12