
From a very simple dataframe like

    time1 <- as.Date("2010/10/10")
    time2 <- as.Date("2010/10/11")
    time3 <- as.Date("2010/10/12")
    test <- data.frame(Sample=c("A","B", "C"), Date=c(time1, time2, time3))

how can I obtain a matrix of pairwise temporal distances (elapsed time in days) between the samples A, B and C?

       A  B  C
    A  0  1  2
    B  1  0  1
    C  2  1  0

Edit: changed the format of the dates. Sorry for the inconvenience.

rafa.pereira
nouse
  • @ZheyuanLi Write an answer, then you can also properly format your code. – Konrad Rudolph Jun 22 '16 at 12:37
  • In general the solution to this kind of problem in R is the `dist` function. In your case, `dist(test$Date)` “works” more or less; however, `dist` doesn’t know about time, so the results are plain numbers, not `difftime` objects, which may be a problem. For that reason, the comment above by Zheyuan shows a better answer. – Konrad Rudolph Jun 22 '16 at 12:47

4 Answers


To get the differences in actual days, you can convert each date to the number of days elapsed since some predefined date and then use dist. Example below (I changed your dates, since I doubt they were parsed the way you expected):

    # Dates one year apart (parsed as month/day/two-digit year)
    time1 <- as.Date("02/10/10", "%m/%d/%y")
    time2 <- as.Date("02/10/11", "%m/%d/%y")
    time3 <- as.Date("02/10/12", "%m/%d/%y")
    test <- data.frame(Sample = c("A", "B", "C"), Date = c(time1, time2, time3))
    # Time elapsed since 1 January 2010
    days_s2010 <- difftime(test$Date, as.Date("01/01/10", "%m/%d/%y"))
    # Pairwise distances as a full matrix, labelled by sample
    dist_days <- as.matrix(dist(days_s2010, diag = TRUE, upper = TRUE))
    rownames(dist_days) <- test$Sample
    colnames(dist_days) <- test$Sample

dist_days then prints out:

    > dist_days
        A   B   C
    A   0 365 730
    B 365   0 365
    C 730 365   0

Actually, you don't need to convert the dates to days since some reference date before calling dist; simply doing dist(test$Date) gives the differences in days.
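
As a minimal sketch of that shortcut on the dates from the edited question: the version below converts the dates with as.numeric() first, which is my own cautious choice rather than something dist requires, and the name days is only illustrative.

```r
# Data from the edited question, rebuilt so the snippet runs on its own
test <- data.frame(Sample = c("A", "B", "C"),
                   Date = as.Date(c("2010-10-10", "2010-10-11", "2010-10-12")))

# Days since the epoch as a named numeric vector (explicit conversion is an assumption
# made for safety; the names become the row/column labels of the result)
days <- as.numeric(test$Date)
names(days) <- as.character(test$Sample)

# as.matrix() on a dist object returns the full symmetric matrix with a zero diagonal
dist_days <- as.matrix(dist(days))
dist_days
```

The result is a plain numeric matrix, so the "days" unit is implied rather than stored.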

Andy W
  • The OP edited the dates when I was writing this answer. In the updated dates simply doing `dist(test$Date)` gives the answer. The way the dates were formatted previously I thought they should be different years though. – Andy W Jun 22 '16 at 12:58
  • I didn't know dist() was good for this format as well! Thank you so far. – nouse Jun 22 '16 at 13:04

Using outer()

You don't need to work with a data frame. In your example, we can collect your dates in a single vector and use outer()

    x <- c(time1, time2, time3)
    abs(outer(x, x, "-"))

         [,1] [,2] [,3]
    [1,]    0    1    2
    [2,]    1    0    1
    [3,]    2    1    0

Note that I have wrapped the call in abs(), so that you only get non-negative time differences, i.e., "today - yesterday" and "yesterday - today" are both 1.

If your data are pre-stored in a data frame, you can extract that column as a vector and then proceed.
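
For instance, a minimal sketch assuming the data frame test from the question (rebuilt here so the snippet is self-contained). The names days and m are only illustrative, and converting to numeric before outer() is my own choice to get a plain numeric matrix:

```r
# Data from the edited question
test <- data.frame(Sample = c("A", "B", "C"),
                   Date = as.Date(c("2010-10-10", "2010-10-11", "2010-10-12")))

# Extract the date column as a numeric vector of days
days <- as.numeric(test$Date)

# Absolute pairwise differences, then label rows and columns by sample
m <- abs(outer(days, days, "-"))
dimnames(m) <- list(as.character(test$Sample), as.character(test$Sample))
m
```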

Using dist()

As Konrad mentioned, dist() is the usual tool for computing a distance matrix. Its greatest advantage is that it computes only one triangle of the matrix (the diagonal is 0) and fills in the rest by symmetry. outer(), on the other hand, computes every matrix element, because it knows nothing about the symmetry.

However, dist() takes numerical vectors, and only computes some classes of distance. See ?dist

Arguments:

       x: a numeric matrix, data frame or ‘"dist"’ object.

  method: the distance measure to be used. This must be one of
          ‘"euclidean"’, ‘"maximum"’, ‘"manhattan"’, ‘"canberra"’,
          ‘"binary"’ or ‘"minkowski"’.  Any unambiguous substring can
          be given.

But we can work around this limitation and still use it.

A Date object can be coerced to a number of days once you fix an origin. With

    x <- as.numeric(x - min(x))

we get the number of days since the first day on record. Now we can use dist() with the default Euclidean distance:

    y <- as.matrix(dist(x, diag = TRUE, upper = TRUE))
    rownames(y) <- colnames(y) <- c("A", "B", "C")

      A B C
    A 0 1 2
    B 1 0 1
    C 2 1 0

Why I put outer() as my first example

In principle, a time difference is signed. If the sign matters,

    outer(x, x, "-")

is more appropriate. I added the abs() later, because it seems that you intentionally want non-negative results.
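
To make the sign convention concrete, here is a small sketch on the question's dates. The names dates, d and signed are illustrative, and naming the vector so that outer() labels the result is my own addition:

```r
# Dates from the edited question as a named numeric vector of days
dates <- as.Date(c("2010-10-10", "2010-10-11", "2010-10-12"))
d <- as.numeric(dates)
names(d) <- c("A", "B", "C")

# signed[i, j] is date_i minus date_j, in days; outer() reuses the names as dimnames
signed <- outer(d, d, "-")
signed
```

Here signed["C", "A"] is 2 and signed["A", "C"] is -2, so the matrix is antisymmetric rather than symmetric.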

Also, outer() has far broader use than dist(). Have a look at my answer here. That question asks how to compute the Hamming distance, which is really a kind of bitwise distance.

Zheyuan Li

A really fast solution using a data.table approach in two steps:

    # load libraries
    library(reshape)
    library(data.table)

    # 1. Get all possible combinations of pairs of dates in long format
    df <- expand.grid.df(test, test)
    colnames(df) <- c("Sample", "Date", "Sample2", "Date2")

    # 2. Calculate distances in days, weeks, hours, minutes, etc.
    setDT(df)[, datedist := difftime(Date2, Date, units = "days")]

    df
    #>    Sample       Date Sample2      Date2 datedist
    #> 1:      A 2010-10-10       A 2010-10-10   0 days
    #> 2:      B 2010-10-11       A 2010-10-10  -1 days
    #> 3:      C 2010-10-12       A 2010-10-10  -2 days
    #> 4:      A 2010-10-10       B 2010-10-11   1 days
    #> 5:      B 2010-10-11       B 2010-10-11   0 days
    #> 6:      C 2010-10-12       B 2010-10-11  -1 days
    #> 7:      A 2010-10-10       C 2010-10-12   2 days
    #> 8:      B 2010-10-11       C 2010-10-12   1 days
    #> 9:      C 2010-10-12       C 2010-10-12   0 days
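
The long table holds signed differences, while the question asks for a square matrix. As a rough sketch of how to get from one to the other, the snippet below rebuilds the same long layout in base R (swapping in expand.grid() for expand.grid.df() so it runs without extra packages) and folds it into a matrix. The names idx, long and wide are illustrative only.

```r
# Data from the edited question
test <- data.frame(Sample = c("A", "B", "C"),
                   Date = as.Date(c("2010-10-10", "2010-10-11", "2010-10-12")))
n <- nrow(test)

# All index pairs; expand.grid() varies the first column (i) fastest
idx <- expand.grid(i = seq_len(n), j = seq_len(n))

# Long format: one row per pair, signed difference Date2 - Date in days
long <- data.frame(
  Sample   = test$Sample[idx$i],
  Sample2  = test$Sample[idx$j],
  datedist = as.numeric(difftime(test$Date[idx$j], test$Date[idx$i], units = "days"))
)

# Wide format: matrix() fills column by column, which matches the pair order above
labels <- as.character(test$Sample)
wide <- matrix(abs(long$datedist), nrow = n, dimnames = list(labels, labels))
wide
```

Dropping the abs() keeps the sign, with wide[i, j] equal to the date of sample j minus the date of sample i.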
rafa.pereira

Here is a method that uses combn and matrix indexing.

    # data
    Sample <- c("A", "B", "C")
    Date <- as.Date(c("02/10/10", "02/10/11", "02/10/12"), format = "%y/%m/%d")
    # build a matrix to be filled
    myMat <- matrix(0, length(Sample), length(Sample), dimnames = list(Sample, Sample))

    # get all pairwise combinations (upper triangle)
    samplePairs <- t(combn(Sample, 2))
    # add the reverse combinations (lower triangle)
    samplePairs <- rbind(samplePairs, cbind(samplePairs[, 2], samplePairs[, 1]))
    # calculate differences, one per pair, in the same order as combn(Sample, 2)
    diffs <- combn(Date, 2, FUN = diff)

    # fill in differences using matrix indexing;
    # diffs is recycled, so both triangles receive the same values
    myMat[samplePairs] <- diffs
lmo