Using R, what is the best way to read a symmetric matrix from a file that omits the upper triangular part? For example:

1.000
.505  1.000
.569  .422  1.000
.602  .467  .926  1.000
.621  .482  .877  .874  1.000
.603  .450  .878  .894  .937  1.000

I have tried read.table, but haven't been successful.

IRTFM
Student

5 Answers

Here's a read.table solution that needs no loops and no *apply calls:

txt <- "1.000
.505  1.000
.569  .422  1.000
.602  .467  .926  1.000
.621  .482  .877  .874  1.000
.603  .450  .878  .894  .937  1.000"
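# Could use the clipboard or read this from a file as well.
# paste0 (not paste) gives the names V1..V6 without an embedded space.
mat <- data.matrix( read.table(text=txt, fill=TRUE, col.names=paste0("V", 1:6))  )
# Copy the lower triangle into the NA-filled upper triangle.
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
mat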
        V1    V2    V3    V4    V5    V6 
[1,] 1.000 0.505 0.569 0.602 0.621 0.603
[2,] 0.505 1.000 0.422 0.467 0.482 0.450
[3,] 0.569 0.422 1.000 0.926 0.877 0.878
[4,] 0.602 0.467 0.926 1.000 0.874 0.894
[5,] 0.621 0.482 0.877 0.874 1.000 0.937
[6,] 0.603 0.450 0.878 0.894 0.937 1.000
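
If the dimension isn't known in advance, one option is to count the fields on each line first. This is only a sketch, assuming whitespace-separated values and that the last (longest) line holds a full row; count.fields() is from the utils package, and the variable names are just for illustration:

```r
txt <- "1.000
.505  1.000
.569  .422  1.000
.602  .467  .926  1.000
.621  .482  .877  .874  1.000
.603  .450  .878  .894  .937  1.000"

# Count whitespace-separated fields per line; the longest line gives the dimension.
con <- textConnection(txt)
n <- max(count.fields(con))
close(con)

mat <- data.matrix(read.table(text = txt, fill = TRUE,
                              col.names = paste0("V", seq_len(n))))
# Mirror the lower triangle into the NA-filled upper triangle.
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
```

For a file on disk, count.fields("yourfile.txt") should work the same way without the explicit connection.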
IRTFM

I copied your text, and then used tt <- file('clipboard','rt') to import it. For a standard file:

tt <- file("yourfile.txt",'rt')
a <- readLines(tt)
close(tt)
b <- strsplit(a,"  ") #insert delimiter here; can use regex
n <- max(lengths(b)) #length of the longest row
b <- lapply(b,function(x) {
  x <- as.numeric(x)
  length(x) <- n #pad short rows with NA
  return(x)
})
b <- do.call(rbind,b)
b[is.na(b)] <- 0
#kinda kludgy way to get the symmetric matrix:
#b + t(b) counts the diagonal twice, so subtract it once
b <- b + t(b) - diag(diag(b))
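
To see why the diagonal has to be subtracted, here is a small sketch of the add-the-transpose trick on a made-up 3x3 matrix (the values are invented for illustration). Subtracting the actual diagonal, rather than a constant, should also cover matrices whose diagonal entries differ:

```r
# Lower-triangular toy matrix with zeros above the diagonal (values are made up).
L <- rbind(c(2.0, 0.0, 0.0),
           c(0.5, 3.0, 0.0),
           c(0.1, 0.7, 4.0))

# L + t(L) doubles the diagonal, so subtract the diagonal once.
S <- L + t(L) - diag(diag(L))
```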
Blue Magister

I'm posting this, but I like Blue Magister's approach way better. Maybe there's something in it that's of use.

mat <- readLines(n=6)
1.000
.505  1.000
.569  .422  1.000
.602  .467  .926  1.000
.621  .482  .877  .874  1.000
.603  .450  .878  .894  .937  1.000

nmat <- lapply(mat, function(x) unlist(strsplit(x, "\\s+")))
lens <- sapply(nmat, length)
dlen <- max(lens) -lens
bmat <- lapply(seq_along(nmat), function(i) {
    as.numeric(c(nmat[[i]], rep(NA, dlen[i])))
})
mat <- do.call(rbind, bmat)
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
mat
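
The same idea can be wrapped in a function that reads from a file instead of the console. This is a rough sketch, not a tested utility: the name read_lower_tri is made up, and it assumes whitespace-separated values with exactly i values on row i:

```r
# Hypothetical helper (the name is not from any package).
read_lower_tri <- function(path) {
  rows <- strsplit(trimws(readLines(path)), "\\s+")
  rows <- rows[lengths(rows) > 0]                 # drop blank lines
  n <- length(rows)
  stopifnot(all(lengths(rows) == seq_len(n)))     # row i must hold i values
  # Pad each row with NA to length n; vapply returns rows as columns, so transpose.
  mat <- t(vapply(rows,
                  function(x) as.numeric(c(x, rep(NA, n - length(x)))),
                  numeric(n)))
  # Mirror the lower triangle into the upper triangle.
  mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
  mat
}

# Usage with a temporary file holding the first three rows of the example.
tmp <- tempfile(fileext = ".txt")
writeLines(c("1.000", ".505  1.000", ".569  .422  1.000"), tmp)
m <- read_lower_tri(tmp)
unlink(tmp)
```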
Tyler Rinker

Here is an approach which also works if the dimensions of the matrix are unknown.

# read file as a vector
mat <- scan("file.txt", what = numeric())

# calculate the number of columns (and rows)
ncol <- (sqrt(8 * length(mat) + 1) - 1) / 2

# index of the diagonal values
diag_idx <- cumsum(seq.int(ncol))

# generate split index
split_idx <- cummax(sequence(seq.int(ncol)))
split_idx[diag_idx] <- split_idx[diag_idx] - 1

# split vector into list of rows
splitted_rows <- split(mat, f = split_idx)

# generate matrix
mat_full <- suppressWarnings(do.call(rbind, splitted_rows))
mat_full[upper.tri(mat_full)] <- t(mat_full)[upper.tri(mat_full)]
mat_full

   [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
0 1.000 0.505 0.569 0.602 0.621 0.603
1 0.505 1.000 0.422 0.467 0.482 0.450
2 0.569 0.422 1.000 0.926 0.877 0.878
3 0.602 0.467 0.926 1.000 0.874 0.894
4 0.621 0.482 0.877 0.874 1.000 0.937
5 0.603 0.450 0.878 0.894 0.937 1.000
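
A possibly simpler variant of the same scan-based idea, offered as a sketch rather than a drop-in replacement: since R fills matrices column by column, the row-wise lower triangle from the file has the same order as the column-wise upper triangle (including the diagonal), so the values can be assigned there directly and then mirrored. The dimension comes from solving length = n(n+1)/2 for n:

```r
txt <- "1.000
.505  1.000
.569  .422  1.000
.602  .467  .926  1.000
.621  .482  .877  .874  1.000
.603  .450  .878  .894  .937  1.000"

# Read everything as one numeric vector (use scan("file.txt") for a file).
vals <- scan(text = txt, quiet = TRUE)

# length(vals) = n * (n + 1) / 2, so n = (sqrt(8 * length + 1) - 1) / 2.
n <- (sqrt(8 * length(vals) + 1) - 1) / 2
stopifnot(n == round(n))  # otherwise the input is not a full triangle

m <- matrix(NA_real_, n, n)
# Column-major upper triangle has the same order as the row-major lower triangle.
m[upper.tri(m, diag = TRUE)] <- vals
# Mirror the upper triangle into the lower triangle.
m[lower.tri(m)] <- t(m)[lower.tri(m)]
```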
Sven Hohenstein
  • Sven, thanks. One thing though: since it's a symmetric matrix, nrow = ncol. One could find the dimensions by using something like R.utils::countLines(). – Student Dec 13 '12 at 19:53
  • @Student I know it's a symmetric matrix. That's why only one value is calculated. Furthermore, `countLines` requires opening the file again. – Sven Hohenstein Dec 14 '12 at 06:42

This won't work in the OP's case because the diagonal was 1, but if the diagonal is zero or missing, you can apply as.dist and then as.matrix to copy the lower triangle to the upper triangle and set the diagonal to zero:

input=" Pop0    Pop1    Pop2
Pop0
Pop1    0.015
Pop2    0.079   0.083
Pop3    0.014   0.016   0.073"

as.matrix(as.dist(cbind(read.table(text=input,fill=TRUE),NA)))

Result:

      Pop0  Pop1  Pop2  Pop3
Pop0 0.000 0.015 0.079 0.014
Pop1 0.015 0.000 0.083 0.016
Pop2 0.079 0.083 0.000 0.073
Pop3 0.014 0.016 0.073 0.000

In my case the input had column names, so read.table(fill=TRUE) was automatically able to determine the number of columns, and IRTFM's trick of specifying col.names was not needed.
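
For a matrix like the OP's, one way to adapt this might be to restore the diagonal afterwards. The sketch below assumes the input is a correlation matrix, so every diagonal entry is 1; with any other diagonal this would overwrite real values:

```r
txt <- "1.000
.505  1.000
.569  .422  1.000
.602  .467  .926  1.000
.621  .482  .877  .874  1.000
.603  .450  .878  .894  .937  1.000"

low <- data.matrix(read.table(text = txt, fill = TRUE,
                              col.names = paste0("V", 1:6)))
# as.dist keeps only the strict lower triangle; as.matrix mirrors it
# and puts zeros on the diagonal.
m <- as.matrix(as.dist(low))
# Assumption: correlation matrix, so the diagonal is 1.
diag(m) <- 1
```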

nisetama