
I am fairly new to R and have recently been using data.table heavily for a project that manipulates large data sets, specifically genome data. One of the columns is the chromosome name, formatted as "chr_", where the _ is 1-22, X, or Y. Because the data is sorted by chromosomal position, this column is a natural primary key. However, setting it as the key sorts the values in lexicographic order rather than natural numeric order (i.e. 1,10,11,...,19,2,20,...,X,Y rather than 1,2,...,9,10,11,...,19,20,...,X,Y).

I looked at the documentation for factor(), which has an ordered argument that makes the factor levels ordered. However, I do not know how to specify that the chromosome column should be read as an ordered factor. The only related options I found are stringsAsFactors (which would convert all string columns to factors, and that would be highly inefficient given the number of non-unique strings in other columns) and colClasses, for which I don't know of any way to cast a column to an ordered factor.

Does anyone know of an implementation of implicitly ordered factors for fread(), or any efficient method for data.table to convert a character column to an ordered factor?

NOTE:

I am mainly looking for the most efficient implementations, preferably ones that directly cast the column to an ordered factor during the read itself.

Brian Tompsett - 汤莱恩
archaephyrryx

2 Answers


From the description, it seems like mixedorder/mixedsort from the gtools package might help:

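 library(gtools)  # provides mixedorder() and mixedsort()

 set.seed(42)
 # Example data: the 24 chromosome names in random order
 dat <- data.frame(chrN = sample(paste0("chr", c(1:22, "X", "Y")), 24, replace = FALSE),
                   value = rnorm(24),
                   stringsAsFactors = FALSE)

 # Sort the rows in natural (mixed alphanumeric) order
 dat[mixedorder(dat[, 1]), ]

 # Or build an ordered factor whose levels are in natural order
 ordered(dat[, 1], levels = mixedsort(unique(dat[, 1])))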
 #[1] chr22 chrY  chr7  chr18 chr13 chr10 chr14 chr3  chr11 chr16 chrX  chr19
#[13] chr12 chr17 chr5  chr9  chr8  chr1  chr15 chr6  chr4  chr21 chr2  chr20
#24 Levels: chr1 < chr2 < chr3 < chr4 < chr5 < chr6 < chr7 < chr8 < ... < chrY
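
If adding a gtools dependency is not an option, the level order can also be derived in base R. The following is only a sketch: `chr_levels` is a hypothetical helper name, and it assumes every label looks like "chr" followed by either a number or a letter such as X or Y.

```r
# Hypothetical helper (not part of any package): return the unique labels of x
# in natural order, numbered chromosomes first, then lettered ones.
chr_levels <- function(x) {
  u <- unique(x)
  suffix <- sub("^chr", "", u)                  # "chr10" -> "10", "chrX" -> "X"
  num <- suppressWarnings(as.numeric(suffix))   # NA for non-numeric suffixes
  num[is.na(num)] <- Inf                        # push X, Y, ... after the numbers
  u[order(num, suffix)]                         # ties (X vs Y) broken alphabetically
}

x <- c("chr10", "chrX", "chr2", "chr1", "chrY")
f <- factor(x, levels = chr_levels(x), ordered = TRUE)
levels(f)
```

For the sample vector this should give the levels chr1, chr2, chr10, chrX, chrY.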
akrun

Just specify the levels for the factor directly.

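# Example data: 100 chromosome labels drawn at random
d <- data.frame(chr=sample(c(1:22, "X", "Y"), 100, replace=TRUE))
# Give the levels explicitly, in the desired order
d$chr <- factor(d$chr, levels=c(1:22, "X", "Y"))
# ordered() keeps the existing level order and marks the factor as ordered
ordered(d$chr)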

The output is

[1] 8  8  4  18 6  4  8  17 14 17 8  Y  16 3  15 22 9  16 11 17 12 17 12 11 18
[26] 16 X  10 15 7  18 6  Y  Y  21 13 21 2  2  Y  21 8  4  21 X  6  12 19 14 10
[51] 7  15 10 19 4  21 20 14 18 4  4  11 7  14 17 17 2  9  1  11 16 17 19 14 1 
[76] 19 12 18 18 13 10 17 21 18 17 Y  Y  4  21 19 17 5  Y  X  7  8  18 22 13 5 
24 Levels: 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 < 10 < 11 < 12 < 13 < ... < Y
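
Since the question is about data.table, the same idea might be applied there roughly as follows. This is a sketch under assumptions: the toy table stands in for whatever fread() returns, the column names `chr` and `pos` are made up for illustration, and the conversion happens after the read rather than during it.

```r
library(data.table)

# Toy table standing in for the result of fread(); column names are illustrative.
dt <- data.table(chr = c("chr10", "chr2", "chrX", "chr1", "chrY", "chr2"),
                 pos = c(5L, 9L, 1L, 7L, 3L, 2L))

# Desired chromosome order, written out explicitly
chr_order <- paste0("chr", c(1:22, "X", "Y"))

# Convert the column by reference (the table itself is not copied)
dt[, chr := factor(chr, levels = chr_order, ordered = TRUE)]

# Key on chromosome, then position; factor columns sort by level order
setkey(dt, chr, pos)

print(dt)
```

After setkey(), the rows should come out as chr1, chr2, chr2, chr10, chrX, chrY, with positions sorted within each chromosome.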
rmccloskey