9

I'm trying to RMA normalize a particular gene expression dataset concerning diffuse large B-cell lymphoma using custom gene-level annotation CDF (chip definition file) files from Brainarray.

Unfortunately, the RMA normalized expression matrix is all NaNs, and I don't understand why.

The dataset (GSE31312) is freely available at the GEO website and uses the Affymetrix HG-U133 Plus 2.0 array platform. I'm using the affy package to perform the RMA normalization.

Since the problem is specific to the dataset, the following R code to reproduce the problem is unfortunately quite cumbersome (2 GB download, 8.8 GB unpacked).

Set the working directory.

setwd("~/Desktop/GEO")

Load the needed packages. Uncomment to install the packages.

#source("http://bioconductor.org/biocLite.R")
#biocLite(pkgs = c("GEOquery", "affy", "AnnotationDbi", "R.utils"))
library("GEOquery") # To automatically download the data
library("affy")
library("R.utils") # For file handling

Download the array data to the work dir.

files <- getGEOSuppFiles("GSE31312")

Untar the data in a dir called CEL

#Sys.setenv(TAR = '/usr/bin/tar') # For (some) OS X uncommment this line
untar(tarfile = "GSE31312/GSE31312_RAW.tar", exdir = "CEL")

Unzip the all .gz files

gz.files <- list.files("CEL", pattern = "\\.gz$", 
                       ignore.case = TRUE, full.names = TRUE)
for (file in gz.files)
  gunzip(file, skip = TRUE, remove = TRUE)


cel.files <- list.files("CEL", pattern = "\\.cel$", 
                       ignore.case = TRUE, full.names = TRUE)

Download, install, and load the custom Brainarray Ensembl ENSG gene annotation package

download.file(paste0("http://brainarray.mbni.med.umich.edu/Brainarray/",
                     "Database/CustomCDF/18.0.0/ensg.download/",
                     "hgu133plus2hsensgcdf_18.0.0.tar.gz"),
              destfile = "hgu133plus2hsensgcdf_18.0.0.tar.gz")
install.packages("hgu133plus2hsensgcdf_18.0.0.tar.gz",
                 repos = NULL, type = "source")
library(hgu133plus2hsensgcdf)

Perform the RMA normalization with and without the custom CDF.

affy.rma <- justRMA(filenames = cel.files, verbose = TRUE)
ensg.rma <- justRMA(filenames = cel.files, verbose = TRUE,
                    cdfname = "HGU133Plus2_Hs_ENSG")

As can be seen, the function returns without warning an expression matrix exprs(ensg.ram) where all entries in the expression matrix are NaN.

sum(is.nan(exprs(ensg.rma))) # == prod(dim(ensg.rma)) == 9964482

Interestingly, there are quite a few NaNs in exprs(affy.rma) when using the standard CDF.

# Show some NaNs
na.rows <- apply(is.na(exprs(affy.rma)), 1, any)
exprs(affy.rma)[which(na.rows)[1:10], 1:4]

# A particular bad probeset:
exprs(affy.rma)["1553575_at", ]   

# There are relatively few NaNs in total (but the really should be none)
sum(is.na(exprs(affy.rma)))  # == 12305

# Probesets of with all NaNs
sum(apply(is.na(exprs(affy.rma)), 1, all))

There really should be none NaNs at all. I've tried using the expresso function to perform background correction only (with no normalization and summarization) which also yield NaNs. So the problem appears to stem from the background correction. However, one might worry that it is one or more bad arrays that is the cause. Can anybody help me track down the origin of the NaNs and get some useful expression values?

Thanks, and best regards

Anders

Edit: A single file appears to be the issue (but not quite)

I decided to check what happens if each .CEL file is normalized independently. What actually happens underneath the RMA hood when justRMA is given a single array I'm not sure of. But I imagine that the quantile normalization step becomes the identity function and the summarization (median polish) simply stops after the first iteration.

Anyway, to perform the one-by-one RMA normalization we run:

ensg <- vector("list", length(cel.files))  # Initialize a list
names(ensg) <- basename(cel.files)
for (file in cel.files) {
  ensg[[basename(file)]] <- justRMA(filenames = file, verbose = TRUE,
                                    cdfname = "HGU133Plus2_Hs_ENSG")
  cat("File", which(file == cel.files), "done.\n\n")
}

# Extract the expression matrices for each file and combine them
X <- as.data.frame(do.call(cbind, lapply(ensg, exprs)))

By looking at head(X) it appears that GSM776462.CEL is all NaNs. Indeed, it almost is:

sum(!is.nan(X$GSM776462.CEL)) # == 18

Only 18 of 20009 is not NaN. Next, I count the number of NaN appearing other places

sum(is.na(X[, -grep("GSM776462.CEL", colnames(X))])) # == 0

which is 0. Hence GSM776462.CEL appears to be the culprit.

But the regular CDF annotation does not give any problems:

affy <- justRMA(filenames = "CEL/GSM776462.CEL")
any(is.nan(exprs(affy))) # == FALSE

It is also weird, that using regular CDF has seemingly randomly scattered NaNs in the expression matrix. I still don't quite know what to make of this.

Edit2: NaNs vanish when excluding sample but not with standard CDF

For what it is worth, when I exclude the GSM776462.CEL file and RMA normalize with and without the custom CDF files the NaNs only disappear in the custom CDF case.

# Trying with all other CEL than GSM776462.CEL
cel.files2 <- grep("GSM776462.CEL", cel.files, invert = TRUE, value = TRUE)
affy.rma.no.776462 <- justRMA(filenames = cel.files2)
ensg.rma.no.776462 <- justRMA(filenames = cel.files2, verbose = TRUE,
                              cdfname = "HGU133Plus2_Hs_ENSG")

sum(is.na(exprs(affy.rma.no.776462))) # == 12275
sum(is.na(exprs(ensg.rma.no.776462))) # == 0

Puzzling.

Edit3: No NAs or NaNs in "raw data"

For what it is worth, I've tried to read in the raw probe-level expression values to check if they contain NAs or NaNs. The following goes through the .CEL-files one-by-one and checks for any missing values.

for (file in cel.files) {
  affybtch <- suppressWarnings(read.affybatch(filenames = file))
  tmp <- exprs(affybtch)
  cat(file, "done.\n")
  if (any(is.na(tmp)))
    stop(paste("NAs or NaNs are present in", file))
}

No NAs or NaNs are found.

Anders Ellern Bilgrau
  • 9,928
  • 1
  • 30
  • 37

0 Answers0