I downloaded some Illumina 450k methylation datasets from the Gene Expression Omnibus (GEO).

The R Bioconductor packages minfi and ChAMP seem to require something called a "sample sheet".

Most TAR files on GEO do not seem to contain such a sample sheet; they only contain the .idat files.

Would any kind soul provide some advice? I would like to know how to run the ChAMP / minfi pipeline without a sample sheet or, failing that, whether there is any way to generate the sample sheet from the .idat files.

Thanks!

4 Answers

I had a similar issue with a GEO project. I downloaded all of the .idat files and put them in their own folder. Then I used the code below to parse the .idat filenames and create a sample sheet.

It parses a filename like GSM1855609_9020331147_R02C02_Grn.idat and prints the pieces as comma-separated values, which you can redirect to a .csv file. Then you can read the .csv file into R, add the standardized column names (c("Sample_Name", "Sentrix_ID", "Sentrix_Position")) that a function like logger wants to see, and you're on your way.

Hope this helps!

#!/usr/bin/env python
# Build sample-sheet rows from the .idat filenames in the current directory.
import os

# Get your current working directory.
cwd = os.getcwd()

# List everything in the directory, keeping only the .idat files so that this
# script and any other files or directories are skipped.
filenames = [name for name in os.listdir(cwd) if name.endswith(".idat")]

# Split each name on underscores ("_") and keep the first three chunks
# (sample name, Sentrix ID, Sentrix position). Every sample has both a _Grn
# and a _Red file, so a set is used to keep a single row per sample.
chunked_names = sorted({tuple(filename.split("_")[0:3]) for filename in filenames})

# For each name, rejoin the three chunks with commas.
# This gives a list of strings.
csv_lines = [",".join(chunks) for chunks in chunked_names]

# Join all of those strings with the newline character to get one long string.
contents = "\n".join(csv_lines)

# Print this string to standard output so that it can be redirected to a file.
print(contents)
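
As a variation on the script above, here is a minimal sketch that writes the header row directly and ignores anything that does not look like a GEO-style .idat name. The function name build_sample_sheet and the regular expression are my own assumptions, not part of the original answer; the pattern expects names shaped like GSM1855609_9020331147_R02C02_Grn.idat (optionally gzipped), and files named differently will simply be skipped.

```python
import csv
import io
import re

# Assumed GEO-style layout: <GSM>_<Sentrix_ID>_<Sentrix_Position>_<Grn|Red>.idat[.gz]
IDAT_PATTERN = re.compile(r"^(GSM\d+)_(\d+)_(R\d{2}C\d{2})_(?:Grn|Red)\.idat(?:\.gz)?$")


def build_sample_sheet(filenames):
    """Return CSV text with one header row and one row per sample (hypothetical helper)."""
    # Each sample has a Grn and a Red file, so a set collapses them to one row.
    rows = sorted({m.groups() for m in map(IDAT_PATTERN.match, filenames) if m})
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Sample_Name", "Sentrix_ID", "Sentrix_Position"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    import os
    # Print the sheet for the current directory; redirect to a .csv file as before.
    print(build_sample_sheet(os.listdir(os.getcwd())), end="")
```

Run from the folder holding the .idat files, this prints a sheet that already carries the three column names, so the renaming step in R should not be needed.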
dlapato

This is how I get the sample sheet and read idats into RGSet objects:

#using pacman to install and load packages
if (!require("pacman")) install.packages("pacman")
pacman::p_load("GEOquery","minfi")

#increase file download timeout
options(timeout = 600)

#download GEO object
gse <- getGEO("GSE12345", GSEMatrix = TRUE)
#get phenotype data - sample sheet
pd = pData(gse[[1]])

#get raw data - idats, processed beta matrix, etc.
getGEOSuppFiles("GSE12345")
#decompress idats
untar("GSE12345/GSE12345_RAW.tar", exdir = "GSE12345/idat")
#list files
head(list.files("GSE12345/idat", pattern = "idat"))
idatFiles <- list.files("GSE12345/idat", pattern = "idat.gz$", full = TRUE)
#decompress individual idat files
sapply(idatFiles, gunzip, overwrite = TRUE)
#read idats and create RGSet
RGSet <- read.metharray.exp("GSE12345/idat")

saveRDS(RGSet, "RGSet_GSE12345.RDS")
Dharman

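If you want to read in all the .idat files from a directory, you can just use the following (newer minfi releases name this function read.metharray.exp; read.450k.exp is the older name):

my_450k <- read.450k.exp(base = "path/to/directory", recursive = TRUE)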

At some stage you'll still need to match the phenotype data to the 450k data by the sample barcodes.
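
One way to picture that matching step is sketched below in Python. It assumes GEO-style filenames that start with the GSM accession and phenotype records keyed by that same accession; the helper names basenames_by_gsm and attach_phenotypes and the "tissue" field are hypothetical and only there for illustration. The "Basename" key mirrors the column minfi looks for when it is given a targets table.

```python
import os


def basenames_by_gsm(idat_paths):
    """Map each GSM accession to its IDAT basename (path without _Grn/_Red.idat)."""
    mapping = {}
    for path in idat_paths:
        for suffix in ("_Grn.idat", "_Red.idat"):
            if path.endswith(suffix):
                base = path[: -len(suffix)]
                # Assumption: the filename begins with the GSM accession.
                gsm = os.path.basename(base).split("_")[0]
                mapping[gsm] = base
    return mapping


def attach_phenotypes(mapping, phenotypes):
    """Join phenotype records to basenames and list accessions present on one side only."""
    matched = {
        gsm: {"Basename": base, **phenotypes[gsm]}
        for gsm, base in mapping.items()
        if gsm in phenotypes
    }
    unmatched = sorted(set(mapping) ^ set(phenotypes))
    return matched, unmatched
```

Checking the unmatched list before going further is a cheap way to catch samples whose files are missing or whose phenotype rows were dropped.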

Nick Kennedy

The newer methylprep Python package has a function to download GEO datasets. It works for most series, even though many of them do not have the same types of files in their archives.

methylprep also has a sample_sheet command-line option that can create a sample sheet, if you need one to feed into minfi. Like so:

 python -m methylprep -v sample_sheet -d ~/GSE133062/GSE133062 --create

(where -d specifies the path to your unzipped .idat files)

More examples here: https://readthedocs.com/projects/life-epigenetics-methylprep/

Marc Maxmeister
  • Full disclosure: I'm the maintainer for methylprep. A big focus of this package is simplifying use of NIH's GEO data repository. – Marc Maxmeister May 20 '20 at 02:51