
I am Moritz from the University of Heidelberg in Germany.

For my bachelor's thesis I have 20 large (25-30 GB) genome files (.txt.gz) from patients with hepatocellular carcinoma. I have Bpipe installed on my Ubuntu server, which I am using to try out several approaches.

The steps are:

  • Alignment with BWA against hg19.fasta (converting sai to sam)
  • Transform (samtools)
  • Dedupe

My problem is that to test my Bpipe workflow I have to take a whole 30 GB file and start from the beginning, which takes a lot of time. So my questions are:

How can I shorten one file?

Where can I find a short sequence that I can use to test my pipeline?

moritz

1 Answer


You can find many cancer sequencing datasets at the NCBI SRA (Sequence Read Archive):

http://www.ncbi.nlm.nih.gov/sra?term=cancer

SRA-formatted sequence files can be converted to FASTQ with "fastq-dump" and then aligned with BWA:

http://azaleasays.com/2011/09/09/convert-sra-format-to-fastq/
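
As for shortening one of the existing files, one common approach is to keep only the first N reads. Below is a minimal sketch, assuming the .txt.gz files are gzip-compressed FASTQ with exactly four lines per read (the question does not state the format). The function name `head_fastq` and the default of 100,000 reads are illustrative choices, not part of any tool.

```python
import gzip
from itertools import islice


def head_fastq(src, dst, n_reads=100_000):
    """Copy the first n_reads records of a gzipped FASTQ file to dst.

    Assumption: every record is exactly four lines (header, sequence,
    '+', qualities), i.e. sequences are not wrapped over several lines.
    Returns the number of complete records written.
    """
    written = 0
    with gzip.open(src, "rt") as fin, gzip.open(dst, "wt") as fout:
        for _ in range(n_reads):
            record = list(islice(fin, 4))  # read one four-line record
            if len(record) < 4:            # end of file or truncated record
                break
            fout.writelines(record)
            written += 1
    return written
```

A call such as `head_fastq("patient01.txt.gz", "patient01_test.txt.gz")` (hypothetical file names) would give a small file for testing the pipeline end to end. If the data are paired-end, using the same `n_reads` on both files keeps the mates in sync, provided both files list the reads in the same order.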

gani