How do I shorten a genome sequence file to check that my workflow is functioning properly?

Question

I am Moritz from Heidelberg University in Germany.

For my bachelor thesis I have 20 large (25-30 GB) genome files (.txt.gz) from patients with hepatocellular carcinoma. I have Bpipe installed on my Ubuntu server, which I use to try out several approaches.

Steps included are:

Alignment against hg19.fasta with BWA (including the sai and sam transform steps)
Transform (samtools)
Dedupe

The problem is that to test my Bpipe workflow, I have to run a whole 30 GB file from the beginning, which takes a lot of time. So my questions are:

How can I shorten one file?

Where can I find a short sequence that I can use to test my pipeline?

try asking www.biostars.org – Stylize Jul 14 '13 at 15:36

score 0 · Answer 1 · answered Jul 24 '13 at 20:09

You can find many cancer sequence datasets at the NCBI SRA (Sequence Read Archive):

http://www.ncbi.nlm.nih.gov/sra?term=cancer

The SRA-formatted sequence files can be converted to FASTQ using "fastq-dump", and the resulting reads can then be aligned with BWA:

http://azaleasays.com/2011/09/09/convert-sra-format-to-fastq/
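
To shorten one of your own files instead, one option is to copy only the first N reads into a new, smaller file and run the pipeline on that. The following is a minimal sketch, not a tested part of any pipeline. It assumes the .txt.gz files are gzipped FASTQ with exactly four lines per read, which the question does not confirm. The function name head_fastq and the file names in the usage comment are made up for illustration.

```python
import gzip
from itertools import islice


def head_fastq(src_path, dst_path, n_reads):
    """Copy the first n_reads records of a gzipped FASTQ file to dst_path.

    Assumption: every record is exactly four lines (header, sequence,
    '+' line, qualities). Returns the number of complete records written.
    """
    n_lines = 4 * n_reads
    written = 0
    # Text mode streams the file line by line, so the 30 GB input is
    # never loaded into memory and reading stops after n_lines lines.
    with gzip.open(src_path, "rt") as src, gzip.open(dst_path, "wt") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written // 4


# Hypothetical usage: keep the first 100,000 reads of one patient file.
# head_fastq("patient01.txt.gz", "patient01.small.txt.gz", 100_000)
```

If the data are paired-end, running this with the same N on both files of a pair should keep the mates in sync, provided the two files list their reads in the same order.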
