
I have a ~300 GB zipped vcf file (.vcf.gz) which contains the genomes of about 700 dogs. I am only interested in a few of these dogs and I do not have enough space to unzip the whole file at this time, although I am in the process of getting a computer to do this. Is it possible to unzip only parts of the file to begin testing my scripts?

I am trying to extract a specific SNP at a given position for a subset of the samples. I have tried using bcftools, to no avail. (If anyone can identify what went wrong with that I would also really appreciate it. I created an empty file for the output (722g.990.SNP.INDEL.chrAll.vcf.bgz), but it returns the following error.)

bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.bgz

The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised

I am planning on trying awk, but need to unzip the file first. Is it possible to partially unzip it so I can try this?

Timur Shtatland
Sean
  • perhaps it's just a typo `.vcf.bgz` -> `.vcf.bz`? – chickity china chinese chicken May 08 '19 at 18:40
  • Unfortunately not, I have tried a bunch of different output file types. I wish it was as simple as a typo... – Sean May 08 '19 at 18:42
  • I mean, are you sure `"722g.990.SNP.INDEL.chrAll.vcf.bgz"` is in the output error message? Because that term is not in the command you provided. – chickity china chinese chicken May 08 '19 at 18:43
  • @davedwards you’re right, the error message is mismatched with the command but both don’t work. I’ll fix that typo – Sean May 08 '19 at 18:45
  • What is the exact (verbatim) error message from `bcftools`? If you run the `file` command on the input file, what does it print? – Jukka Matilainen May 08 '19 at 19:59
  • @JukkaMatilainen Here is the error message verbatim: The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised and the result of the file command is: 722g.990.SNP.INDEL.chrAll.vcf.gz: gzip compressed data, extra field – Sean May 08 '19 at 20:03
  • Did you know you can use `bcftools view` with either the `-s` (list samples on command line) or `-S` (list samples in a file) flag to select just a few samples from your file? – jena Aug 03 '21 at 10:43

2 Answers


Double check your command line for bcftools view.

  1. The error message 'The output type "something" not recognised' is printed by bcftools when you specify an invalid value for the -O (upper-case O) command line option, like this: -O something. Based on the error message you are getting, it seems that you might have put the file name there.

  2. Check that you don't have your input and output file names the wrong way around in your command. Note that the -o (lower-case o) command line option specifies the output file name, and the file name at the end of the command line is the input file name.

Also, you write that you created an empty file for the output. You don't need to do that; bcftools will create the output file.
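To make the two points above concrete, here is a minimal sketch of how the command could look with the input and output in their proper places. The output file name and the sample IDs (DOG_A, DOG_B) are placeholders, not names from the question; the real sample IDs can be listed with `bcftools query -l`. It assumes the input is bgzip-compressed and has an index, which the `-r` option needs.

```shell
# Input is the existing compressed, indexed VCF; output is a NEW file.
input=722g.990.SNP.INDEL.chrAll.vcf.gz
output=chr9_region_subset.vcf.gz   # placeholder output name (assumption)
samples=DOG_A,DOG_B                # placeholder sample IDs (assumption)

if command -v bcftools >/dev/null 2>&1 && [ -f "$input" ]; then
    # -O z : output type (compressed VCF)   -o : output file name
    # -s   : keep only the listed samples   input file goes last
    bcftools view -f PASS --threads 8 \
        -r chr9:55252802-55252810 \
        -s "$samples" \
        -O z -o "$output" \
        "$input"
    # Look at the records without writing an uncompressed copy to disk.
    bcftools view -H "$output" | head
else
    echo "bcftools or $input not found; nothing was run" >&2
fi
```

Because only the requested region and samples are written, the result should be small enough to test scripts on without unzipping the full file.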

Jukka Matilainen
  • Okay awesome, I didn’t realize that those were supposed to be the other way around. That error message seems to have disappeared. But now it is saying: “[W::hts_idx_load2] The index file is older than the data file: Yadayada.vcf.gz.tbi”. Do you know what’s going on here? – Sean May 09 '19 at 16:58
  • I think it's possible it was a false positive error because I got an output file with data. Not sure how to view that data though... – Sean May 09 '19 at 17:18
  • That message just says that the index you have was generated before the file it is supposed to index. The index probably won't match the VCF file and will yield wrong results if used. Use `tabix` to regenerate the index and then retry the `bcftools` command. – Poshi May 10 '19 at 07:58
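A minimal sketch of regenerating the index as suggested in the last comment, assuming the data file is the one named in the question. On a file this size it may take a while.

```shell
vcf=722g.990.SNP.INDEL.chrAll.vcf.gz

if command -v tabix >/dev/null 2>&1 && [ -f "$vcf" ]; then
    # -p vcf : use the VCF preset   -f : overwrite the existing .tbi
    tabix -f -p vcf "$vcf"
else
    echo "tabix or $vcf not found; nothing was run" >&2
fi
```

If the data file has not actually changed and only its timestamp is newer (for example after copying it), updating the timestamp of the index with `touch "$vcf.tbi"` is a common way to silence the warning, but rebuilding is the safer choice when in doubt.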

I don't have much experience with bcftools, but generally, if you want to use awk on a gzipped file, you can pipe the decompressed stream into it so that the file is only unzipped as it is read and never written to disk uncompressed. You can also pipe the result straight through gzip so that the output is compressed too, e.g.

gzip -cd largeFile.vcf.gz | awk '{ <some awk> }' | gzip -c > newfile.txt.gz

Also, zcat is equivalent to gzip -cd: -c writes the output to standard output and -d decompresses.
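As an illustration of what `<some awk>` might look like for a VCF, here is a rough sketch that keeps the header, the nine fixed VCF columns, and only the named sample columns, for records in a given position range. The function name `subset_vcf` and its argument order are made up for this example.

```shell
subset_vcf() {
    # $1 = .vcf.gz path, $2 = chromosome, $3/$4 = first/last position,
    # $5 = comma-separated sample IDs to keep
    gzip -cd "$1" | awk -v chrom="$2" -v lo="$3" -v hi="$4" -v samples="$5" '
        BEGIN {
            FS = OFS = "\t"
            n = split(samples, want, ",")
            for (i = 1; i <= n; i++) keep[want[i]] = 1
        }
        /^##/ { print; next }          # meta-information lines pass through
        /^#CHROM/ {                    # column header: pick the columns to keep
            for (i = 1; i <= NF; i++)
                if (i <= 9 || (($i) in keep)) cols[++ncol] = i
        }
        /^#CHROM/ || ($1 == chrom && $2 + 0 >= lo + 0 && $2 + 0 <= hi + 0) {
            line = $(cols[1])
            for (j = 2; j <= ncol; j++) line = line OFS $(cols[j])
            print line
        }
    '
}
```

It could be called as `subset_vcf largeFile.vcf.gz chr9 55252802 55252810 DOG_A,DOG_B | gzip -c > subset.vcf.gz`, where the sample IDs are placeholders. Note that this reads the compressed file from the beginning, so on a very large file it will be much slower than an indexed lookup.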

As a side note, if you only need to look at part of a large file, you may also find the pager less very handy: it lets you view a large file while loading only the parts you are looking at. The -S option is particularly useful for wide formats with many columns because it stops line wrapping, and -N shows line numbers. (If less on your system is not set up to decompress .gz files automatically, zless can be used instead.)

less -S largefile.vcf.gz 

Quit the view with q; g takes you to the top of the file.

Richard J. Acton