
I just sat down to write my first Nim script to parse a .vcf (Variant Call Format) file. This file format stores genetic mutations from sequencing data.

For scripting languages, I 'grew up' on Perl and later migrated to Python, but I would love to use a language with the speed that Nim offers. I realize Nim is still young, but I couldn't even find a clear example for how to open and read a .gz (gzip) file (preferably line by line).

Can anyone provide a simple example to open and read a gzip file using Nim, line by line?

In Python, I'm accustomed to the following (uber-simple) code:

import gzip

my_file = gzip.open('my_file.vcf.gz', 'rt')  # 'rt' = read in text mode ('w' would overwrite the file)
for line in my_file:
    pass  # do something

my_file.close()
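
For reference, here is a minimal sketch of what such a loop usually has to do for a VCF file, which is also the behaviour a Nim version would need to reproduce. It assumes the standard VCF layout, where lines starting with "#" are meta-information or the column header and data lines are tab-separated. The helper name iter_vcf_records is hypothetical and used only for illustration.

```python
import gzip


def iter_vcf_records(path):
    """Yield each data line of a gzipped VCF as a list of tab-separated fields."""
    # "rt" opens the archive for reading in text mode, so lines arrive as str.
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Lines starting with "#" are meta-information or the column header.
            if line.startswith("#"):
                continue
            yield line.rstrip("\n").split("\t")
```

Each yielded list follows the VCF column order, so fields[0] is CHROM, fields[1] is POS, fields[3] is REF and fields[4] is ALT. The with block closes the file even if the loop exits early.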

I have seen related questions, but they're not clear. The posts are also relatively old and I hope/suspect something better has come about. Here's what I've found:

  1. Read gzip-compressed file line by line
  2. File, FileStream, and GZFileStream
  3. Reading files from tar.gz archive in Nim

Really appreciate it.

P.S. I also think it would be useful if someone created a Nim tag in StackOverflow. I do not have the reputation to create tags.

Timur Shtatland
Mark Ebbert
  • There is an official [zip wrapper](https://github.com/nim-lang/zip). You might want to have a look at the [tests](https://github.com/nim-lang/zip/blob/master/tests/gziptests.nim) – Maurice Meyer Aug 09 '19 at 12:59
  • The Nim tag was renamed Nim-lang some days ago. – xbello Aug 10 '19 at 13:25
  • @xbello, thank you. That must be why they don't let 'unreputable' characters like myself create new tags. – Mark Ebbert Aug 10 '19 at 16:02

3 Answers


Just in case you need to handle VCF rather than .gz, there's a nice wrapper for htslib written by Brent Pedersen:

https://github.com/brentp/hts-nim

You need to install htslib on your system, and then either add requires "hts" to your .nimble file or install the library with nimble install hts. If you are going to do NGS analysis in Nim, you'll need it.

The code you need:

import hts

var v: VCF
doAssert open(v, "myfile.vcf.gz")
# The VCF file is now open in v, and the header is available through the
# v.header property

for record in v:
    # Each iteration yields one record per line, e.g. extract the REF and ALT alleles:
    echo record.REF, " ", record.ALT

v.close()

Be sure to follow the docs, because some things differ from Python, especially when getting the INFO and FORMAT fields.

Check out Brent's whole repo. It has plenty of wrappers, code samples and utilities to handle NGS problems (e.g. an ultrafast coverage tool called Mosdepth).

xbello

Per suggestion from Maurice Meyer, I looked at the tests for the Nim zip package. It turned out to be quite simple. This is my first Nim script, so my apologies if I didn't follow convention, etc.

import zip/gzipfiles  # Import zip package

block:
  let vcf = newGzFileStream("my_file.vcf.gz")  # Open gzip file
  defer: vcf.close()  # Close file on scope exit (like a 'finally' clause in a 'try' block)

  var line: string  # Declare line variable

  # Loop over each line in the file
  while not vcf.atEnd():
    line = vcf.readLine()

    # Cure disease with my VCF file

The zip package is already in the Nim package library, so to install it I simply ran:

> nimble refresh
> nimble install zip
Mark Ebbert

I tried to use Nim some time ago to parse a fastq or fastq.gz file.

The code should be available here: https://gitlab.pasteur.fr/bli/qaf_demux/blob/master/Nim/src/qaf_demux.nim

I don't remember exactly how this works, but apparently, I did an import zip/gzipfiles and used newGZFileStream on the input file name to obtain a Stream from which lines can be read using .readLine() in this piece of code:

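import streams

# Assumption: Fastq is not defined in this excerpt. An array of three
# strings is consistent with the yield statement below.
type Fastq = array[3, string]

proc fastqParser(stream: Stream): iterator(): Fastq =
  # Return a closure iterator that yields one record per four input lines
  result = iterator(): Fastq =
    var
      nameLine: string
      nucLine: string
      quaLine: string
    while not stream.atEnd():
      nameLine = stream.readLine()
      nucLine = stream.readLine()
      discard stream.readLine()  # skip the '+' separator line
      quaLine = stream.readLine()
      yield [nameLine, nucLine, quaLine]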

It is used in something that amounts to this piece of code:

let inputFqs = fastqParser(newGZFileStream($inFastqFilename))

Hopefully you can adapt this to your case.
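
For readers coming from Python, the closure iterator above behaves much like a generator. The following is a rough Python equivalent, offered only as a comparison and not as a translation of the linked script. The names fastq_parser and read_fastq_gz are hypothetical, and the sketch assumes well-formed input with exactly four lines per record.

```python
import gzip


def fastq_parser(stream):
    """Yield (name, nucleotides, qualities) for each four-line FASTQ record."""
    while True:
        name_line = stream.readline()
        if not name_line:  # an empty string signals the end of the stream
            return
        nuc_line = stream.readline()
        stream.readline()  # the "+" separator line is read and discarded
        qua_line = stream.readline()
        yield (name_line.rstrip("\n"), nuc_line.rstrip("\n"), qua_line.rstrip("\n"))


def read_fastq_gz(path):
    """Open a gzipped FASTQ file in text mode and yield its records."""
    with gzip.open(path, "rt") as stream:
        yield from fastq_parser(stream)
```

As in the Nim version, the parser only needs something that can hand back one line at a time, so the same function works on a plain file, a gzip stream, or an in-memory buffer.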

My .nimble file has a requires "zip#head". I suppose this triggers the installation of zip/gzipfiles.

bli