
For starters, I am new to bioinformatics and especially to programming, but I have built a script that goes through a so-called VCF file (only the individuals are included, one column = one individual) and uses a search string to determine, for every variant (line), whether each individual is homozygous or heterozygous.

This script works, at least on small subsets, but I know it stores everything in memory. I would like to run it on very large zipped files (even whole genomes), but I do not know how to turn it into a script that does everything line by line (because I want to count through whole columns, I just do not see how to solve that).

So the output is five things per individual (total variants, number of homozygotes, number of heterozygotes, and the proportions of homo- and heterozygotes). See the code below:

#!/usr/bin/env python
import re
import gzip

subset_cols = 'subset_cols_chr18.vcf.gz'
#nuc_div = 'nuc_div_chr18.txt'

gz_infile = gzip.GzipFile(subset_cols, "r")  
#gz_outfile = gzip.GzipFile(nuc_div, "w") 

# make a dictionary of the header line for easy retrieval of elements later on

headers = gz_infile.readline().rstrip().split('\t')             
print headers                                                   

column_dict = {}                                        
for header in headers:
        column_dict[header] = []                        
for line in gz_infile:                                     
        columns = line.rstrip().split('\t')             
        for i in range(len(columns)):                   
                c_header=headers[i]                     
                column_dict[c_header].append(columns[i])
#print column_dict

for key in column_dict:                         
        number_homozygotes = 0          
        number_heterozygotes = 0        

        # this search string contains the regexp (this regexp was tested)
        SearchStr = r'(\d)/(\d):\d+,\d+:\d+:\d+:\d+,\d+,\d+'
        for values in column_dict[key]:
                Result = re.search(SearchStr, values)
                if Result:
                        # here, it will skip the missing genotypes ./.
                        variant_one = int(Result.group(1))              
                        variant_two = int(Result.group(2))              

                        if variant_one == 0 and variant_two == 0:
                                # skip homozygous reference genotypes (0/0)
                                continue
                        elif variant_one == variant_two:
                                # count +1 when both alleles are equal (1/1, 2/2, etc.; 0/0 was skipped above)
                                number_homozygotes += 1
                        else:
                                # count +1 when the alleles differ (1/0, 0/1, etc.)
                                number_heterozygotes += 1

        print "%s homozygotes %s" % (number_homozygotes, key) 
        print "%s heterozygotes %s" % (number_heterozygotes,key)

        variants = number_homozygotes + number_heterozygotes
        print "%s variants" % variants

        if variants > 0:  # avoid dividing by zero for columns with no called variants
                prop_homozygotes = (1.0*number_homozygotes/variants)*100
                prop_heterozygotes = (1.0*number_heterozygotes/variants)*100

                print "%s %% homozygous %s" % (prop_homozygotes, key)
                print "%s %% heterozygous %s" % (prop_heterozygotes, key)

Any help will be much appreciated so I can go on investigating large datasets, thank you :)

The VCF file, by the way, looks something like this (a header line, then one data line per variant):

INDIVIDUAL_1    INDIVIDUAL_2    INDIVIDUAL_3
0/0:9,0:9:24:0,24,221   1/0:5,4:9:25:25,0,26    1/1:0,13:13:33:347,33,0

So that is the header line with the individual ID names (I have 33 individuals in total, with more complicated ID tags; I simplified them here), followed by many of these information lines with the same specific pattern. I am only interested in the first part with the slash, hence the regular expression.

visse226
    It would help if you could edit the question to include a sample from the top of a VCF file, and the expected results. – Martin Evans Nov 10 '16 at 14:28
  • What version of Python are you using? – amirouche Nov 11 '16 at 12:08
  • Basically what you want is to decompress a csv file step by step and yield rows – amirouche Nov 11 '16 at 12:19
  • An example of the VCF file would be helpful. TIA. – amirouche Nov 11 '16 at 12:20
  • I added a simplified version of the data in the VCF file, but seeing your code below I think you already figured it out yourself :) Usually there is also regular information per line (also in columns, before the individual columns), but I will filter that out before applying this script. – visse226 Nov 14 '16 at 15:21
  • In addition to the previous comment: I use Python 2.7.5, but I can also switch to Python 3 if necessary. – visse226 Nov 14 '16 at 15:28

2 Answers


Disclosure: I work full-time on the Hail project.

Hi there! Welcome to programming and bioinformatics!

amirouche correctly identifies that you need some sort of "streaming" or "line-by-line" algorithm to handle data that is too large to fit in the RAM of your machine. Unfortunately, if you are limited to python without libraries, you have to manually chunk the file and handle parsing of a VCF.
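
If you do want to stay with plain Python for now, a minimal line-by-line sketch of that idea might look like the following (not Hail, just an illustration): it streams the gzipped file with Python 3's gzip.open, keeps one pair of counters per individual instead of whole columns, and assumes, like the regex in your script, that each sample field starts with a d/d genotype; the file name is just a placeholder.

#!/usr/bin/env python3
import gzip
import re

# genotype at the start of each sample field, e.g. "0/1:..."
# (simplified from the regex in the question; adapt it to your data)
GENOTYPE = re.compile(r'(\d)/(\d):')

with gzip.open('subset_cols_chr18.vcf.gz', 'rt') as infile:
    headers = infile.readline().rstrip().split('\t')
    # one [homozygotes, heterozygotes] pair per individual, updated line by line
    counts = {header: [0, 0] for header in headers}
    for line in infile:
        for header, value in zip(headers, line.rstrip().split('\t')):
            match = GENOTYPE.match(value)
            if not match:                 # skips missing genotypes such as ./.
                continue
            a, b = match.group(1), match.group(2)
            if a == '0' and b == '0':     # skip 0/0, as the original script does
                continue
            counts[header][0 if a == b else 1] += 1

for header, (n_hom, n_het) in counts.items():
    variants = n_hom + n_het
    if variants:
        print(header, n_hom, n_het, variants,
              100.0 * n_hom / variants, 100.0 * n_het / variants)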

The Hail project is a free, open-source tool for scientists with genetic data ranging from too big to fit in RAM all the way up to too big to fit on one machine (i.e. tens of terabytes of compressed VCF data). Hail can take advantage of all the cores on one machine or all the cores on a cloud of machines. Hail runs on Mac OS X and most flavors of GNU/Linux. Hail exposes a statistical genetics domain-specific language which makes your question much shorter to express.

The Simplest Answer

The most faithful translation of your python code to Hail is this:

/path/to/hail importvcf -f YOUR_FILE.vcf.gz \
  annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
  annotatesamples expr -c \
    'sa.pHom =  sa.nHom / sa.nCalled,
     sa.pHet =  sa.nHet / sa.nCalled' \
  exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv

I ran the above command on my dual-core laptop on a 2.0GB file:

# ls -alh profile225.vcf.bgz
-rw-r--r--  1 dking  1594166068   2.0G Aug 25 15:43 profile225.vcf.bgz
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
  annotatesamples expr -c \
    'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()' \
  annotatesamples expr -c \
    'sa.pHom =  sa.nHom / sa.nCalled,
     sa.pHet =  sa.nHet / sa.nCalled' \
  exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:=======================================================> (63 + 2) / 65]hail: info: Coerced sorted dataset
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
     sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
     sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 1:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom =  sa.nHom / sa.nCalled,
     sa.pHet =  sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: while importing:
    file:/Users/dking/projects/hail-data/profile225.vcf.bgz  import clean
hail: info: timing:
  importvcf: 34.211s
  annotatesamples expr: 6m52.4s
  annotatesamples expr: 21.399ms
  exportsamples: 121.786ms
  total: 7m26.8s
# head sampleInfo.tsv 
sample  pHomRef pHet    nHom    nHet    nCalled
HG00096 9.49219e-01 5.07815e-02 212325  11359   223684
HG00097 9.28419e-01 7.15807e-02 214035  16502   230537
HG00099 9.27182e-01 7.28184e-02 211619  16620   228239
HG00100 9.19605e-01 8.03948e-02 214554  18757   233311
HG00101 9.28714e-01 7.12865e-02 214283  16448   230731
HG00102 9.24274e-01 7.57260e-02 212095  17377   229472
HG00103 9.36543e-01 6.34566e-02 209944  14225   224169
HG00105 9.29944e-01 7.00564e-02 214153  16133   230286
HG00106 9.25831e-01 7.41687e-02 213805  17128   230933

Wow! Seven minutes for 2GB, that's slow! Unfortunately, this is because VCFs aren't a great format for data analysis!

Optimizing the Storage Format

Let's convert to Hail's optimized storage format, a VDS, and re-run the command:

# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:========================================================>(64 + 1) / 65]hail: info: Coerced sorted dataset
hail: info: running: write -o profile225.vds
[Stage 1:>                                                         (0 + 4) / 65]
[Stage 1:========================================================>(64 + 1) / 65]
# ../hail/build/install/hail/bin/hail read -i profile225.vds \
       annotatesamples expr -c \
         'sa.nCalled = gs.filter(g => g.isCalled).count(),
          sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
          sa.nHet = gs.filter(g => g.isHet).count()' \
       annotatesamples expr -c \
         'sa.pHom =  sa.nHom / sa.nCalled,
          sa.pHet =  sa.nHet / sa.nCalled' \
       exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 1:>                                                          (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>              (3 + 1) / 4]hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
         sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
         sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom =  sa.nHom / sa.nCalled,
         sa.pHet =  sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: timing:
  read: 2.969s
  annotatesamples expr: 1m20.4s
  annotatesamples expr: 21.868ms
  exportsamples: 151.829ms
  total: 1m23.5s

About five times faster! With regard to larger scale, running the same command on the Google cloud on a VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, about 315GB compressed) took 3m42s using 328 worker cores.

Using a Hail Built-in

Hail also has a sampleqc command which computes most of what you want (and more!):

../hail/build/install/hail/bin/hail  read -i profile225.vds \
      sampleqc \
      annotatesamples expr -c \
        'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
         sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled' \
      exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 0:>                                                          (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================>              (3 + 1) / 4]hail: info: running: sampleqc
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
         sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: timing:
  read: 2.928s
  sampleqc: 1m27.0s
  annotatesamples expr: 229.653ms
  exportsamples: 353.942ms
  total: 1m30.5s

Installing Hail

Installing Hail is pretty easy and we have docs to help you. Need more help? You can get real-time support in the Hail users chat room or, if you prefer forums, the Hail Discourse (both are linked from the home page; unfortunately I don't have enough reputation to create real links).

The Near Future

In the very near future (less than one month from today), the Hail team will complete a Python API which will allow you to express the first snippet as:

result = (importvcf("YOUR_FILE.vcf.gz")
    .annotatesamples('''sa.nCalled = gs.filter(g => g.isCalled).count(),
                        sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
                        sa.nHet = gs.filter(g => g.isHet).count()''')
    .annotatesamples('''sa.pHom = sa.nHom / sa.nCalled,
                        sa.pHet = sa.nHet / sa.nCalled'''))

for x in result.sampleannotations:
    print("Sample " + x +
          " nCalled " + x.nCalled +
          " nHom " + x.nHom +
          " nHet " + x.nHet +
          " percent Hom " + x.pHom * 100 +
          " percent Het " + x.pHet * 100)

result.sampleannotations.write("sampleInfo.tsv")

EDIT: Added the output of head on the tsv file.

EDIT2: Latest Hail doesn't need biallelic for sampleqc

EDIT3: Note about scaling to the cloud with hundreds of cores

Daniel King
  • Wow, seriously, thanks for this elaborate help!! The language still looks a little overwhelming to me (I am new to all this and have never heard of Hail). I'll see whether I can work with this, thanks! – visse226 Nov 14 '16 at 15:13
  • Sorry, I accidentally hit enter. If you need any help installing hail or understanding the language, head to the chat room I linked above. I think you'll find the questions you want to ask easier to express in hail as compared to python. – Daniel King Nov 14 '16 at 16:25

To be able to process a dataset that is bigger than RAM, you need to rework your algorithm to process the data line by line; right now you are processing whole columns at once.

But before that you need a way to stream the rows from the gzip'ed file.

The following Python 3 code does that:

"""https://stackoverflow.com/a/40548567/140837"""
#!/usr/bin/env python3
import zlib
from mmap import PAGESIZE


CHUNKSIZE = PAGESIZE


# This is a generator that yields *decompressed* chunks from
# a gzip file. This is also called a stream or lazy list.
# It's done like this to avoid having the whole file in memory.
# Read more about Python generators to understand how it works.
# cf. `yield` keyword.
def gzip_to_chunks(filename):
    decompressor = zlib.decompressobj(zlib.MAX_WBITS + 16)
    with open(filename, 'rb') as f:
        chunk = f.read(CHUNKSIZE)

        while chunk:
            out = decompressor.decompress(chunk)
            yield out
            chunk = f.read(CHUNKSIZE)

        out = decompressor.flush()

        yield out


# Again the following is a generator (see the `yield` keyword).
# What it does is iterate over an *iterable* of strings and yield
# rows from the file

# (hint: `gzip_to_chunks(filename)` returns a generator of strings)
# (hint: a generator is also an iterable)

# You can verify that by calling `chunks_to_rows` with a list of
# strings, where every string is a chunk of the VCF file.
# (hint: a list is also an iterable)

# inline doc follows
def chunks_to_rows(chunks):
    row = b''  # we will add the bytes making up a single row to this variable
    for chunk in chunks:  # iterate over the chunks yielded by gzip_to_chunks
        for char in chunk:  # iterate over every byte of the chunk (an int in Python 3)
            if char == b'\n'[0]:  # hey! this is the end of the row!
                yield row.decode('utf8').split('\t')  # the row is complete, yield!
                row = b''  # start a new row
            else:
                row += int.to_bytes(char, 1, byteorder='big')  # Otherwise we are in the middle of the row
        # at this point the program has read the whole chunk
    # at this point the program has read the whole file without ever loading it fully into memory
    # That said, there may still be something left in row
    if row:
        yield row.decode('utf-8').split('\t')  # yield the very last row if any


for e in chunks_to_rows(gzip_to_chunks('conceptnet-assertions-5.6.0.csv.gz')):
    uid, relation, start, end, metadata = e
    print(start, relation, end)
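
To tie this back to the VCF task from the question, a hypothetical way to consume these generators would be something like the sketch below (the file name comes from the question and the genotype pattern is a simplified version of its regex, so adapt both as needed):

# hypothetical adaptation to the question's VCF: stream the rows and keep one
# [homozygotes, heterozygotes] counter pair per individual
import re

genotype = re.compile(r'(\d)/(\d):')

rows = chunks_to_rows(gzip_to_chunks('subset_cols_chr18.vcf.gz'))
headers = next(rows)                   # the first yielded row is the header line
counts = {header: [0, 0] for header in headers}

for row in rows:
    for header, value in zip(headers, row):
        match = genotype.match(value)
        if match and match.group(1) != match.group(2):
            counts[header][1] += 1     # heterozygote, e.g. 0/1 or 1/0
        elif match and match.group(1) != '0':
            counts[header][0] += 1     # homozygote other than 0/0, e.g. 1/1

for header, (n_hom, n_het) in counts.items():
    print(header, 'homozygotes:', n_hom, 'heterozygotes:', n_het)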

EDIT: reworked the answer and made it work on ConceptNet's gzipped tsv file

amirouche
  • This certainly looks like something I could figure out (it does not look too complicated, I am still a beginner) so thanks a lot!! Why do you use that particular chunk size? And does this mean I get output for every chunk size? – visse226 Nov 14 '16 at 15:11
  • `CHUNKSIZE` is an arbitrary number; actually it's better if it matches `mmap.PAGESIZE`, I will update my answer with that. It's the number of characters that are read from the hard disk at once. The operating system uses `mmap.PAGESIZE` to read the file anyway... Say, for instance, `some_file.read(1)`: the OS will read (and cache) `mmap.PAGESIZE` chars from the hard disk and only return 1 character. The following call to `some_file.read(1)` will not hit the hard disk since the OS cached part of the file. Anyway, this is a detail. – amirouche Nov 14 '16 at 19:54
  • @visse226 Thanks for reminding me that you are a beginner; I will update the code with more comments. Let me know what you think. – amirouche Nov 14 '16 at 19:56
  • I will add some test cases for each function. Be aware that this is non-trivial code for a beginner, because you have probably never met generators before. Generators are pretty much required to handle this kind of problem (bigger than RAM) properly. – amirouche Nov 14 '16 at 20:16
  • Don't forget to upvote and/or accept the answer that solves your issue best ;) – amirouche Nov 14 '16 at 20:41
  • *And does this mean I get output for every chunk size?* Basically `CHUNKSIZE` is just how much of the gz file the program reads at a time. Regardless, it will read the whole file, and you will get the result for ALL of the file. – amirouche Nov 14 '16 at 20:42