How to read a vcf.gz file in Python?

Question

I have a file in the vcf.gz format (e.g. file_name.vcf.gz) and I need to read it in Python.

As I understand it, I first have to decompress the file and then read it. I found this solution, but unfortunately it doesn't work for me. Even the first line (bgzip file_name.vcf or tabix file_name.vcf.gz) fails with SyntaxError: invalid syntax when I run it in Python.

Could you help me please?

https://pyvcf.readthedocs.io/en/latest/ or https://github.com/brentp/cyvcf2 — user438383, Jun 10 '22 at 13:15

score 0 · Answer 1 · answered Jun 16 '22 at 18:18

Both cyvcf2 and PyVCF can read VCF files, but cyvcf2 is much faster and more actively maintained.

basesorbytes

score -2 · Answer 2 · edited May 06 '23 at 08:04

The best approach is to use a library that does this for you, as basesorbytes mentioned. However, if you want to write your own code, you could use this approach:


# Import libraries

import gzip
import io
import pandas as pd

class ReadFile():
    '''
    This class reads a VCF file
    and does some data manipulation.
    The output is the full data found
    in the input of this class;
    the filtering process happens
    in the following step.
    '''
    def __init__(self, file_path):
        '''
        This is the built-in constructor method
        '''
        self.file_path = file_path

    def load_data(self):
        '''
        1) Convert the VCF file into a data frame
           Read the header of the body dynamically and assign dtype
        '''

        # Open the gzip-compressed VCF in text mode and drop the '##' meta lines
        with gzip.open(self.file_path, 'rt') as f:
            lines = [l for l in f if not l.startswith('##')]

        # Identify the column-name line and save its names into a dict
        # with values as dtype. Search the lines already read: the file
        # handle is exhausted after the list comprehension above.
        dynamic_header_as_key = []
        for line in lines:
            if line.startswith('#CHROM'):
                dynamic_header_as_key = line.rstrip('\n').split('\t')
                break
        # Declare dtypes (QUAL is read as str because it can be '.' or a float);
        # columns beyond the first ten are left for pandas to infer
        values = [str, int, str, str, str, str, str, str, str, str]
        columns2detype = dict(zip(dynamic_header_as_key, values))

        vcf_df = pd.read_csv(
            io.StringIO(''.join(lines)),
            dtype=columns2detype,
            sep='\t'
        ).rename(columns={'#CHROM': 'CHROM'})

        return vcf_df
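
If you would rather avoid pandas, the same idea can be sketched with only the standard library. Note that bgzip and tabix are command-line tools, not Python statements, which is why typing them into Python gives a SyntaxError; no separate decompression step is needed because the gzip module reads .vcf.gz files directly. The function name iter_vcf_records below is my own illustrative choice, not part of any library, and the sketch assumes a tab-separated file with a #CHROM header line.

```python
import gzip


def iter_vcf_records(path):
    """Yield one dict per data line of a gzip-compressed VCF file.

    Hypothetical helper for illustration; keys are taken from the
    '#CHROM' header line, with the leading '#' stripped.
    """
    header = None
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Skip '##' meta-information lines and blank lines
            if line.startswith("##") or not line.strip():
                continue
            fields = line.rstrip("\n").split("\t")
            if line.startswith("#CHROM"):
                # Column-name line: remember the names for later records
                header = [name.lstrip("#") for name in fields]
                continue
            if header is None:
                raise ValueError("data line found before the #CHROM header")
            record = dict(zip(header, fields))
            # POS is the only column converted here; the rest stay strings
            record["POS"] = int(record["POS"])
            yield record
```

You could then loop over the file with something like: for rec in iter_vcf_records("file_name.vcf.gz"): print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"]). Because it is a generator, it reads one line at a time instead of loading the whole file into memory.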
