Questions tagged [vcf-variant-call-format]

The Variant Call Format (VCF) specifies the format of a text file used in bioinformatics for storing gene sequence variations. Do not use this tag for the vCard file format, the standard for electronic business cards. The documentation of this format can be found here: https://samtools.github.io/hts-specs/VCFv4.2.pdf

From Wikipedia (Variant Call Format): The format was developed with the advent of large-scale genotyping and DNA sequencing projects, such as the 1000 Genomes Project.

The Variant Call Format (VCF) Version 4.2 Specification, 25 Jun 2020 (pdf); the master version of this document can be found at https://github.com/samtools/hts-specs.

The VCF specification

VCF is a text file format (most likely stored in a compressed manner). It contains meta-information lines, a header line, and then data lines each containing information about a position in the genome. The format also has the ability to contain genotype information on samples for each position.
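The three kinds of lines described above can be told apart by their prefix. The following is a minimal sketch in Python, not a full VCF parser: the sample text, the function names `parse_vcf` and `parse_info`, and the sample column `sample1` are made up for illustration, and it assumes tab-separated fields and an uncompressed text handle.

```python
import io

# Hypothetical, hand-written sample: two meta-information lines,
# one header line, and one data line (tab-separated).
SAMPLE_VCF = """\
##fileformat=VCFv4.2
##contig=<ID=1>
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1
1\t12345\trs1\tA\tG\t50\tPASS\tDP=26;GN=NOC2L\tGT\t0/1
"""


def parse_vcf(handle):
    """Split a VCF text stream into meta lines, header columns, and records."""
    meta, header, records = [], None, []
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("##"):
            # Meta-information line
            meta.append(line)
        elif line.startswith("#"):
            # Header line: drop the leading '#' and split into column names
            header = line[1:].split("\t")
        else:
            # Data line: one position in the genome
            if header is None:
                raise ValueError("data line found before the #CHROM header")
            records.append(dict(zip(header, line.split("\t"))))
    return meta, header, records


def parse_info(info):
    """Turn an INFO string such as 'DP=26;GN=NOC2L' into a dict."""
    result = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        # Keys without '=' are treated as boolean flags
        result[key] = value if sep else True
    return result


meta, header, records = parse_vcf(io.StringIO(SAMPLE_VCF))
info = parse_info(records[0]["INFO"])
```

For a compressed file, the same function should work on a handle opened in text mode with the standard library, for example `gzip.open(path, "rt")`. For real analyses, a dedicated VCF library is likely a better choice than this sketch.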

195 questions
20
votes
6 answers

R extract part of string

I have a question about extracting a part of a string. For example, I have a string like this: a <- "DP=26;AN=2;DB=1;AC=1;MQ=56;MZ=0;ST=5:10,7:2;CQ=SYNONYMOUS_CODING;GN=NOC2L;PA=1^1:0.720&2^1:0" I need to extract everything between GN= and ;. So here…
Lisann
13
votes
1 answer

Extract sample data from VCF files

I have a large Variant Call Format (VCF) file (> 4 GB) which has data for several samples. I have browsed Google and Stack Overflow, and tried the VariantAnnotation package in R, to somehow extract data only for a particular sample, but have not…
rokosir
9
votes
3 answers

How to read vcf file in R

I have this VCF file, and I want to read it in R. However, the file contains some redundant lines which I want to skip. I want the result to start at the row matching #CHROM. This is what I have…
MAPK
8
votes
1 answer

Read table with comment lines starting with "##"

I'm struggling to read my tables in Variant Call Format (VCF) with R. Each file has some comment lines starting with ##, and then the header starting with #. ## contig= ## contig= #CHROM POS ID REF ALT…
Slavskii Sergei
6
votes
2 answers

How to aggregate values over a bigger than RAM gzip'ed csv file?

For starters, I am new to bioinformatics and especially to programming, but I have built a script that will go through a so-called VCF file (only the individuals are included, one column = one individual), and uses a search string to find out for…
visse226
6
votes
2 answers

How to convert vcf file to ped file using plink?

I am trying to convert a .vcf file to a .ped file using plink. I have read some manuals and posts online, but it seems that no one specifically mentions how to convert vcf to ped. I am hoping that there may be some experts here who have experience…
NeverBe
5
votes
4 answers

How do I remove duplicated SNPs using PLink?

I am working with PLINK to analyse genome-wide data. Does anyone know how to remove duplicated SNPs?
user1236418
5
votes
2 answers

How to combine all chromosomes in a single file

I downloaded 1000 Genomes data (chromosomes 1-22), which is in VCF format. How can I combine all chromosomes into a single file? Should I first convert all chromosomes into plink binary files and then do the --bmerge mmerge-list? Or is there any…
bha
4
votes
0 answers

VCF4.2 file not recognised by GATK

I've seen a lot of people having the same problem, but I haven't found a solution yet. I have supplied 24 VCF4.1 files (http://evs.gs.washington.edu/EVS/) to GATK's CombineVariants. I get this error: ##### ERROR MESSAGE: Invalid command line: No tribble type was…
4
votes
3 answers

construct DNA sequence based on variation and human reference

The 1000 Genomes Project provides information about the "variation" of thousands of people's DNA sequences against the human reference DNA sequence. The variation is stored in VCF file format. Basically, for each person in that project, we can get his/her…
JRH
3
votes
2 answers

How to read a vcf.gz file in Python?

I have a file in the vcf.gz format (e.g. file_name.vcf.gz) - and I need to read it somehow in Python. I understood that first I have to decompress it and then to read it. I found this solution, but it doesn't work for me unfortunately. Even for the…
Elisa L.
3
votes
1 answer

system commands in future/promises in Rshiny

I have the below server.R code in a Shiny app where a system command is run inside future which gives an output.vcf file. Upon creation of this file the progress bar is removed and a second system command is run to convert out.vcf to out.txt. The…
chas
3
votes
1 answer

Running with IDLE vs running the script

So I have some Python code (running Python 2.7.12), which uses VEP to annotate a vcf file against specific transcripts. When I run the script by double clicking on it (or running it from command prompt) it gives the following…
KJTHoward
3
votes
2 answers

How to speedup bulk importing into google cloud datastore with multiple workers?

I have an apache-beam based dataflow job to read using vcf source from a single text file (stored in google cloud storage), transform text lines into datastore Entities and write them into the datastore sink. The workflow works fine but the cons I…
3
votes
2 answers

Use wildcard on params

I try to use one tool and I need to use a wildcard present on input. This is an example: aDict = {"120":"121" } #tumor : normal rule all: input: expand("{case}.mutect2.vcf",case=aDict.keys()) def get_files_somatic(wildcards): case =…
mau_who