Disclosure: I work full-time on the Hail project.
Hi there! Welcome to programming and bioinformatics!
amirouche correctly identifies that you need some sort of "streaming" or
"line-by-line" algorithm to handle data that is too large to fit in your
machine's RAM. Unfortunately, if you are limited to Python without libraries,
you have to chunk the file manually and parse the VCF yourself.
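If you do go that route, here is a minimal sketch of the streaming approach (my own illustration, assuming diploid GT calls appear as the first FORMAT field; it holds only per-sample counters in memory, never the whole file):

import gzip
from collections import Counter

def per_sample_counts(path):
    # Stream the VCF one line at a time; memory stays proportional
    # to the number of samples, not the size of the file.
    opener = gzip.open if path.endswith(('.gz', '.bgz')) else open
    samples = []
    n_called, n_hom, n_het = Counter(), Counter(), Counter()
    with opener(path, 'rt') as f:
        for line in f:
            if line.startswith('##'):
                continue  # skip meta-information lines
            fields = line.rstrip('\n').split('\t')
            if line.startswith('#CHROM'):
                samples = fields[9:]  # sample ids from the header line
                continue
            for sample, call in zip(samples, fields[9:]):
                gt = call.split(':', 1)[0]  # GT is the first FORMAT field
                alleles = gt.replace('|', '/').split('/')
                if '.' in alleles:
                    continue  # missing genotype
                n_called[sample] += 1
                if len(set(alleles)) == 1:
                    n_hom[sample] += 1
                else:
                    n_het[sample] += 1
    return samples, n_called, n_hom, n_het

That keeps memory flat, but it is single-threaded and re-parses text on every run, which is exactly the problem Hail addresses.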
The Hail project is a free, open-source tool for scientists with genetic data
ranging from too big to fit in RAM up to too big to fit on one machine
(i.e. tens of terabytes of compressed VCF data). Hail can take advantage
of all the cores on one machine or all the cores on a cloud of machines. Hail
runs on Mac OS X and most flavors of GNU/Linux. Hail exposes a statistical
genetics domain-specific language which makes your question much shorter to
express.
The Simplest Answer
The most faithful translation of your Python code to Hail is this:
/path/to/hail importvcf -f YOUR_FILE.vcf.gz \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
I ran the above command on my dual-core laptop on a 2.0GB file:
# ls -alh profile225.vcf.bgz
-rw-r--r-- 1 dking 1594166068 2.0G Aug 25 15:43 profile225.vcf.bgz
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:=======================================================> (63 + 2) / 65]hail: info: Coerced sorted dataset
hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 1:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: while importing:
file:/Users/dking/projects/hail-data/profile225.vcf.bgz import clean
hail: info: timing:
importvcf: 34.211s
annotatesamples expr: 6m52.4s
annotatesamples expr: 21.399ms
exportsamples: 121.786ms
total: 7m26.8s
# head sampleInfo.tsv
sample pHomRef pHet nHom nHet nCalled
HG00096 9.49219e-01 5.07815e-02 212325 11359 223684
HG00097 9.28419e-01 7.15807e-02 214035 16502 230537
HG00099 9.27182e-01 7.28184e-02 211619 16620 228239
HG00100 9.19605e-01 8.03948e-02 214554 18757 233311
HG00101 9.28714e-01 7.12865e-02 214283 16448 230731
HG00102 9.24274e-01 7.57260e-02 212095 17377 229472
HG00103 9.36543e-01 6.34566e-02 209944 14225 224169
HG00105 9.29944e-01 7.00564e-02 214153 16133 230286
HG00106 9.25831e-01 7.41687e-02 213805 17128 230933
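As a quick sanity check, the proportions in each row follow directly from the counts; for the first row (HG00096):

# pHom = nHom / nCalled, pHet = nHet / nCalled
nHom, nHet, nCalled = 212325, 11359, 223684
print(nHom / float(nCalled))  # 0.949219..., matching 9.49219e-01
print(nHet / float(nCalled))  # 0.050781..., matching 5.07815e-02

Note also that nHom + nHet = nCalled, since every called genotype is either homozygous or heterozygous.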
Wow! Seven minutes for 2GB, that's slow! Unfortunately, this is because VCFs
aren't a great format for data analysis!
Optimizing the Storage Format
Let's convert to Hail's optimized storage format, a VDS, and re-run the command:
# ../hail/build/install/hail/bin/hail importvcf -f profile225.vcf.bgz write -o profile225.vds
hail: info: running: importvcf -f profile225.vcf.bgz
[Stage 0:========================================================>(64 + 1) / 65]hail: info: Coerced sorted dataset
hail: info: running: write -o profile225.vds
[Stage 1:> (0 + 4) / 65]
[Stage 1:========================================================>(64 + 1) / 65]
# ../hail/build/install/hail/bin/hail read -i profile225.vds \
annotatesamples expr -c \
'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()' \
annotatesamples expr -c \
'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled' \
exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 1:> (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================> (3 + 1) / 4]hail: info: running: annotatesamples expr -c 'sa.nCalled = gs.filter(g => g.isCalled).count(),
sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
sa.nHet = gs.filter(g => g.isHet).count()'
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.pHom = sa.nHom / sa.nCalled,
sa.pHet = sa.nHet / sa.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.*' -o sampleInfo.tsv
hail: info: timing:
read: 2.969s
annotatesamples expr: 1m20.4s
annotatesamples expr: 21.868ms
exportsamples: 151.829ms
total: 1m23.5s
About five times faster (7m26.8s down to 1m23.5s)! As for larger scale: running the same command on Google Cloud, on a VDS representing the full VCF of the 1000 Genomes Project (2,535 whole genomes, about 315GB compressed), took 3m42s using 328 worker cores.
Using a Hail Built-in
Hail also has a sampleqc
command which computes most of what you want (and
more!):
../hail/build/install/hail/bin/hail read -i profile225.vds \
sampleqc \
annotatesamples expr -c \
'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
sa.myqc.pHet = sa.qc.nHet / sa.qc.nCalled' \
exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: running: read -i profile225.vds
[Stage 0:> (0 + 0) / 4]SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
[Stage 1:============================================> (3 + 1) / 4]hail: info: running: sampleqc
[Stage 2:========================================================>(64 + 1) / 65]hail: info: running: annotatesamples expr -c 'sa.myqc.pHomRef = (sa.qc.nHomRef + sa.qc.nHomVar) / sa.qc.nCalled,
sa.myqc.pHet= sa.qc.nHet / sa.qc.nCalled'
hail: info: running: exportsamples -c 'sample = s, sa.myqc.*, nHom = sa.qc.nHomRef + sa.qc.nHomVar, nHet = sa.qc.nHet, nCalled = sa.qc.nCalled' -o sampleInfo.tsv
hail: info: timing:
read: 2.928s
sampleqc: 1m27.0s
annotatesamples expr: 229.653ms
exportsamples: 353.942ms
total: 1m30.5s
Installing Hail
Installing Hail is pretty easy and we have docs to help
you. Need more help? You can get
real-time support in the Hail users chat room or, if you prefer forums, the Hail
discourse (both are linked from the home page; unfortunately, I don't have
enough reputation to create real links).
The Near Future
In the very near future (less than one month from today), the Hail team will
complete a Python API which will allow you to express the first snippet as:
result = (importvcf("YOUR_FILE.vcf.gz")
          .annotatesamples('''sa.nCalled = gs.filter(g => g.isCalled).count(),
                              sa.nHom = gs.filter(g => g.isHomRef || g.isHomVar).count(),
                              sa.nHet = gs.filter(g => g.isHet).count()''')
          .annotatesamples('''sa.pHom = sa.nHom / sa.nCalled,
                              sa.pHet = sa.nHet / sa.nCalled'''))

for x in result.sampleannotations:
    print("Sample " + str(x) +
          " nCalled " + str(x.nCalled) +
          " nHom " + str(x.nHom) +
          " nHet " + str(x.nHet) +
          " percent Hom " + str(x.pHom * 100) +
          " percent Het " + str(x.pHet * 100))

result.sampleannotations.write("sampleInfo.tsv")
EDIT: Added the output of head on the tsv file.
EDIT2: The latest Hail no longer requires biallelic variants for sampleqc.
EDIT3: Added a note about scaling to the cloud with hundreds of cores.