We are new to Mongodb
, and would like to use it to insert genomic data (165M entries), and retrieve this data by genomic coordinates (ranges). Below is the type of data we store in a single table. Where column names are chrom,chromStart,chromEnd,datasetid,target,biotype
chr1 9903 10282 ENCSR440COG ZNF239 HEK293
chr1 9904 10252 ENCSR721QZV ZSCAN18 HEK293
chr1 9905 10132 ENCSR241LIH AFF1 K-562
chr1 9905 10238 ENCSR211GNP ZSCAN4 HEK293
chr1 9905 10241 ENCSR776LDJ ZNF645 HEK293
chr1 9905 10243 ENCSR042TWZ SNIP1 MCF-7
chr2 938173 938703 ENCSR000BUL MAX MCF-7
chr2 938174 938376 ENCSR108TYQ GATAD1 Hep-G2
chr3 938174 938412 ENCSR887MXT ZHX1 HeLa-S3
chr3 945236 945377 GSE46055 KDM5B SUM185_SHCTCF
chr4 945236 945488 ENCSR000BPU ETS1 A-549
chr4 945240 945501 GSE76494 CTCF HEK293
chr4 950008 951114 GSE67783 STAG1 HSPC
chr4 950013 950185 ENCSR000BQT TCF3 GM12878
chr4 950015 950797 ENCSR115BLD KDM1A Hep-G2
chr4 950024 950693 GSE88734 ZEB1 MIA-PaCa-2
chr4 950028 950565 ENCSR753GIA TARDBP HEK293T
The type of genomic ranges queries would be :
db.hsap_all_peaks.find({ chrom: "chr1", chromStart: {$gte: 9905}, chromEnd:{$lte: 10243}} ).count()
db.hsap_all_peaks.find({ chrom: "chr4", chromStart: {$gte: 950013}, chromEnd:{$lte: 950693}} ).pretty()
In the long run, we plan to queries on ranges but also on values like:
db.hsap_all_peaks.find({ chrom: "chr4", chromStart: {$gte: 950013}, chromEnd:{$lte: 950693}} , target: "KDM1A").pretty()
This is how we created indexes for the coordinates :
db.hsap_all_peaks.createIndex(
{chrom:1}
)
db.hsap_all_peaks.createIndex(
{chrom:1,chromStart:1,chromEnd:1}
)
However, the queries are very long to execute, and it seems that the indexes for chromStart
, chromEnd
are not working.
Hence my question: what would be the best way to create indexes here?
Extra information :
> db.hsap_all_peaks.getIndexes()
[
{
"v" : 2,
"key" : {
"_id" : 1
},
"name" : "_id_"
},
{
"v" : 2,
"key" : {
"chrom" : 1,
"chromStart" : 1,
"chromEnd" : 1
},
"name" : "chrom_1_chromStart_1_chromEnd_1"
},
{
"v" : 2,
"key" : {
"chrom" : 1
},
"name" : "chrom_1"
}
]
If you want to recreate a similar'ish table :
wget http://remap.univ-amu.fr/storage/remap2020/hg38/MACS2/remap2020_all_macs2_hg38_v1_0.bed.gz
gunzip remap2020_all_macs2_hg38_v1_0.bed.gz
mongoimport -d databaseName -c hsap_all_peaks --type tsv --file remap2020_all_macs2_hg38_v1_0.bed -f chrom,chromStart,chromEnd,name,score,strand,thickStart,thickEnd,itemRgb --numInsertionWorkers 2
Explain()
output for a classic query :
db.hsap_all_peaks.find({ chrom: "chr2", chromStart: {$gte: 50967094}, chromEnd:{$lte: 50970983} } ).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "remap2020.hsap_all_peaks",
"indexFilterSet" : false,
"parsedQuery" : {
"$and" : [
{
"chrom" : {
"$eq" : "chr2"
}
},
{
"chromEnd" : {
"$lte" : 50970983
}
},
{
"chromStart" : {
"$gte" : 50967094
}
}
]
},
"queryHash" : "2A452369",
"planCacheKey" : "C93EF492",
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"chrom" : 1,
"chromStart" : 1,
"chromEnd" : 1
},
"indexName" : "chrom_1_chromStart_1_chromEnd_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"chrom" : [ ],
"chromStart" : [ ],
"chromEnd" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"chrom" : [
"[\"chr2\", \"chr2\"]"
],
"chromStart" : [
"[50967094.0, inf.0]"
],
"chromEnd" : [
"[-inf.0, 50970983.0]"
]
}
}
},
"rejectedPlans" : [
{
"stage" : "FETCH",
"filter" : {
"$and" : [
{
"chromEnd" : {
"$lte" : 50970983
}
},
{
"chromStart" : {
"$gte" : 50967094
}
}
]
},
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"chrom" : 1
},
"indexName" : "chrom_1",
"isMultiKey" : false,
"multiKeyPaths" : {
"chrom" : [ ]
},
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 2,
"direction" : "forward",
"indexBounds" : {
"chrom" : [
"[\"chr2\", \"chr2\"]"
]
}
}
}
]
},
"serverInfo" : {
"host" : "sormiou.local",
"port" : 27017,
"version" : "4.4.3",
"gitVersion" : "913d6b62acfbb344dde1b116f4161360acd8fd13"
},
"ok" : 1
}