0

The Variant Schema used by Google Genomics Variant Transform pipelines represents genotypes as nested records in BigQuery - for example:

1000 Genomes variant examples with 4 genotypes

(from: https://bigquery.cloud.google.com/table/genomics-public-data:1000_genomes.variants?pli=1&tab=preview)

I'm having trouble understanding how to write queries that involve relationships between samples - such as:

select all variants where sampleA.genotype=HET and sampleB.genotype=HET and sampleC.genotype=HOM-ALT

or similar queries where sampleA and sampleB are parents of sampleC and you're looking for variants that follow a particular inheritance pattern.

How are people writing these queries with the nested schema?

user553965
  • 1,199
  • 14
  • 15
  • can you clarify what you mean by `sampleA.genotype=HET` - as I see from schema of table - genotype is a repeated integers!? – Mikhail Berlyant Aug 03 '18 at 00:14
  • sorry, I mean heterozygous genotype, so sampleA.genotype=[0, 1] (or equivalently [1, 0]) and HOM-ALT is sampleC.genotype=[1,1] – user553965 Aug 03 '18 at 00:17
  • what defines sampleA vs sampleB vs sampleC? – Mikhail Berlyant Aug 03 '18 at 00:24
  • It's based on meta-data about how these individuals are related to each other. For example, in the data above, HG00261 might be the father of NA12749, and HG00593 might be the mother. HG00150 might be the paternal grandmother, etc. – user553965 Aug 03 '18 at 00:30
  • but that's exactly what I am asking - you should describe the logic of identifying sampleA, sampleB and sampleC – Mikhail Berlyant Aug 03 '18 at 00:33
  • sorry, not sure I fully understand your question. To make it more concrete, I want to write a query that returns all variants that have the following nested records: at least one record with (call.call_set_name="HG00261" and call.genotype=[0,1]) and also at least one record with (call.call_set_name="HG00593" and call.genotype=[0,1]) and also at least one record with (call.call_set_name="NA12749 " and call.genotype=[1,1]) - which would mean this variant is present in the {HG00593, HG00261, NA12749} family and was transmitted from both parents to the child. – user553965 Aug 03 '18 at 00:41
  • i think you just answered my question :o) and thus made your question more specific. I am about to hit the road so most likely someone will answer your question before I will. But if not - i will when get to comp in few hours from now :o) – Mikhail Berlyant Aug 03 '18 at 00:52

1 Answers1

1

I think that would be something like below - have not tested as table is quite expensive - but one run gave zero output meaning that there is no records that meet that specific criteria - but at least you see the logic of how to do such query

SELECT * EXCEPT(cnt)
FROM (
    SELECT reference_name,  start, `end`,
        (SELECT COUNT(1) 
            FROM UNNEST(call) 
            WHERE (call_set_name="HG00261" AND genotype[SAFE_OFFSET(0)] = 0 AND genotype[SAFE_OFFSET(1)] = 1)
            OR (call_set_name="HG00593" AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 0)
            OR (call_set_name="NA12749 " AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 1)
        ) cnt
    FROM `genomics-public-data.1000_genomes.variants` 
)
WHERE cnt = 3 
Mikhail Berlyant
  • 165,386
  • 8
  • 154
  • 230
  • Thanks! I'm having some trouble with syntax errors, but the idea makes sense. – user553965 Aug 03 '18 at 04:07
  • ```SELECT * FROM ( SELECT reference_name, start, (SELECT COUNT(1) FROM UNNEST(call) WHERE (call_set_name="HG00261" AND genotype[SAFE_OFFSET(0)] = 0 AND genotype[SAFE_OFFSET(1)] = 1) OR (call_set_name="HG00593" AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 0) OR (call_set_name="NA12749" AND genotype[SAFE_OFFSET(0)] = 1 AND genotype[SAFE_OFFSET(1)] = 1) ) cnt FROM [genomics-public-data:1000_genomes.variants] ) WHERE cnt = 3 ``` shows ```Encountered " "]" "[SAFE_OFFSET(0)] ""... Was expecting: ")" .``` – user553965 Aug 03 '18 at 04:08
  • instead of [genomics-public-data:1000_genomes.variants] - should be \`genomics-public-data.1000_genomes.variants` . also be aware of cost of this query – Mikhail Berlyant Aug 03 '18 at 04:09
  • tried it, but still getting ```Error: Encountered " "]" "[SAFE_OFFSET(0)] "" at line 5, column 56. Was expecting: ")" ... [Try using standard SQL (https://cloud.google.com/bigquery/docs/reference/standard-sql/enabling-standard-sql)]``` – user553965 Aug 03 '18 at 04:11
  • i see, you need to make sure you run it under STandard SQL – Mikhail Berlyant Aug 03 '18 at 04:12
  • put `#standardSQL` as a first line – Mikhail Berlyant Aug 03 '18 at 04:13
  • or just make sure it is set in Options – Mikhail Berlyant Aug 03 '18 at 04:13