
I have loaded a variant set into Cloud Genomics and am attempting to export it to BigQuery. The first approach I tried was the pipeline detailed here:

https://cloud.google.com/genomics/docs/how-tos/load-variants
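
For reference, my invocation looked roughly like this (bucket, dataset, and table names are placeholders):

# run from a checkout of gcp-variant-transforms
python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern "gs://my-bucket/*.vcf" \
  --output_table my-project:my_dataset.new_table \
  --project my-project \
  --temp_location "gs://my-bucket/temp" \
  --runner DataflowRunner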

However, 20 minutes into the process, it failed. According to Stackdriver Error Reporting, the problem appears to be in the VCF file itself, though I am at a loss to explain how it might be fixed:

ValueError: Invalid record in VCF file. Error: list index out of range
at next (/usr/local/lib/python2.7/dist-packages/gcp_variant_transforms/beam_io/vcfio.py:476)
at read_records (/usr/local/lib/python2.7/dist-packages/gcp_variant_transforms/beam_io/vcfio.py:398)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:48)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:44)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:39)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:38)
at execute (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:167)
at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:609)

So I continued to search for other options. I turned to the API:

https://cloud.google.com/genomics/reference/rest/v1/variantsets/export

I made sure that my account was a BigQuery admin and an owner of the Genomics variant set. I used the following parameters:

{
  "projectId": "my-project",
  "format": "FORMAT_BIGQUERY",
  "bigqueryDataset": "my_dataset",
  "bigqueryTable": "new_table"
}
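
For reference, the equivalent raw REST call (with VARIANTSET_ID as a placeholder for my actual variant set ID) would be something like:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"projectId": "my-project", "format": "FORMAT_BIGQUERY", "bigqueryDataset": "my_dataset", "bigqueryTable": "new_table"}' \
  "https://genomics.googleapis.com/v1/variantsets/VARIANTSET_ID:export"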

Upon submitting, I receive the following error:

{
  "error": {
    "code": 500,
    "message": "Unknown Error.",
    "status": "UNKNOWN"
  }
}

I have also tried this from the command line:

gcloud alpha genomics variantsets export variantset_id bigquery_table --bigquery-dataset=my_dataset --bigquery-project=my-project

But that gives me a 500 Unknown Error as well. I've been going back and forth on this for several hours, and the documentation is quite sparse.

Please, what could I be missing?

  • When running the pipeline, you should be able to see the error logs in the Dataflow Console. Does it work with a [public dataset](https://cloud.google.com/genomics/docs/public-datasets/illumina-platinum-genomes)? – Guillem Xercavins May 13 '18 at 10:15
  • A public dataset fails as well. I took a look at the Dataflow Console error logs. There are four steps, each of them labeled "Failed": ReadfromVcf, FilterVariants, ProcessVaraints, VarianttoBigQuery. Clicking on each of them displays "No entries found for selected log." The spelling of "ProcessVaraints" is not my typo, but Google's. – EVMPMOArchitect May 13 '18 at 14:13
  • I have updated my original post to include a StackDriver error. The problem is in the VCF file, itself. If you have any idea how this might be resolved, please let me know. – EVMPMOArchitect May 13 '18 at 14:22

2 Answers


Thanks for asking this question. We deprecated the Variants API six months ago because we found that the #1 thing people did with it was BQ export.

So, we released a brand-new FOSS tool, Variant Transforms, that does exactly this task and is more performant.

https://github.com/googlegenomics/gcp-variant-transforms

We actually just had a new release this week. Please take a look and let us know what you think.

In addition to the code & docs, you'll see much of our product roadmap there, too.

Please comment and share your thoughts!

FYI, we'll be decommissioning the Variants API soon.

Jonathan (PM, biomedical data, Google Cloud)

  • Thank you for the clarification. Unfortunately, the FOSS tool does not work (see above). Dataflow Console logs show "Failed" at every step of the process. – EVMPMOArchitect May 13 '18 at 14:14

It looks like one or more lines in the VCF file are malformed and do not conform to the spec.

We just released a preprocessor/validator tool that produces a report of all such malformed records. Please give it a try: https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docs/vcf_files_preprocessor.md (please run it with --report_all_conflicts to ensure you get the full report).
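
A minimal invocation would be something like the following (paths are placeholders and the flag names are as I recall them from the linked doc, so please treat this as a sketch and check the doc for the authoritative list):

python -m gcp_variant_transforms.vcf_to_bq_preprocess \
  --input_pattern "gs://my-bucket/*.vcf" \
  --report_path "gs://my-bucket/preprocess_report.tsv" \
  --report_all_conflicts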

If it turns out that only a few records are malformed, then you can either fix them manually in the VCF file or run the vcf_to_bq pipeline with --allow_malformed_records, which skips the malformed records (it just logs them) and loads the rest.
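
For the second option, a sketch of the load command, reusing the placeholder names from your question:

python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern "gs://my-bucket/*.vcf" \
  --output_table my-project:my_dataset.new_table \
  --project my-project \
  --temp_location "gs://my-bucket/temp" \
  --runner DataflowRunner \
  --allow_malformed_records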