
I have loaded a variant set into Cloud Genomics and am attempting to export it to BigQuery. The first approach I tried was the pipeline detailed here:

https://cloud.google.com/genomics/docs/how-tos/load-variants
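
For reference, my invocation looked roughly like this (bucket, dataset, and table names are placeholders):

# run from a checkout of gcp-variant-transforms
python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern "gs://my-bucket/*.vcf" \
  --output_table my-project:my_dataset.new_table \
  --project my-project \
  --temp_location "gs://my-bucket/temp" \
  --runner DataflowRunner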

However, 20 minutes into the process, it failed. According to Stackdriver Error Reporting, the problem appears to be in the VCF file itself, though I am at a loss to explain how it might be fixed:

ValueError: Invalid record in VCF file. Error: list index out of range
at next (/usr/local/lib/python2.7/dist-packages/gcp_variant_transforms/beam_io/vcfio.py:476)
at read_records (/usr/local/lib/python2.7/dist-packages/gcp_variant_transforms/beam_io/vcfio.py:398)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:48)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:44)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:39)
at dataflow_worker.native_operations.NativeReadOperation.start (native_operations.py:38)
at execute (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:167)
at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:609)

So I continued to search for other options. I turned to the API:

https://cloud.google.com/genomics/reference/rest/v1/variantsets/export

I made sure that my account was a BigQuery admin and an owner of the Genomics variant set. I used the following parameters:

{
  "projectId": "my-project",
  "format": "FORMAT_BIGQUERY",
  "bigqueryDataset": "my_dataset",
  "bigqueryTable": "new_table"
}
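
For reference, the equivalent raw REST call (with VARIANTSET_ID as a placeholder for my actual variant set ID) would be something like:

curl -X POST \
  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
  -H "Content-Type: application/json" \
  -d '{"projectId": "my-project", "format": "FORMAT_BIGQUERY", "bigqueryDataset": "my_dataset", "bigqueryTable": "new_table"}' \
  "https://genomics.googleapis.com/v1/variantsets/VARIANTSET_ID:export"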

Upon submitting, I receive the following error:

{
  "error": {
    "code": 500,
    "message": "Unknown Error.",
    "status": "UNKNOWN"
  }
}

I have also tried this from the command line:

gcloud alpha genomics variantsets export variantset_id bigquery_table --bigquery-dataset=my_dataset --bigquery-project=my-project

But that gives me a 500 Unknown Error as well. I've been going back and forth on this for several hours, and the documentation is quite sparse.

Please, what could I be missing?

  • When running the pipeline, you should be able to see the error logs in the Dataflow Console. Does it work with a [public dataset](https://cloud.google.com/genomics/docs/public-datasets/illumina-platinum-genomes)? – Guillem Xercavins May 13 '18 at 10:15
  • A public dataset fails as well. I took a look at the Dataflow Console error logs. There are four steps, each of them labeled "Failed": ReadfromVcf, FilterVariants, ProcessVaraints, VarianttoBigQuery. Clicking on each of them displays "No entries found for selected log." The spelling of "ProcessVaraints" is not my typo, but Google's. – EVMPMOArchitect May 13 '18 at 14:13
  • I have updated my original post to include a StackDriver error. The problem is in the VCF file, itself. If you have any idea how this might be resolved, please let me know. – EVMPMOArchitect May 13 '18 at 14:22

2 Answers


Thanks for asking this question. We deprecated the Variants API six months ago because we found that the #1 thing people did with it was BQ export.

So, we released a brand-new FOSS tool, Variant Transforms, that does exactly this task and is more performant.

https://github.com/googlegenomics/gcp-variant-transforms

We actually just had a new release this week. Please take a look and let us know what you think.

In addition to the code & docs, you'll see much of our product roadmap there, too.

Please comment and share your thoughts!

FYI, we'll be decommissioning the Variants API soon.

Jonathan (PM, biomedical data, Google Cloud)

  • Thank you for the clarification. Unfortunately, the FOSS tool does not work (see above). Dataflow Console logs show "Failed" at every step of the process. – EVMPMOArchitect May 13 '18 at 14:14

It looks like one or more lines in the VCF file are malformed and do not conform to the spec.

We just released a preprocessor/validator tool that produces a report of all such malformed records. Please give it a try: https://github.com/googlegenomics/gcp-variant-transforms/blob/master/docs/vcf_files_preprocessor.md (please run it with --report_all_conflicts to ensure you get the full report).
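
A minimal invocation would be something like the following (paths are placeholders and the flag names are as I recall them from the linked doc, so please treat this as a sketch and check the doc for the authoritative list):

python -m gcp_variant_transforms.vcf_to_bq_preprocess \
  --input_pattern "gs://my-bucket/*.vcf" \
  --report_path "gs://my-bucket/preprocess_report.tsv" \
  --report_all_conflicts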

If it turns out that only a few records are malformed, then you can either fix them manually in the VCF file or run the vcf_to_bq pipeline with --allow_malformed_records, which skips the malformed records (it just logs them) and loads the rest.
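
For the second option, a sketch of the load command, reusing the placeholder names from your question:

python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern "gs://my-bucket/*.vcf" \
  --output_table my-project:my_dataset.new_table \
  --project my-project \
  --temp_location "gs://my-bucket/temp" \
  --runner DataflowRunner \
  --allow_malformed_records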