3

I am creating a demo pipeline to load a CSV file into BigQuery with Dataflow using my free Google account. This is what I am facing.

When I read from a GCS file and just log the data, it works perfectly. Below is my sample code.

This code runs okay:

DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject("project12345");
options.setStagingLocation("gs://mybucket/staging");
options.setRunner(DataflowRunner.class);
DataflowRunner.fromOptions(options);
Pipeline p = Pipeline.create(options);
p.apply(TextIO.read().from("gs://mybucket/charges.csv"))
        .apply(ParDo.of(new DoFn<String, Void>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                LOG.info(c.element());
            }
        }));

However, when I add a temp folder location with a path to a bucket I created, I get an error. Below is my code.


        LOG.debug("Starting Pipeline");
        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setProject("project12345");
        options.setStagingLocation("gs://mybucket/staging");
        options.setTempLocation("gs://project12345/temp");
        options.setJobName("csvtobq");
        options.setRunner(DataflowRunner.class);
    
        DataflowRunner.fromOptions(options);
        Pipeline p = Pipeline.create(options);

        boolean isStreaming = false;
        TableReference tableRef = new TableReference();
        tableRef.setProjectId("project12345");
        tableRef.setDatasetId("charges_data");
        tableRef.setTableId("charges_data_id");

        p.apply("Loading Data from GCS", TextIO.read().from("gs://mybucket/charges.csv"))
                .apply("Convert to BiqQuery Table Row", ParDo.of(new FormatForBigquery()))
                .apply("Write into Data in to Big Query",
                        BigQueryIO.writeTableRows().to(tableRef).withSchema(FormatForBigquery.getSchema())
                                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
                                .withWriteDisposition(isStreaming ? BigQueryIO.Write.WriteDisposition.WRITE_APPEND
                                        : BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE));

        p.run().waitUntilFinish();
    } 

When I run this, I get the following error

Exception in thread "main" java.lang.IllegalArgumentException: DataflowRunner requires gcpTempLocation, but failed to retrieve a value from PipelineOptions
    at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:242)
    at demobigquery.StarterPipeline.main(StarterPipeline.java:74)
Caused by: java.lang.IllegalArgumentException: Error constructing default value for gcpTempLocation: tempLocation is not a valid GCS path, gs://project12345/temp. 
    at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:247)
    at org.apache.beam.sdk.extensions.gcp.options.GcpOptions$GcpTempLocationFactory.create(GcpOptions.java:228)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.returnDefaultHelper(ProxyInvocationHandler.java:592)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.getDefault(ProxyInvocationHandler.java:533)
    at org.apache.beam.sdk.options.ProxyInvocationHandler.invoke(ProxyInvocationHandler.java:155)
    at com.sun.proxy.$Proxy15.getGcpTempLocation(Unknown Source)
    at org.apache.beam.runners.dataflow.DataflowRunner.fromOptions(DataflowRunner.java:240)

Is this an authentication issue? I am using JSON credentials as project owner from GCP via the Eclipse Dataflow plugin.

Any help would be highly appreciated.

IsaacK
  • Is your tempLocation a valid GCS URI? https://beam.apache.org/documentation/runners/dataflow/#pipeline-options – Christopher Jul 09 '19 at 11:08
  • A possible duplicate of your issue, although it's not clear why it was an authentication-related issue. https://stackoverflow.com/questions/43026371/apache-beam-minimalwordcount-example-with-dataflow-runner-on-eclipse/43026561 – Christopher Jul 09 '19 at 11:09
  • It's a valid URL; I can browse to the bucket I specified. – IsaacK Jul 09 '19 at 11:32

4 Answers

1

It looks like the error message is thrown from [1], and the default GCS path validator is implemented in [2]. As you can see, the Beam code also attaches the underlying cause to the IllegalArgumentException, so check further down the stack trace for the exception raised in GcsPathValidator.

[1] https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/options/GcpOptions.java#L278

[2] https://github.com/apache/beam/blob/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp/storage/GcsPathValidator.java#L29
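
For example, a minimal sketch (assuming the same options object built in the question) that walks the cause chain so the message raised inside GcsPathValidator becomes visible:

    // Sketch: print the full cause chain of the construction failure to see
    // the underlying reason GcsPathValidator rejected the tempLocation.
    try {
        DataflowRunner.fromOptions(options);
    } catch (IllegalArgumentException e) {
        for (Throwable cause = e; cause != null; cause = cause.getCause()) {
            System.err.println(cause.getClass().getName() + ": " + cause.getMessage());
        }
        throw e;
    }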

Rui Wang
1

There could be multiple reasons for this:

  1. You are not logged in with the right GCP project credentials - either the wrong user (or no user at all) is logged in, or you are logged into the wrong project.

    Ensure that the GOOGLE_APPLICATION_CREDENTIALS environment variable points to a key for the right user and project. If not, obtain the right credentials using

    gcloud auth application-default login

    Download the JSON key, point GOOGLE_APPLICATION_CREDENTIALS to the downloaded file, restart your system, and try again.

  2. You could be logging into the right project with the right user ID, but the requisite permissions for bucket access might be absent. Ensure that you have the following roles:

    • Storage Admin
    • Storage Legacy Bucket Owner
    • Storage Legacy Object Owner (Optional)
  3. The URL you are trying does not exist or is misspelled (a quick check for this and for point 2 is sketched below).
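
A minimal sketch, assuming the google-cloud-storage client library is on the classpath and that the temp bucket really is named project12345 as in the question, to confirm the bucket is visible to the credentials in GOOGLE_APPLICATION_CREDENTIALS:

    import com.google.cloud.storage.Bucket;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;

    public class CheckTempBucket {
        public static void main(String[] args) {
            // Uses the same application-default credentials that Dataflow picks up.
            Storage storage = StorageOptions.getDefaultInstance().getService();
            Bucket bucket = storage.get("project12345"); // bucket name from gs://project12345/temp
            if (bucket == null) {
                System.err.println("Bucket not found or not visible to these credentials");
            } else {
                System.out.println("Bucket is visible: " + bucket.getName());
            }
        }
    }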

Nishant
0

It can be related to the streaming option that you are setting. CSV uploads are automatically run as batch jobs, so trying to set the write up as streaming can cause problems.

If you insist on streaming, check out this documentation.
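
For reference, one way to make the batch path explicit on the question's write step (a sketch; FILE_LOADS and STREAMING_INSERTS are the standard BigQueryIO write methods):

    BigQueryIO.writeTableRows()
            .to(tableRef)
            .withSchema(FormatForBigquery.getSchema())
            // FILE_LOADS runs BigQuery load jobs (batch); STREAMING_INSERTS would stream rows instead.
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE);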

Sevki Baba
0

Perhaps you are missing permissions? Since Dataflow needs to create a folder structure (one new folder for each execution), you need Storage Admin and not just Storage Object Admin or Storage Object Creator.
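
A quick way to see which storage permissions the active account actually has on the temp bucket (a sketch, assuming the google-cloud-storage client library and the bucket name from the question):

    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import java.util.Arrays;
    import java.util.List;

    public class CheckBucketPermissions {
        public static void main(String[] args) {
            Storage storage = StorageOptions.getDefaultInstance().getService();
            // Returns true/false for each requested permission on the bucket.
            List<Boolean> granted = storage.testIamPermissions(
                    "project12345", // bucket from gs://project12345/temp
                    Arrays.asList("storage.buckets.get", "storage.objects.create", "storage.objects.get"));
            System.out.println("buckets.get, objects.create, objects.get: " + granted);
        }
    }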

Andreas L