
I'm trying to write a Dataflow pipeline that reads a stream from Pub/Sub and writes it into BigQuery.

When trying to run the pipeline I get the error "Unable to create a side-input view from input" with the stack trace:

Exception in thread "main" java.lang.IllegalStateException: Unable to create a side-input view from input
at com.google.cloud.dataflow.sdk.transforms.View$AsIterable.validate(View.java:277)
at com.google.cloud.dataflow.sdk.transforms.View$AsIterable.validate(View.java:268)
at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:366)
at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:274)
at com.google.cloud.dataflow.sdk.values.PCollection.apply(PCollection.java:161)
at com.google.cloud.dataflow.sdk.io.Write$Bound.createWrite(Write.java:214)
at com.google.cloud.dataflow.sdk.io.Write$Bound.apply(Write.java:79)
at com.google.cloud.dataflow.sdk.io.Write$Bound.apply(Write.java:68)
at com.google.cloud.dataflow.sdk.runners.PipelineRunner.apply(PipelineRunner.java:74)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.apply(DirectPipelineRunner.java:247)
at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:367)
at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:290)
at com.google.cloud.dataflow.sdk.values.PCollection.apply(PCollection.java:174)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$Bound.apply(BigQueryIO.java:1738)
at com.google.cloud.dataflow.sdk.io.BigQueryIO$Write$Bound.apply(BigQueryIO.java:1440)
at com.google.cloud.dataflow.sdk.runners.PipelineRunner.apply(PipelineRunner.java:74)
at com.google.cloud.dataflow.sdk.runners.DirectPipelineRunner.apply(DirectPipelineRunner.java:247)
at com.google.cloud.dataflow.sdk.Pipeline.applyInternal(Pipeline.java:367)
at com.google.cloud.dataflow.sdk.Pipeline.applyTransform(Pipeline.java:274)
at com.google.cloud.dataflow.sdk.values.PCollection.apply(PCollection.java:161)
at co.uk.bubblestudent.dataflow.StarterPipeline.main(StarterPipeline.java:116)
Caused by: java.lang.IllegalStateException: GroupByKey cannot be applied to non-bounded PCollection in the GlobalWindow without a trigger. Use a Window.into or Window.triggering transform prior to GroupByKey.
at com.google.cloud.dataflow.sdk.transforms.GroupByKey.applicableTo(GroupByKey.java:192)
at com.google.cloud.dataflow.sdk.transforms.View$AsIterable.validate(View.java:275)
... 20 more

My code is:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.joda.time.Duration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.PubsubIO;
import com.google.cloud.dataflow.sdk.options.DataflowPipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;
import com.google.cloud.dataflow.sdk.values.PCollection;

public class StarterPipeline {

  public static final Duration ONE_DAY = Duration.standardDays(1);
  public static final Duration ONE_HOUR = Duration.standardHours(1);
  public static final Duration TEN_SECONDS = Duration.standardSeconds(10);
  private static final Logger LOG = LoggerFactory.getLogger(StarterPipeline.class);

  // Builds the BigQuery schema for the destination table.
  private static TableSchema schemaGen() {
    List<TableFieldSchema> fields = new ArrayList<>();
    fields.add(new TableFieldSchema().setName("facebookID").setType("STRING"));
    fields.add(new TableFieldSchema().setName("propertyID").setType("STRING"));
    fields.add(new TableFieldSchema().setName("time").setType("TIMESTAMP"));
    return new TableSchema().setFields(fields);
  }

  public static void main(String[] args) {
    LOG.info("Starting");
    DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
    LOG.info("Pipeline made");
    // For Cloud execution, set the Cloud Platform project, staging location,
    // and specify DataflowPipelineRunner or BlockingDataflowPipelineRunner.
    options.setProject(<project>);
    options.setStagingLocation(<bucket>);
    options.setTempLocation(<bucket>);
    Pipeline p = Pipeline.create(options);

    TableSchema schema = schemaGen();
    LOG.info("Schema made");
    try {
      LOG.info(schema.toPrettyString());
    } catch (IOException e) {
      e.printStackTrace();
    }

    // Read the (unbounded) stream of messages from the Pub/Sub subscription.
    PCollection<String> input =
        p.apply(PubsubIO.Read.named("ReadFromPubsub").subscription(<subscription>));

    // Convert each message into a BigQuery row.
    PCollection<TableRow> pardo = input.apply(ParDo.of(new FormatAsTableRowFn()));
    LOG.info("Formatted Row");

    pardo.apply(BigQueryIO.Write.named("Write into BigQuery").to(<table>)
        .withSchema(schema)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
    LOG.info("about to run");
    p.run();
  }

  static class FormatAsTableRowFn extends DoFn<String, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
      LOG.info("Formatting");
      String json = c.element();

      //HashMap<String,String> items = new Gson().fromJson(json, new TypeToken<HashMap<String, String>>(){}.getType());

      // Make a BigQuery row from the JSON object (hard-coded values for now):
      TableRow row = new TableRow()
          .set("facebookID", "324234")
          .set("propertyID", "23423")
          .set("time", "12312313123");

      /*
       * TableRow row = new TableRow()
       *     .set("facebookID", items.get("facebookID"))
       *     .set("propertyID", items.get("propertyID"))
       *     .set("time", items.get("time"));
       */
      c.output(row);
    }
  }
}

Any suggestions on what this might be?

  • What version of the Dataflow SDK are you using? – danielm Aug 03 '16 at 20:09
  • I don't believe there was a 1.1.2 release. Dataflow is now up to 1.6.0; can you try with that? – danielm Aug 03 '16 at 20:18
  • I believe 1.6.0 is the Java SDK. If you look at https://cloud.google.com/dataflow/release-notes/eclipse the Eclipse plugin is on 1.1.2 – Adam Brocklehurst Aug 03 '16 at 20:21
  • For googlers: if your BigQuery write stopped but the Dataflow job otherwise works as usual, you might have uncaught exceptions in the pipeline; because of them BigQuery stops writing new records to the table but no additional exception is shown. I thought an exception on one record would not break the whole pipeline, but it does. – halil May 19 '17 at 17:02
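Following up on that last comment, a hedged sketch of what guarding the DoFn against bad records might look like; the try/catch, the log-and-drop behaviour, and the malformed-record scenario are assumptions for illustration, not part of the original code:

static class FormatAsTableRowFn extends DoFn<String, TableRow> {
  @Override
  public void processElement(ProcessContext c) {
    try {
      // Build the BigQuery row from the Pub/Sub message as before.
      TableRow row = new TableRow()
          .set("facebookID", "324234")
          .set("propertyID", "23423")
          .set("time", "12312313123");
      c.output(row);
    } catch (Exception e) {
      // Log and drop the bad record instead of letting the exception
      // escape the DoFn and stall the whole pipeline.
      LOG.error("Skipping malformed record: " + c.element(), e);
    }
  }
}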

1 Answer


The default implementation of BigQueryIO only works over bounded PCollections, and PubsubIO.Read produces an unbounded PCollection.

There are two ways to fix this: you can bound the input by calling maxReadTime or maxNumElements on your PubsubIO transform, or you can use the streaming insert type of BigQueryIO by calling setStreaming(true) on your options.
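For the second option, a minimal sketch of how it could be applied to the pipeline in the question (assuming the Dataflow Java SDK 1.x classes used above, with the asker's placeholders left as-is):

DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
options.setProject(<project>);
options.setStagingLocation(<bucket>);
options.setTempLocation(<bucket>);
// Run in streaming mode so BigQueryIO.Write uses streaming inserts instead of
// the bounded path that needs a side-input view over a bounded PCollection.
options.setStreaming(true);
Pipeline p = Pipeline.create(options);

p.apply(PubsubIO.Read.named("ReadFromPubsub").subscription(<subscription>))
 .apply(ParDo.of(new FormatAsTableRowFn()))
 .apply(BigQueryIO.Write.named("Write into BigQuery").to(<table>)
     .withSchema(schemaGen())
     .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
     .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

p.run();

With streaming enabled, BigQueryIO no longer needs to build the bounded side-input view that raised the IllegalStateException, which matches the behaviour the asker reports in the comments below.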

danielm
  • Ah okay, thanks for the answer! I'll check this in the office tomorrow morning. – Adam Brocklehurst Aug 03 '16 at 21:40
  • That worked and my project started to run; however, it started to output `Aug 04, 2016 9:47:55 AM com.google.api.client.http.HttpRequest execute WARNING: exception thrown while executing request java.net.SocketTimeoutException: Read timed out` every 20 seconds or so. I looked this up and it seems like you have to edit the socket timeout? However, some simple examples don't do this and I assume they must work - https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/examples/src/main/java/com/google/cloud/dataflow/examples/complete/TrafficRoutes.java – Adam Brocklehurst Aug 04 '16 at 09:06
  • Can you give a bit more context on where that exception is coming from? In general you shouldn't need to mess with timeouts, although if you are running locally you may be using a slower or higher latency network. – danielm Aug 04 '16 at 18:13
  • It looks like those errors should be retried automatically; is this causing your pipeline to get stuck, or just have spammy logs? In either case, setting a higher timeout should solve the problem. Most of the examples are intended primarily to run against the Dataflow service, and so may have some logspam when run locally – danielm Aug 05 '16 at 17:32
  • Yeah, it was just spamming the logs and was working. Thanks for your help! :) – Adam Brocklehurst Aug 08 '16 at 10:44