
I am listening to data from Pub/Sub using a streaming pipeline in Dataflow. I then need to upload the raw data to storage, process it, and upload it to BigQuery.

Here is my code:

public class BotPipline {

    public static void main(String[] args) {

        DataflowPipelineOptions options = PipelineOptionsFactory.as(DataflowPipelineOptions.class);
        options.setRunner(BlockingDataflowPipelineRunner.class);
        options.setProject(MY_PROJECT);
        options.setStagingLocation(MY_STAGING_LOCATION);
        options.setStreaming(true);

        Pipeline pipeline = Pipeline.create(options);

        // read messages from Pub/Sub
        PCollection<String> input = pipeline
                .apply(PubsubIO.Read.maxNumRecords(1).subscription(MY_SUBSCRIPTION));

        // write the raw messages to GCS
        input.apply(TextIO.Write.to(MY_STORAGE_LOCATION));

        // process the messages and write them to BigQuery
        input
                .apply(someDataProcessing(...).named("update json"))
                .apply(convertToTableRow(...).named("convert json to table row"))
                .apply(BigQueryIO.Write.to(MY_BQ_TABLE).withSchema(tableSchema));

        pipeline.run();
    }
}

When I run the code with the write to storage commented out, everything works well, but when I include it I get this error (which is expected):

Write can only be applied to a Bounded PCollection

I am not using a bounded collection since this needs to run all the time, and I need the data to be uploaded immediately. Any solution?

EDIT: this is my desired behavior:

I am receiving messages via Pub/Sub. Each message should be stored in its own file in GCS as raw data, then some processing should be executed on the data, and the result should be saved to BigQuery, with the file name included in the data.

Data should be seen in BQ immediately after it is received. Example:

data published to pubsub : {a:1, b:2} 
data saved to GCS file UUID: A1F432 
data processing :  {a:1, b:2} -> 
                   {a:11, b: 22} -> 
                   {fileName: A1F432, data: {a:11, b: 22}} 
data in BQ : {fileName: A1F432, data: {a:11, b: 22}} 

The idea is that the processed data is stored in BQ with a link to the raw data stored in GCS.

dina
  • Possible duplicate of [Can TextIO write to prefixes derived from the window maxTimestamp?](https://stackoverflow.com/questions/33522178/can-textio-write-to-prefixes-derived-from-the-window-maxtimestamp) – jkff Jul 12 '17 at 04:08

1 Answer


Currently we don't support writing unbounded collections in TextIO.Write. See related question.

Could you clarify what you would like the behavior of unbounded TextIO.Write to be? E.g. would you like one constantly growing file, or one file per window, closed when the window closes, or something else? Or does it only matter that the total contents of the files written eventually contain all the Pub/Sub messages, regardless of how the files are structured?

As a workaround, you can implement writing to GCS as your own DoFn, using IOChannelFactory to interact with GCS (in fact, TextIO.Write is, under the hood, just a composite transform that a user could have written themselves from scratch).
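For concreteness, here is a minimal, untested sketch of such a DoFn against the Dataflow 1.x SDK used in the question. The UUID-based file naming, the output prefix parameter, and the (fileName, data) output shape are assumptions for illustration, not part of the original answer:

import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.util.IOChannelUtils;
import com.google.cloud.dataflow.sdk.values.KV;

import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Writes each element to its own GCS file and emits (fileName, data)
// so the file name can be carried into the BigQuery row downstream.
class WriteEachElementToGcsFn extends DoFn<String, KV<String, String>> {
    private final String outputPrefix; // assumed, e.g. "gs://my-bucket/raw/"

    WriteEachElementToGcsFn(String outputPrefix) {
        this.outputPrefix = outputPrefix;
    }

    @Override
    public void processElement(ProcessContext c) throws Exception {
        String fileName = UUID.randomUUID().toString();
        String spec = outputPrefix + fileName;
        // IOChannelUtils resolves gs:// specs to the GCS channel factory.
        WritableByteChannel channel =
                IOChannelUtils.getFactory(spec).create(spec, "text/plain");
        try {
            channel.write(ByteBuffer.wrap(
                    c.element().getBytes(StandardCharsets.UTF_8)));
        } finally {
            channel.close();
        }
        c.output(KV.of(fileName, c.element()));
    }
}

You would then apply it with something like input.apply(ParDo.of(new WriteEachElementToGcsFn(MY_STORAGE_LOCATION))) in place of the TextIO.Write step, and feed the resulting (fileName, data) pairs into the processing and BigQueryIO.Write steps.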

You can access the window of the data using the optional BoundedWindow parameter on @ProcessElement. I'd be able to provide more advice if you explain the desired behavior.
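For example, with the newer annotation-based DoFn (which is where @ProcessElement comes from), declaring a BoundedWindow parameter is enough for the runner to inject the element's window; a hypothetical sketch:

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;

class TagWithWindowFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(ProcessContext c, BoundedWindow window) {
        // the runner passes in the window the current element belongs to
        c.output(window.maxTimestamp() + " " + c.element());
    }
}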

jkff
  • I am receiving messages via pubsub. I need each message to be saved in its own file in GCS, to have some processing done on the data, and then to save it to big query, with the file name in the data. – dina Nov 03 '16 at 06:42
  • Data should be seen immediately after received in BQ **example** : data published to pubsub : `{a:1, b:2}` data saved to GCS `file UUID: A1F432` data processing : ` {a:1, b:2} ` -> `{a:11, b: 22}` -> `{fileName: A1F432, data: {a:11, b: 22}} ` data in BQ : `{fileName: A1F432, data: {a:11, b: 22}} ` – dina Nov 03 '16 at 06:51
  • the idea is that the processed data is stored in BQ having a link to the raw data stored in GCS – dina Nov 03 '16 at 08:44
  • Based on what you're saying, you definitely need to implement this as your own `DoFn`, which should be straightforward using `IOChannelFactory`. The `DoFn` will, in `@ProcessElement`, use `IOChannelFactory` to create, write and close the respective file on `GCS`. Let me know if you need further help here! – jkff Nov 03 '16 at 15:01
  • Can you please provide me the desired code? Please see my [question](http://stackoverflow.com/questions/40402150/creating-a-custom-sink-in-data-flow). Something like: `input.apply(ParDo.of(new DoFn<String, String>() { @Override public void processElement(ProcessContext c) throws Exception { // TODO the code } }));` – dina Nov 03 '16 at 15:07
  • please see my [question](http://stackoverflow.com/questions/40402150/creating-a-custom-sink-in-data-flow) – dina Nov 03 '16 at 15:08