
I've been using the Google Cloud Storage plugin under the Sink category in a pipeline to write output in CSV format. After the pipeline runs, the output is split across several files. Is this the expected behaviour of the plugin? If it is, is there a way to get the consolidated output in a single file?

Edit: It seems this is indeed the expected behaviour of the plugin, as described in https://cloud.google.com/storage/docs/composite-objects: sharding is done to support parallel uploads. My question now is: is there a simple way to compose all these split files?

1 Answer


Finding multiple files in the output directory is the expected behavior, because Cloud Data Fusion uses Spark/MapReduce underneath to parallelize execution of the pipeline logic.

When combining the output files back into one, do you have any requirements about ordering?
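If ordering doesn't matter (or the shard order `part-00000`, `part-00001`, … is acceptable), the shards can simply be concatenated after download. A minimal local sketch, assuming the sink wrote CSV shards named `part-*` into one directory and that each shard repeats the header row (the function name and file layout here are assumptions, not part of the plugin):

```python
import csv
import glob
import os

def merge_csv_shards(shard_dir, out_path):
    """Concatenate part-* CSV shards into one file, keeping only the first header row."""
    header_written = False
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        # Sorting by filename preserves the shard order part-00000, part-00001, ...
        for shard in sorted(glob.glob(os.path.join(shard_dir, "part-*"))):
            with open(shard, newline="") as f:
                rows = list(csv.reader(f))
            if not rows:
                continue  # skip empty shards
            if not header_written:
                writer.writerow(rows[0])
                header_written = True
            # Skip each shard's own header row after the first
            writer.writerows(rows[1:])
```

This sidesteps the GCS compose API entirely, at the cost of downloading the shards first; for very large outputs, server-side composition would be preferable.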

Terence Yim
  • I believe it is a pretty basic question: what do we do with this fragmented output? I need it for further processing, or to view it in some other tool. In any case, I did create a composite file with the compose function of the Google Cloud Storage Python library, but that didn't work as expected: it creates malformed JSON. – Ashish Balhara May 02 '19 at 17:06
  • Is the output CSV or JSON? If the output is JSON, the file sink writes one JSON object per line. – Terence Yim May 03 '19 at 19:49
  • Yes, it is a JSON file. The composed JSON was malformed because of its extension and media type. Per the newline-delimited JSON specification (https://github.com/ndjson/ndjson-spec), the extension should be **ndjson** and the media type should be **application/x-ndjson**. I was creating it with the standard JSON ones, which is why I hit the problem. – Ashish Balhara May 07 '19 at 05:55
  • Initially I used a CSV file as output, but the header information was missing from the output file, so I moved to JSON output. Sorry for the confusion. – Ashish Balhara May 07 '19 at 07:16
  • You can use an Action in the pipeline to combine files. Currently Data Fusion doesn't have a built-in plugin to do that, but you can write one of your own, similar to those in https://github.com/data-integrations/google-cloud/tree/develop/src/main/java/io/cdap/plugin/gcp/gcs/actions . – Terence Yim Jul 03 '19 at 16:47
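As the comments note, concatenating (or composing) NDJSON shards yields one-object-per-line output, not a standard JSON document. If a single well-formed JSON array is what downstream tools expect, a short post-processing step can do the conversion. A sketch, assuming shards named `part-*` in a local directory (the function name and layout are assumptions for illustration):

```python
import glob
import json
import os

def ndjson_shards_to_json_array(shard_dir, out_path):
    """Read one-JSON-object-per-line shards and write a single well-formed JSON array."""
    records = []
    for shard in sorted(glob.glob(os.path.join(shard_dir, "part-*"))):
        with open(shard) as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines between records
                    records.append(json.loads(line))
    with open(out_path, "w") as out:
        json.dump(records, out)
```

Alternatively, keeping the composed file as-is and naming it with the `.ndjson` extension and `application/x-ndjson` media type, as the ndjson spec suggests, avoids the conversion entirely for tools that understand newline-delimited JSON.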