
I was trying to read data using a wildcard from a GCS path. The files are in bzip2 format, and there are around 300k files in the GCS path matching the wildcard expression. I'm using the code snippet below to read the files.

    // Match all files under the GCS wildcard, then read each bzip2 file
    // fully as a single String element of the PCollection.
    PCollection<String> val = p
            .apply(FileIO.match()
                    .filepattern("gcsPath"))
            .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
            .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
                try {
                    // Decompress and read the whole file as UTF-8.
                    return f.readFullyAsUTF8String();
                } catch (IOException e) {
                    return null;
                }
            }));

But the performance is very bad: at the current speed, the above code would take around 3 days to read all the files. Is there any alternative API I can use in Cloud Dataflow to read this many files from GCS with, of course, good performance? I used TextIO earlier, but it failed because of the template serialization limit, which is 20 MB.

  • What's the total transfer size of 300k files? – Parth Mehta Feb 07 '20 at 09:56
  • Is it running on Dataflow or on your computer? – guillaume blaquiere Feb 07 '20 at 19:34
  • @ParthMehta Total transfer size is around 1 TB. – miles212 Feb 07 '20 at 19:35
  • @guillaumeblaquiere It's running on Dataflow 2.17.0 – miles212 Feb 07 '20 at 19:36
  • The release versions are strange... the latest on the MVN repository is 2.19, but the latest on GitHub is 2.16-RC1. Can you try one of those? Also, do you see the same issue when you launch the pipeline with the direct runner on your computer? Last questions: what machine type do you use in your pipeline, and what is the scalability? – guillaume blaquiere Feb 07 '20 at 20:43
  • @guillaumeblaquiere, the "latest release" tag on the github repo seems to be outdated, but the releases are present. Right above 2.16 in small letters I see "... show 7 newer tags", and that shows all the way up to 2.19, just like Maven. – Daniel Oliveira Feb 08 '20 at 01:00
  • The stable version 2.18.0 was released 2 weeks earlier. A 2.19.0 snapshot is available, which is for beta users. – miles212 Feb 08 '20 at 05:18
  • @guillaumeblaquiere Could you please upvote this question so that someone will try to answer it? – miles212 Feb 08 '20 at 14:58
  • Done. You didn't answer about the machine type and the pipeline scaling. Do you have any input on this? – guillaume blaquiere Feb 08 '20 at 19:08
  • @guillaumeblaquiere The worker machine type is n1-standard-16 and I'm using the default auto-scaling algorithm, with no max workers set. – miles212 Feb 08 '20 at 19:09
  • Definitely not an issue of bandwidth or processing capability... – guillaume blaquiere Feb 08 '20 at 19:11
  • I don't know why, but FileIO performance is very bad; it might be some internal API glitch. – miles212 Feb 08 '20 at 19:13
  • Do you see Dataflow scaling up at all, or is reading handled by one or a few workers? If the job is not scaling up, this could be due to the sizes of the compressed files being unbalanced. Dataflow cannot split compressed files, so each file will be processed by one worker. Regarding the template serialization limit, I think this can be avoided by using the dataflowJobFile option (a sketch of this option follows these comments): https://cloud.google.com/dataflow/docs/guides/common-errors – chamikara Feb 10 '20 at 21:54
  • 1
    After fixing the template size error as per chamikara comment, if you use TextIO you can also make use of https://beam.apache.org/releases/javadoc/2.19.0/org/apache/beam/sdk/io/TextIO.Read.html#withHintMatchesManyFiles-- – Reza Rokni Feb 11 '20 at 02:15
  • Did you try to follow @chamikara's suggestion, and then switch the reading method to TextIO? – Nick_Kh Feb 11 '20 at 12:34
  • @chamikara I already followed all these steps to minimize the template serialization size, but thanks for the information. – miles212 Feb 11 '20 at 18:01
  • 1
    @RezaRokni I just saw your answer, but I got the same solution yesterday. Thanks for your help. – miles212 Feb 11 '20 at 18:08
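
As a hedged sketch of chamikara's dataflowJobFile suggestion: the snippet below assumes the option is exposed on the Dataflow runner's DataflowPipelineDebugOptions interface, and it takes the comment's claim that writing the job specification out to a file avoids the template size error at face value; treat both as assumptions rather than a confirmed fix.

    import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;

    public class JobFileSketch {
        public static void main(String[] args) {
            // Assumption: dataflowJobFile lives on DataflowPipelineDebugOptions.
            DataflowPipelineDebugOptions options = PipelineOptionsFactory
                    .fromArgs(args)
                    .withValidation()
                    .as(DataflowPipelineDebugOptions.class);
            // Write the translated Dataflow job specification to a local file;
            // per chamikara's comment, this helps avoid the 20 MB template limit.
            options.setDataflowJobFile("/tmp/dataflow-job-spec.json");
            Pipeline p = Pipeline.create(options);
            // ... build the pipeline as in the question, then run it ...
            p.run();
        }
    }

The same option can also be passed on the command line as --dataflowJobFile=/tmp/dataflow-job-spec.json.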

1 Answer


The TextIO code below solved the issue.

    PCollection<String> input = p.apply("Read file from GCS",
            TextIO.read().from(options.getInputFile())
                    .withCompression(Compression.AUTO)
                    .withHintMatchesManyFiles());

withHintMatchesManyFiles() solved the issue, but I still don't know why FileIO's performance is so bad.

  • I'm not following the solution. In your original post, I see you reading the whole file as an element into the PCollection, while in the second solution I see you reading each line as a separate element (String) into the PCollection. – Kolban Aug 06 '22 at 21:14
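
Regarding Kolban's point: withHintMatchesManyFiles() emits one element per line, while the original FileIO code emitted one element per file, so the two pipelines are not equivalent. If whole-file elements are required, here is a minimal sketch of a possible workaround, assuming the slowness came from all reads being fused onto the workers that matched the files; Beam's Reshuffle.viaRandomKey() is the kind of redistribution that TextIO's many-files path applies internally.

    // Whole-file semantics as in the question, plus a Reshuffle so the
    // ~300k matched files are spread across workers instead of being
    // fused onto the few workers that produced the matches.
    PCollection<String> wholeFiles = p
            .apply(FileIO.match().filepattern("gcsPath"))
            .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
            .apply(Reshuffle.viaRandomKey()) // break fusion before the heavy reads
            .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
                try {
                    return f.readFullyAsUTF8String();
                } catch (IOException e) {
                    return null;
                }
            }));

Whether the missing redistribution was the actual cause of the poor FileIO performance is an assumption; it matches the scaling symptoms discussed in the comments, but it was not verified here.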