I am trying to read data from a GCS path using a wildcard. The files are in bzip2 format, and around 300k files reside under the same wildcard expression. I'm using the code snippet below to read the files.
PCollection<String> val = p
    .apply(FileIO.match().filepattern("gcsPath"))
    .apply(FileIO.readMatches().withCompression(Compression.BZIP2))
    .apply(MapElements.into(TypeDescriptor.of(String.class)).via((ReadableFile f) -> {
        try {
            return f.readFullyAsUTF8String();
        } catch (IOException e) {
            // Rethrow instead of returning null: StringUtf8Coder cannot encode null elements.
            throw new RuntimeException("Failed to read " + f.getMetadata().resourceId(), e);
        }
    }));
But the performance is very bad: at the current rate it would take around 3 days to read all the files with the code above. Is there an alternative API I can use in Cloud Dataflow to read this many files from GCS with, of course, good performance? I used TextIO earlier, but it failed because of the Dataflow template serialisation limit, which is 20 MB.
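For context, the earlier TextIO version was roughly the following (a minimal sketch; "gcsPath" is the same wildcard placeholder as above, and the usual Beam imports such as org.apache.beam.sdk.io.TextIO and org.apache.beam.sdk.io.Compression are assumed):

// Expands the filepattern and reads every matched file line by line.
PCollection<String> lines = p
    .apply(TextIO.read()
        .from("gcsPath")                       // same wildcard filepattern
        .withCompression(Compression.BZIP2));  // decompress each file on read

(TextIO.Read also offers withHintMatchesManyFiles() for patterns that expand to a large number of files, but I don't know whether that avoids the 20 MB template limit.)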