In my use case, I receive a set of file patterns from Kafka:

PCollection<String> filepatterns = p.apply(KafkaIO.read()...);

Each pattern can match 300+ files.

Q1. How can I use TextIO.read() to read the files matched by patterns held in a PCollection, given that withHintMatchesManyFiles() is available only on TextIO.Read and not on TextIO.ReadFiles?

Q2. If I use the path FileIO.matchAll() -> FileIO.readMatches() -> TextIO.readFiles() instead, withHintMatchesManyFiles() isn't available. How will that affect read performance?

Q3. Is there a more optimized path for this use case?

Kenn Knowles
  • What are you trying to achieve? What's in `filepatterns` collection? Trying to get rid of this Kafka dependency if not really needed to reproduce the issue. – Jacek Laskowski Jun 01 '20 at 11:11
  • Kafka is not a dependency here. The idea is to read multiple file patterns from a PCollection that is populated from some other stream. To remove the stream dependency, try it with `PCollection<String> filepatterns = p.apply(Create.of("file://sample/20-01-20/24/*.zip"))` – Prakhar Mishra Jun 01 '20 at 18:38

2 Answers

Correct, withHintMatchesManyFiles() is not available on TextIO.ReadFiles out of the box. In fact, TextIO.read().withHintMatchesManyFiles() is itself implemented with FileIO transforms followed by TextIO.readFiles() (see details), so FileIO.readMatches() should distribute the file reads across the workers.

So I think you can use the same approach when reading file names from a Kafka topic.
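
A minimal sketch of that approach, assuming the Beam Java SDK and a runner such as the direct runner are on the classpath. The class name, the step names, and the `/tmp/sample/*.txt` pattern are placeholders for illustration, and `Create.of(...)` stands in for the Kafka source:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

public class ReadPatternsFromCollection {

  // Expands every incoming file pattern, opens the matched files,
  // and emits their contents line by line.
  static PCollection<String> readLines(PCollection<String> filepatterns) {
    return filepatterns
        .apply("MatchPatterns", FileIO.matchAll())   // pattern -> file metadata
        .apply("OpenFiles", FileIO.readMatches())    // metadata -> readable file
        .apply("ReadLines", TextIO.readFiles());     // readable file -> lines
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // Stand-in for the Kafka source; the default pattern is a placeholder.
    String pattern = args.length > 0 ? args[0] : "/tmp/sample/*.txt";
    PCollection<String> filepatterns = p.apply(Create.of(pattern));

    readLines(filepatterns);
    p.run().waitUntilFinish();
  }
}
```

With a real Kafka source, only the `Create.of(...)` line would change; `readLines` takes any `PCollection<String>` of patterns.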

Alexey Romanenko

How can I use TextIO.read() to read the files matched by patterns held in a PCollection, given that withHintMatchesManyFiles() is available only on TextIO.Read and not on TextIO.ReadFiles?

My very limited understanding of Apache Beam in general and PTransforms in particular is that TextIO.read() creates a root PTransform that can only be used at the very beginning of the pipeline. In other words, TextIO.Read cannot be used after a PTransform of any kind.
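
For illustration, here is a sketch of what that looks like in code, assuming the Beam Java SDK is on the classpath. The class name, method name, and default pattern are made up for the example. TextIO.read() is applied directly to the Pipeline, so the pattern has to be known when the pipeline is constructed:

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.values.PCollection;

public class RootTextRead {

  // TextIO.read() is applied to the pipeline itself (a root transform),
  // so the file pattern is fixed at construction time.
  static PCollection<String> readFixedPattern(Pipeline p, String pattern) {
    return p.apply("ReadFixedPattern",
        TextIO.read().from(pattern).withHintMatchesManyFiles());
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();
    // Placeholder pattern; replace with a real location.
    readFixedPattern(p, args.length > 0 ? args[0] : "/tmp/sample/*.txt");
    p.run().waitUntilFinish();
  }
}
```

Applying the same TextIO.read() to a `PCollection<String>` of patterns does not compile, because the transform expects the start of the pipeline as its input.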

Jacek Laskowski