
I have a use case where I'm reading on the order of billions of records, but I need to limit the input to a sample to inspect the data behaviour. I have a ParDo where I analyse the limited data and perform some logic based on it. However, I'm currently reading all of the records and only then applying a limit inside the ParDo to get 10,000 records. Since the pipeline reads billions of records first, this hampers its performance. Is there any way to limit the number of records while reading the text file with TextIO?

miles212
  • Can you provide more information? Which language are you using? Which method of Beam are you using to read? – rmesteves Feb 27 '20 at 12:16
  • @rmesteves I'm using Cloud Dataflow with Java, and TextIO.read to read data from GCS. – miles212 Mar 07 '20 at 15:27

2 Answers


Where are you reading the records from? I think the answer depends on that.

If they all come from, e.g., the same file, then I don't think Beam supports sampling only part of them while reading. If they come from different files, maybe you can design the file matching pattern so that you only read some of them?
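A minimal sketch of that idea, assuming the input is sharded into files named like part-00000-of-00500 on GCS (the bucket name and shard naming here are made up):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadSomeShards {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Only shards matching part-0000* are read, i.e. roughly the first ten
    // of the hypothetical 500 shards, instead of the whole data set.
    PCollection<String> lines =
        pipeline.apply("ReadSomeShards",
            TextIO.read().from("gs://my-bucket/input/part-0000*"));

    pipeline.run().waitUntilFinish();
  }
}
```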

Yueyang Qiu
  • Yeah, I'm using file matching for the segregated files, but in some cases I have unsharded files. I'm trying to do sampling on the unsharded files. – miles212 Feb 25 '20 at 06:06

You might try using the Sample transform, e.g. Sample.any(10000). Perhaps it will work faster.
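A minimal sketch of what that could look like, assuming a pipeline reading text files from GCS (the file pattern is made up). Note that TextIO still reads every record before Sample.any takes effect:

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.values.PCollection;

// Sample.any(10000) keeps an arbitrary subset of up to 10,000 elements,
// but the read itself still touches every record in the matched files.
PCollection<String> sampled =
    pipeline
        .apply("ReadAll", TextIO.read().from("gs://my-bucket/input/*"))
        .apply("Take10000", Sample.<String>any(10000));
```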

Alexey Romanenko
  • This method works on an input PCollection, so I have to read all the files before I can apply Sample.any, which is what I'm doing currently. Instead of reading all the files, I want the limit to be applied while reading in TextIO. – miles212 Mar 07 '20 at 15:32
  • I don't think it's possible with TextIO. It will probably be easier to pre-process your files (choose which files to read) depending on your knowledge of the data and its distribution. – Alexey Romanenko Mar 09 '20 at 16:40
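Building on that last comment, one possible sketch of pre-selecting files inside the pipeline: match the file pattern, sample a handful of the matched files, and read only those. The pattern, the file count of 5, and the record limit of 10,000 are assumptions for illustration, not values from the thread:

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.values.PCollection;

// Only the sampled files are opened and read, so the bulk of the data is
// never touched; the final Sample.any trims the result to 10,000 records.
PCollection<String> limited =
    pipeline
        .apply("MatchFiles", FileIO.match().filepattern("gs://my-bucket/input/*"))
        .apply("PickSomeFiles", Sample.<MatchResult.Metadata>any(5))
        .apply("ToReadableFiles", FileIO.readMatches())
        .apply("ReadLines", TextIO.readFiles())
        .apply("Take10000", Sample.<String>any(10000));
```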