
I have a use case where I'm reading on the order of billions of records, but I need to limit the input to a sample to inspect the data behaviour. I have a ParDo where I analyse the limited data and perform some logic based on it. However, I'm currently reading all of the records and only then applying a limit inside the ParDo to get 10,000 records. Since the pipeline reads billions of records first, this hampers its performance. Is there any way to limit the number of records while reading the text file with TextIO?

miles212
  • Can you provide more information? Which language are you using? Which method of Beam are you using to read? – rmesteves Feb 27 '20 at 12:16
  • @rmesteves I'm using Cloud Dataflow with Java, and TextIO.read to read data from GCS. – miles212 Mar 07 '20 at 15:27

2 Answers


Where are you reading the records from? I think the answer depends on that.

If they all come from, e.g., the same file, then I don't think Beam supports sampling only part of them while reading. If they come from different files, maybe you can design the file matching pattern so that you only read some of them?
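A minimal sketch of that idea, assuming the input is sharded into files named like part-00000-of-00500 on GCS (the bucket name and shard naming here are made up):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadSomeShards {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // Only shards matching part-0000* are read, i.e. roughly the first ten
    // of the hypothetical 500 shards, instead of the whole data set.
    PCollection<String> lines =
        pipeline.apply("ReadSomeShards",
            TextIO.read().from("gs://my-bucket/input/part-0000*"));

    pipeline.run().waitUntilFinish();
  }
}
```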

Yueyang Qiu
  • Yeah, I'm using file matching for the segregated files, but in some cases I have unsharded files. I'm trying to do sampling on the unsharded files. – miles212 Feb 25 '20 at 06:06

You might try using the Sample transform, e.g. Sample.any(10000). Perhaps it will work faster.
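A minimal sketch of what that could look like, assuming a pipeline reading text files from GCS (the file pattern is made up). Note that TextIO still reads every record before Sample.any takes effect:

```java
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.values.PCollection;

// Sample.any(10000) keeps an arbitrary subset of up to 10,000 elements,
// but the read itself still touches every record in the matched files.
PCollection<String> sampled =
    pipeline
        .apply("ReadAll", TextIO.read().from("gs://my-bucket/input/*"))
        .apply("Take10000", Sample.<String>any(10000));
```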

Alexey Romanenko
  • This method works on an input PCollection, so I have to read all the files before I can apply Sample.any, which is what I'm doing currently. Instead of reading all the files, I want the limit to be applied while reading in TextIO. – miles212 Mar 07 '20 at 15:32
  • I don't think it's possible with TextIO. It will probably be easier to pre-process your files (choose which files to read) depending on your knowledge of the data and its distribution. – Alexey Romanenko Mar 09 '20 at 16:40
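Building on that last comment, one possible sketch of pre-selecting files inside the pipeline: match the file pattern, sample a handful of the matched files, and read only those. The pattern, the file count of 5, and the record limit of 10,000 are assumptions for illustration, not values from the thread:

```java
import org.apache.beam.sdk.io.FileIO;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.Sample;
import org.apache.beam.sdk.values.PCollection;

// Only the sampled files are opened and read, so the bulk of the data is
// never touched; the final Sample.any trims the result to 10,000 records.
PCollection<String> limited =
    pipeline
        .apply("MatchFiles", FileIO.match().filepattern("gs://my-bucket/input/*"))
        .apply("PickSomeFiles", Sample.<MatchResult.Metadata>any(5))
        .apply("ToReadableFiles", FileIO.readMatches())
        .apply("ReadLines", TextIO.readFiles())
        .apply("Take10000", Sample.<String>any(10000));
```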