How to add record numbers to TextIO file sources in Apache Beam or Dataflow

Asked Feb 17 '17 at 16:25

Active Oct 16 '20 at 10:45

Viewed 238 times

I am using Dataflow (and now Beam) to process legacy text files, replicating the transformations of an existing ETL tool. The current process adds two fields to each row: the filename and a record number (the row's position within its file). The reason for keeping this additional information is to be able to tell which file and record offset each piece of source data came from.

I want to end up with a PCollection in which each element carries the filename and the per-file record number, either as additional fields in the value or as part of the key.

I've seen another article showing how the filename can be populated into the resulting PCollection, but I do not have a solution for adding the record number to each row. Currently the only way I can do it is to pre-process the files before I start the Dataflow job, which is a shame since I would like Dataflow/Beam to do it all.
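To make the target concrete, here is a minimal sketch of the record shape I am after, written as the kind of standalone pre-processing step I run today. It is plain Python with no Beam dependency; the names `NumberedRecord`, `number_records` and the file name `legacy_001.txt` are made up for illustration and are not part of any Beam API.

```python
import io
from typing import Iterable, Iterator, NamedTuple


class NumberedRecord(NamedTuple):
    """Hypothetical output element: one source line plus its provenance."""
    filename: str
    record_number: int  # 1-based position of the line within its file
    line: str


def number_records(filename: str, lines: Iterable[str]) -> Iterator[NumberedRecord]:
    """Attach the filename and a per-file record number to every line.

    The lines of one file must be consumed sequentially and in order,
    otherwise the counter no longer matches the position in the file.
    """
    for record_number, line in enumerate(lines, start=1):
        # Strip only the line terminator, keep any other whitespace intact.
        yield NumberedRecord(filename, record_number, line.rstrip("\r\n"))


if __name__ == "__main__":
    # Stand-in for an open legacy text file.
    sample = io.StringIO("alpha\nbeta\ngamma\n")
    for record in number_records("legacy_001.txt", sample):
        print(record)
```

The numbering depends on reading each file in order from its first line, which is why I currently do it outside the pipeline. What I am looking for is a way to get the same three fields per element from inside Dataflow/Beam.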

edited Oct 16 '20 at 10:45

Vadim Kotov


asked Feb 17 '17 at 16:25

Anant Mistry


0 Answers