Skipping header rows - is it possible with Cloud Dataflow?

Question

I've created a Pipeline, which reads from a file in GCS, transforms it, and finally writes to a BQ table. The file contains a header row (fields).

Is there any way to programmatically set the "number of header rows to skip", as you can in BQ when loading a file?


Sam McVeety · Accepted Answer · 2017-04-30T20:14:04.247

4

This is not currently possible. It sounds like there are two potential requests here:

Also, in the meantime, you could add a simple filter to your ParDo code to skip headers. Something like this:

PCollection<X> rows = ...;
PCollection<X> nonHeaders =
   rows.apply(Filter.by(new MatchIfNonHeader()));
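
The `MatchIfNonHeader` predicate is not spelled out above, so here is a minimal sketch of what it might look like, assuming the rows are plain `String` lines and the exact header text is known in advance. The constructor argument and the sample data are hypothetical. To keep the sketch runnable on its own it implements `java.util.function.Predicate` and is exercised with a plain Java stream; in a real pipeline the same comparison would go into the function you pass to `Filter.by`.

```java
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical predicate: keeps every line that is not the known header line.
// A plain java.util.function.Predicate is used so the sketch runs standalone;
// in a pipeline this logic would live in the function handed to Filter.by.
class MatchIfNonHeader implements Predicate<String>, Serializable {
    private final String header;  // assumed: exact text of the header row

    MatchIfNonHeader(String header) {
        this.header = header;
    }

    @Override
    public boolean test(String line) {
        // Compare by content rather than position, since the elements
        // are not guaranteed to arrive in file order.
        return !header.equals(line);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("id,name", "1,alice", "2,bob");
        List<String> nonHeaders = lines.stream()
            .filter(new MatchIfNonHeader("id,name"))
            .collect(Collectors.toList());
        System.out.println(nonHeaders);  // [1,alice, 2,bob]
    }
}
```

One caveat with matching on content: a data row that happens to be identical to the header would be dropped as well.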

edited Apr 30 '17 at 20:14

answered Feb 11 '15 at 17:36


Is there some kind of filter component that I can apply? Or do you just mean skipping the header in the actual "processElement" method of my ParDo code by checking if it's the header? – Graham Polley Feb 12 '15 at 01:55
One way could be to start processing in the ParDo and check whether the line contains the header; if it does, then skip it. – Amit_Hora Jan 12 '17 at 06:02
Was this ever addressed as a feature? – CCC Apr 29 '17 at 21:07
