
I want to skip the header line of a CSV file. As of now, I'm removing the header manually before loading the file to Google Storage.

Below is my code:

PCollection<String> financeobj = p.apply(TextIO.read().from("gs://storage_path/Financials.csv"));

PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype

    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] strArr = c.element().split(",");
        ClassFinance fin = new ClassFinance();
        fin.setBeneficiaryFinance(strArr[0]);
        fin.setCatlibCode(strArr[1]);
        fin.set_rNR_(Double.valueOf(strArr[2]));
        fin.set_rNCS_(Double.valueOf(strArr[3]));
        fin.set_rCtb_(Double.valueOf(strArr[4]));
        fin.set_rAC_(Double.valueOf(strArr[5]));
        c.output(fin);
    }
}));

I have checked the existing question on Stack Overflow, but I don't find it promising: Skipping header rows - is it possible with Cloud DataFlow?

Any help?

Edit: I have tried something like the below and it worked:

PCollection<String> financeobj = p.apply(TextIO.read().from("gs://google-bucket/final_input/Financials123.csv"));

PCollection<ClassFinance> pojos5 = financeobj.apply(ParDo.of(new DoFn<String, ClassFinance>() { // converting String into classtype

    private static final long serialVersionUID = 1L;

    @ProcessElement
    public void processElement(ProcessContext c) {
        String[] strArr2 = c.element().split(",");
        String header = Arrays.toString(strArr2);
        ClassFinance fin = new ClassFinance();

        if (header.contains("Beneficiary")) {
            System.out.println("Header");
        } else {
            fin.setBeneficiaryFinance(strArr2[0].trim());
            fin.setCatlibCode(strArr2[1].trim());
            // Default empty numeric fields to "0" (note: replace("", "0") would
            // insert a "0" between every character and corrupt the value)
            fin.setrNR(Double.valueOf(strArr2[2].trim().isEmpty() ? "0" : strArr2[2].trim()));
            fin.setrNCS(Double.valueOf(strArr2[3].trim().isEmpty() ? "0" : strArr2[3].trim()));
            fin.setrCtb(Double.valueOf(strArr2[4].trim().isEmpty() ? "0" : strArr2[4].trim()));
            fin.setrAC(Double.valueOf(strArr2[5].trim().isEmpty() ? "0" : strArr2[5].trim()));
            c.output(fin);
        }
    }
}));

3 Answers


The older Stack Overflow post that you shared (Skipping header rows - is it possible with Cloud DataFlow?) does contain the answer to your question.

This option is currently not available in the Apache Beam SDK, although there is an open feature request in the Apache Beam JIRA issue tracker, BEAM-123. Note that, as of this writing, the feature request is still open and unresolved, and it has been that way for two years already. However, it looks like some effort is being made in that direction, and the latest update on the issue is from February 2018, so I would advise you to stay up to date on that JIRA issue; it was recently moved to the sdk-java-core component, where it may get more attention.

With that information in mind, I would say that the approach you are using (removing the header before uploading the file to GCS) is the best option for you. I would refrain from doing it manually, though, since you can easily script the remove-header / upload-file step and automate it.
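For illustration, here is a minimal sketch of that automation using the google-cloud-storage Java client library. The bucket and object names are placeholders, and it assumes the whole file fits in memory:

import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;

public class StripHeaderAndReupload {
  public static void main(String[] args) {
    // Placeholder bucket/object names -- adjust to your environment
    String bucket = "my-bucket";
    String source = "raw/Financials.csv";
    String target = "clean/Financials.csv";

    Storage storage = StorageOptions.getDefaultInstance().getService();

    // Download the object and drop everything up to (and including) the
    // first newline, i.e. the header row
    byte[] bytes = storage.readAllBytes(BlobId.of(bucket, source));
    String content = new String(bytes, StandardCharsets.UTF_8);
    String withoutHeader = content.substring(content.indexOf('\n') + 1);

    // Upload the header-less copy as a new object
    storage.create(
        BlobInfo.newBuilder(bucket, target).setContentType("text/csv").build(),
        withoutHeader.getBytes(StandardCharsets.UTF_8));
  }
}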


EDIT:

I have been able to come up with a simple filter using a DoFn. It might not be the most elegant solution (I am not an Apache Beam expert myself), but it does work, and you may be able to adapt it to your needs. It requires that you know the header of the CSV files being uploaded beforehand (since it filters by element content), but again, take this just as a template that you can modify to fit your needs:

import java.io.IOException;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class RemoveCSVHeader {
  // The filter class
  static class FilterCSVHeaderFn extends DoFn<String, String> {
    String headerFilter;

    public FilterCSVHeaderFn(String headerFilter) {
      this.headerFilter = headerFilter;
    }

    @ProcessElement
    public void processElement(ProcessContext c) {
      String row = c.element();
      // Filter out elements that match the header
      if (!row.equals(this.headerFilter)) {
        c.output(row);
      }
    }
  }

  // The main class
  public static void main(String[] args) throws IOException {
    PipelineOptions options = PipelineOptionsFactory.create();
    Pipeline p = Pipeline.create(options);

    PCollection<String> vals = p.apply(TextIO.read().from("gs://BUCKET/FILE.csv"));

    // The exact header line to discard (known beforehand)
    String header = "col1,col2,col3,col4";

    vals.apply(ParDo.of(new FilterCSVHeaderFn(header)))
        .apply(TextIO.write().to("out"));

    p.run().waitUntilFinish();
  }
}
dsesto
  • That means I have to write a DoFn which takes one row at a time as input and blocks the first row (the header). Right? – Nagesh Singh Chauhan Jun 14 '18 at 12:28
  • Yes, you can do something similar to what was suggested in the linked question. Of course, this would depend on your implementation, but I would say that if you can directly upload the files without the headers, that would be more optimal than "forcing" Dataflow to discard the header in a more manual way. – dsesto Jun 14 '18 at 12:41
  • I agree with you... But the clients will just upload the file to storage and run the code; they will not remove the header first. That's the reason I want to automate this process. What I thought of: I'll use a Blob to read the file, remove the header, and then upload it back. After that I'll read the same file using a PCollection? – Nagesh Singh Chauhan Jun 14 '18 at 12:53
  • Hmmm, that's a trickier situation, I guess. Either you do that in Dataflow, or maybe you can implement a *not-so-difficult* solution involving Cloud Functions. You can code your own [GCS-triggered Cloud Function](https://cloud.google.com/functions/docs/calling/storage), which executes every time a new object is uploaded to GCS, and what it does is read the GCS file, remove the header, and then write the result back to a different bucket (which does not trigger the Cloud Function). This is just an idea (see the sketch after these comments), but you may find a Dataflow solution that matches your scenario better. – dsesto Jun 14 '18 at 14:47
  • Thanks Dsesto. I'll try to implement it using a DoFn. I have updated the code, can you please have a look.? – Nagesh Singh Chauhan Jun 15 '18 at 11:23
  • @NageshSinghChauhan I have updated my answer with a sample `DoFn` filter that can be used to remove the (known-beforehand) header of your CSV files. It works as is, but you may have to adapt it to your specific use case. – dsesto Jun 18 '18 at 15:53
  • I have implemented something like what you updated and it worked. Thanks a lot man... – Nagesh Singh Chauhan Jun 20 '18 at 10:46
  • I am glad to hear that it was useful for you :) – dsesto Jun 20 '18 at 15:44
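For reference, here is a minimal sketch of the GCS-triggered Cloud Function idea from the comments above, assuming the Java Functions Framework (BackgroundFunction) and the google-cloud-storage client. The GcsEvent POJO is the hand-rolled payload class from the official samples, and the output bucket name is a placeholder; it reuses the same strip-and-rewrite logic as the earlier sketch:

import com.google.cloud.functions.BackgroundFunction;
import com.google.cloud.functions.Context;
import com.google.cloud.storage.BlobId;
import com.google.cloud.storage.BlobInfo;
import com.google.cloud.storage.Storage;
import com.google.cloud.storage.StorageOptions;
import java.nio.charset.StandardCharsets;

public class StripHeaderFunction implements BackgroundFunction<StripHeaderFunction.GcsEvent> {

  // Minimal payload for a google.storage.object.finalize trigger
  public static class GcsEvent {
    public String bucket;
    public String name;
  }

  private static final Storage storage = StorageOptions.getDefaultInstance().getService();

  @Override
  public void accept(GcsEvent event, Context context) {
    // Read the newly uploaded object
    byte[] bytes = storage.readAllBytes(BlobId.of(event.bucket, event.name));
    String content = new String(bytes, StandardCharsets.UTF_8);

    // Drop everything up to and including the first newline (the header row)
    String withoutHeader = content.substring(content.indexOf('\n') + 1);

    // Write the result to a DIFFERENT bucket so the function is not re-triggered
    storage.create(
        BlobInfo.newBuilder("headerless-bucket", event.name).build(),
        withoutHeader.getBytes(StandardCharsets.UTF_8));
  }
}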

This code works for me. I used Filter.by() to filter out the header row from the CSV file.

static void run(GcsToDbOptions options) {
  Pipeline p = Pipeline.create(options);

  // Read the CSV file from the GCS input file path
  p.apply("Read Rows from " + options.getInputFile(), TextIO.read()
      .from(options.getInputFile()))
      // Filter out the header row
      .apply("Remove header row",
          Filter.by((String row) -> !(row.startsWith("dwid") || row.startsWith("\"dwid\"")
              || row.startsWith("'dwid'"))))
      // Write the rows to the database using a prepared statement
      .apply("Write to Auths Table in Postgres", JdbcIO.<String>write()
          .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(dataSource(options)))
          .withStatement(INSERT_INTO_MYTABLE)
          .withPreparedStatementSetter(new StatementSetter()));

  PipelineResult result = p.run();
  try {
    result.getState();
    result.waitUntilFinish();
  } catch (UnsupportedOperationException e) {
    // do nothing
  } catch (Exception e) {
    e.printStackTrace();
  }
}
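If you want to try the Filter.by() approach in isolation, here is a trimmed-down, self-contained sketch of the same idea that writes the filtered rows back to text files instead of Postgres; the bucket path and the "dwid" column name are placeholders:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;

public class FilterHeaderExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    p.apply("Read CSV", TextIO.read().from("gs://BUCKET/FILE.csv"))
        // Keep every row that does not start with the known first column name
        .apply("Remove header row", Filter.by((String row) -> !row.startsWith("dwid")))
        .apply("Write rows", TextIO.write().to("gs://BUCKET/output/part"));

    p.run().waitUntilFinish();
  }
}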
ankur agrawal

https://medium.com/@baranitharan/the-textio-write-1be1c07fbef0 TextIO.Write in Dataflow now has a withHeader function to add a header row to the data. This function was added in version 1.7.0.

So you can add a header to your CSV like this:

TextIO.Write.named("WriteToText")
        .to("/path/to/the/file")
        .withHeader("col_name1,col_name2,col_name3,col_name4")
        .withSuffix(".csv");

The withHeader function automatically adds a newline character at the end of the header row.
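Note that the snippet above uses the old Dataflow 1.x API. In the Beam 2.x SDK, the equivalent (a sketch with the same placeholder paths, applied to a PCollection<String> named lines) would be:

// Beam 2.x: TextIO.write() also supports withHeader
lines.apply(TextIO.write()
    .to("/path/to/the/file")
    .withHeader("col_name1,col_name2,col_name3,col_name4")
    .withSuffix(".csv"));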

Raj Rajen