Header Row Insert

Question

I am trying to insert data to big query using google cloud dataprep, I did create recipe and add first row as header row, but when I am trying to run on multiple files it insert the header row to my big query table also. Anybody facing this problem ?

score 0 · Answer 1 · answered Apr 17 '19 at 18:54

Welcome to StackOverflow, Andy!

I think I'm correctly understanding your problem, but I want to make sure since I'm making some assumptions:

You have multiple files imported in Dataprep
You created a recipe for the first file and convert row 1 to be a header
You apply a UNION step to merge the additional files
Your output contains the header rows for the additional files

If that's correct, the issue is that the header rows in the other files aren't being removed simply because Dataprep doesn't know what they are. In most cases, Dataprep will detect the file structure and you won't have to manually specify the header row. When that fails, however, UNION steps get a little funny like this—but you can definitely fix it in Dataprep.

Workarounds:

Apply a Recipe to Each Input File

Simply add a recipe to each file that converts the first row to a header—then instead of selecting your original file in the main recipe's UNION, select the other recipes (Dataprep will run them before merging the data).

While this takes some extra effort, it's doable for a small number of files. The advantage here is that you don't have to worry about whether your data may contain the header value—but I'd recommend using the other option if you're able to.

Use a Custom Filter Formula to Delete All Header Rows

The other option is a bit more dependent on your data, but lets you do everything in the main recipe. For example, after setting headers from the first file and applying your UNION you would add a "Filter rows using custom formula" step (or clicking Filter Rows > On column values > Custom Filter...), then match using a column that wouldn't contain the header string (e.g. CustomerID == "CustomerID")—integer columns work great since you don't have to worry if the value could contain the header string. The Resulting wrangle script should look something like this:

header sourcerownumber: 1
[union step goes here]
filter type: custom rowType: single row: CustomerID == 'CustomerID' action: Delete

Note: You may be tempted to do this by using $sourcerownumber, but that doesn't exist due to the union. I'm hoping that they'll eventually support it for this use case though.

These aren't the only ways you could eliminate the headers, but should provide two easy options for you.

As a pro-tip, you can copy a line of the wrangle script above and paste it after clicking "New Step" in your recipe and it'll set up the filter the same way that I did so you don't have to start from scratch. Just change the column name/value and you should be good to go.

Again, welcome to the site—and if any of the assumptions above are incorrect, update your original question with the additional details and let me know in a comment and I'll be happy help you out further.

Header Row Insert

1 Answers1

Workarounds:

Apply a Recipe to Each Input File

Use a Custom Filter Formula to Delete All Header Rows