
Suppose I have the following data in a csv file:

'"tom","jones","hello,\nMy name is tom"\x01\n"sarah","smith","hello"\x01\n'

The line terminator is \x01\n. Would it be possible to load this into BigQuery directly from GCS (without pre-formatting it first)? My thinking was to:

  • Load it as CSV using a non-standard field separator (such as \u00ff) so that each line's data ends up in a single string column.
  • Then run basic DML statements to 'clean up' the data and reformat it (a rough sketch of what I had in mind is below).
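
For illustration, the kind of 'clean up' query I had in mind is roughly the sketch below (using the single-column table bqtesting.ns1 from the commands further down; the output table name is a placeholder, and the exact escaping of the \x01 terminator after loading may need adjusting):

-- Glue the single-column rows back together, then re-split on the real
-- record terminator. The STRING_AGG step needs the rows in their original
-- file order, which is exactly where this breaks down (see below).
CREATE OR REPLACE TABLE bqtesting.ns_clean AS
SELECT record AS data
FROM (
  SELECT SPLIT(STRING_AGG(data, '\n'), '\x01\n') AS records
  FROM bqtesting.ns1
) AS t, UNNEST(t.records) AS record
WHERE record != '';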

However, we run into an issue when we have run-on lines, as BQ doesn't 'support' (if you want to call it that) row ordering. Here is what my data looks like in BQ now:

[screenshot of the loaded data in BigQuery]

As we can see, the line ordering is not preserved, so it'd be impossible to 'recombine' the data (for example, with a UDF) into the correct CSV data we need.

Are there any other possible approaches here? Just to clarify, I'm looking to transform a CSV file that is already on GCS via BigQuery, without having to download that file to a separate server for processing before loading into BQ.


For reference, here is the code I'm currently using:

# /tmp/schema_external_nonstandard_csv.json
{
  "schema": {
    "fields": [
      {
        "name": "data",
        "type": "STRING"
      }
    ]
  },
  "sourceFormat": "CSV",
  "sourceUris": [
    "gs://XY-bq/ns.csv"
  ],
  "csvOptions": {
    "fieldDelimiter": "\u00ff",
    "quote": ""
  },
  "maxBadRecords": 1000000
}

$ bq mk --external_table_definition=/tmp/schema_external_nonstandard_csv.json XY-163219:bqtesting.ns
$ bq query --nouse_legacy_sql 'CREATE TABLE `XY-163219.bqtesting.ns1` AS select * from `XY-163219.bqtesting.ns`'
  • Make sure to respond to the answer on your first question. – Elliott Brossard Feb 11 '19 at 21:19
  • @ElliottBrossard could you please clarify? I answered it. –  Feb 11 '19 at 21:22
  • Ah, sorry ha ha. I didn't realize that you had answered your own question :) Disregard. – Elliott Brossard Feb 11 '19 at 22:07
  • how large/small your files? – Mikhail Berlyant Feb 12 '19 at 02:09
  • @MikhailBerlyant they could be any size from a few lines to 100GB or so. Really, it just depends. –  Feb 12 '19 at 02:29
  • I think the problem is that you have disabled `quote` which means that newlines anywhere in your data will be interpreted as EOL. – John Hanley Feb 14 '19 at 22:46
  • @DavidL, I think you should try to explain what the expected output is and add the queries you are using to format the data. There are a few options for doing this, but I can say in advance that none of them will perform well with big files that end up as a single row – hlagos Feb 15 '19 at 02:08
  • it would help a lot if you could share some sample files – Felipe Hoffa Feb 15 '19 at 23:34
  • Would you consider using Dataprep or Dataflow (i.e. [test](https://gist.github.com/gxercavins/4a6508501267d63854697767787d88f8), [result](https://i.stack.imgur.com/kbFjq.png)) as a "middle man"? In any case, as @hlagos says, you'd find problems for 100GB single-row CSVs either due to row limit in BigQuery or unsplittable steps in Dataflow. – Guillem Xercavins Feb 17 '19 at 21:42
  • @GuillemXercavins that seems like a great plan -- you can post that in an answer and I'll accept it. –  Feb 18 '19 at 08:20

1 Answer


I thought about some pure BigQuery solutions:

  1. Specifying the delimiter with bq. This does not work for this use case, as documented here:

"The separator can be any ISO-8859-1 single-byte character."

  2. Using REGEXP_REPLACE, the best I got is a single row with a line break inside:
CREATE OR REPLACE TABLE test.separator_final AS
SELECT
  REGEXP_REPLACE(data, r"\\x01\\n", "\n") AS data
FROM
  test.separator_external

[screenshot of the single-row result]

  3. Following from the previous point, it is possible to use a 'hack' to explode the line into different rows (see answer here); one way to do this step is sketched after this list. However, the caveat is that you need to know the number of splits a priori, and in this case it is not consistent.

  4. The one you are already using, but adding a row number so that the data can be merged back together. This might work, but ensuring that row order is preserved can be complicated too.
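
For reference, one possible version of the explode step from point 3 is SPLIT plus UNNEST over the single-row table produced in point 2 (table names follow the examples above; this particular variant does not need the number of splits up front, so it may differ from the hack in the linked answer):

-- Explode the single row into one row per record by splitting on the
-- newlines inserted by the REGEXP_REPLACE step above.
CREATE OR REPLACE TABLE test.separator_rows AS
SELECT line AS data
FROM test.separator_final,
  UNNEST(SPLIT(data, '\n')) AS line
WHERE line != '';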

If we consider using other GCP products as a middle-man between GCS and BigQuery, we can find other interesting solutions:

  1. Using Dataprep, which runs Dataflow under the hood. There is a replace transform (docs) and Dataflow templates can be generated and invoked programmatically.

  2. Using Dataflow. I actually tested this solution with this gist and it works. I think this one can be extended very nicely by creating a template (the custom separator could be an input parameter) and triggering it each time you upload a file to GCS with Cloud Functions (a NoOps solution).

Briefly, we read the records from the file using TextIO.read().from(file), where file is the GCS path (provide the input and output parameters when launching the job). We can additionally use a dummy delimiter with withDelimiter() to avoid conflicts (here we are again limited to single bytes, so we can't directly pass the real one). Then, for each line, we split on the real delimiter with c.element().split("\\\\x01\\\\n"). Note that we need to escape the already-escaped characters (you can verify that in the JSON query results with a normal load), hence the quadruple backslashes.

p
    // Read the file contents from the GCS path in 'file'.
    .apply("GetMessages", TextIO.read().from(file))
    // Split each element on the (escaped) custom terminator \x01\n and
    // output one element per CSV record, skipping empty pieces.
    .apply("ExtractRows", ParDo.of(new DoFn<String, String>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        for (String line : c.element().split("\\\\x01\\\\n")) {
          if (!line.isEmpty()) {
            c.output(line);
          }
        }
      }
    }))

Results:

[screenshot of the resulting rows]

Keep in mind that, as pointed out by @hlagos, you can run into problems with very large single-row CSVs, either due to the row limit in BigQuery or to unsplittable steps assigned to a single worker in Dataflow.

  • This is a great approach, thanks for sharing. I've tested using Cloud Functions as a serverless approach to the processing, but the ~10 minute timeout limit makes it impossible to use on larger files. I would say it can consistently work on smaller sources (perhaps under 10 or 100M rows), but for larger ones it's likely to hit the timeout. –  Feb 18 '19 at 19:23
  • My idea would be to use the Cloud Function only to trigger the Dataflow job (using templates) instead of doing the actual processing to avoid exceeding the timeout. See an example [here](https://shinesolutions.com/2017/03/23/triggering-dataflow-pipelines-with-cloud-functions/) – Guillem Xercavins Feb 19 '19 at 09:52