
I'm currently trying to implement a number of Delta Lake tables for storing our 'foundation' data.

These tables will be built from delta data ingested into the 'raw' zone of our data lake in the following format:

  • /raw/MyDataSource/2020/11/23/EntityOne_1056.parquet
  • /raw/MyDataSource/2020/11/23/EntityTwo_1059.parquet
  • /raw/MyDataSource/2020/11/22/EntityOne_0956.parquet

I can go about this using a series of loops and dataframes to read the existing data and merge it with the new data; however, I think there is an easier way.

I've discovered the COPY INTO command (https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html), which appears to take the headache out of doing the merge and managing which batched files have already been processed into the Delta Lake table, but I'm not sure how to implement it and the only examples I can find are very limited.

  COPY INTO table_identifier
  FROM [ file_location | (SELECT identifier_list FROM file_location) ]
  FILEFORMAT = data_source
  [FILES = [file_name, ... | PATTERN = 'regex_pattern']]
  [FORMAT_OPTIONS ('data_source_reader_option' = 'value', ...)]
  [COPY_OPTIONS 'force' = ('false'|'true')]

I think I first need to create a list of the expected entity names, then use that to run a COPY INTO command for each entity/delta table.

I need to make the command look at our 'raw' zone in the data lake and find all files for that entity (probably using wildcards in the file path or a regex pattern). Something like:

"/raw/MyDataSource/*/*/*/{0}_*.parquet".format("EntityOne")

Finally, I need to run the query using either a spark.sql(...) call or a SQL cell in my Databricks notebook.
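
Putting that together, something like the sketch below is roughly what I have in mind (the entity list, paths and target table names are placeholders, and I'm not certain this is correct COPY INTO usage; spark is the SparkSession available in a Databricks notebook):

  # Rough sketch only: entity names, FROM path and target table names are placeholders
  entities = ["EntityOne", "EntityTwo"]

  for entity in entities:
      # one COPY INTO per entity / target Delta table
      spark.sql(f"""
          COPY INTO foundation.{entity.lower()}
          FROM '/raw/MyDataSource/'
          FILEFORMAT = PARQUET
          PATTERN = '*/*/*/{entity}_*.parquet'
      """)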

My questions:

  1. Have I properly understood what the COPY INTO command can do?
  2. How do I construct the COPY INTO query for my use case?
  • I'm just starting to use COPY INTO so not going to answer as I am unsure. But I think you may want to use `PATTERN` instead of `FILES` and let Databricks figure out what files to process. – Michael West Jun 07 '21 at 18:43
