I'm currently trying to implement a number of Delta Lake tables for storing our 'foundation' data.
These tables will be built from delta data ingested into the 'raw' zone of our data lake in the following format:
- /raw/MyDataSource/2020/11/23/EntityOne_1056.parquet
- /raw/MyDataSource/2020/11/23/EntityTwo_1059.parquet
- /raw/MyDataSource/2020/11/22/EntityOne_0956.parquet
I could go about this with a series of loops and dataframes, reading the existing data and merging in the new data, but I think there is an easier way.
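For reference, the manual approach I have in mind is roughly the sketch below (the merge key 'id', the entity list and the '/foundation/...' target paths are just placeholders for illustration):

from delta.tables import DeltaTable

entities = ["EntityOne", "EntityTwo"]  # placeholder list of entity names

for entity in entities:
    # Re-read every raw file for this entity (this is the part I'd like to avoid managing myself)
    new_df = spark.read.parquet("/raw/MyDataSource/*/*/*/{0}_*.parquet".format(entity))

    # Placeholder path for the existing 'foundation' Delta table
    target = DeltaTable.forPath(spark, "/foundation/MyDataSource/{0}".format(entity))

    # Upsert the new rows; 'id' stands in for whatever the real business key is
    (target.alias("t")
        .merge(new_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())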
I've discovered the COPY INTO command (https://docs.databricks.com/spark/latest/spark-sql/language-manual/delta-copy-into.html), which appears to take the headache out of doing the merge and tracking which batched files have already been processed into the Delta Lake table, but I'm not sure how to implement it and the only examples I can find are very limited.
COPY INTO table_identifier
FROM [ file_location | (SELECT identifier_list FROM file_location) ]
FILEFORMAT = data_source
[ FILES = ( file_name [, ...] ) | PATTERN = 'regex_pattern' ]
[FORMAT_OPTIONS ('data_source_reader_option' = 'value', ...)]
[ COPY_OPTIONS ( 'force' = { 'false' | 'true' } ) ]
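If I've read the documentation correctly, a filled-in statement for a single entity would look something like the sketch below (assuming the target table foundation.entity_one already exists as a Delta table, and that PATTERN is a glob evaluated relative to the FROM path):

# Sketch only: COPY INTO for a single entity, executed via spark.sql(...)
spark.sql("""
    COPY INTO foundation.entity_one
    FROM '/raw/MyDataSource/'
    FILEFORMAT = PARQUET
    PATTERN = '*/*/*/EntityOne_*.parquet'
""")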
I think I first need to create a list of the expected entity names, then use that to run a COPY INTO
command for each entity/delta table.
I need to make the command look at our 'Raw' zone in the data lake and find all files for that entity (probably using wildcards in the file path or a RegEx pattern). Something like:
"/raw/MyDataSource/*/*/*/{0}_*.parquet".format("EntityOne")
Finally, I need to execute the query, either with a spark.sql(...) call or in a SQL cell in my Databricks notebook.
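Putting that together, I imagine the per-entity loop would be something like the sketch below (the entity list, the foundation.* table names and the raw path are placeholders; I'm also assuming each target table has already been created with CREATE TABLE ... USING DELTA):

entities = ["EntityOne", "EntityTwo"]  # placeholder: the expected entity names

for entity in entities:
    # Placeholder mapping from entity name to its target Delta table
    table_name = "foundation.{0}".format(entity.lower())

    # COPY INTO should skip files it has already loaded into this table
    spark.sql("""
        COPY INTO {table}
        FROM '/raw/MyDataSource/'
        FILEFORMAT = PARQUET
        PATTERN = '*/*/*/{entity}_*.parquet'
    """.format(table=table_name, entity=entity))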
My questions:
- Have I properly understood what the COPY INTO command can do?
- How do I construct the COPY INTO query for my use case?