I'm new to Databricks and I'm trying to test the validity of continuously loading an hourly file into a primary table that will be used for reporting. Each hourly file is roughly 300-400 GB and contains ~1-1.3 billion records. I would like the primary table to store ~48 hours' worth of data, but I really only need 6 hourly files to build a complete view of my data.

My current process is below and it seems to work OK. The hourly CSV file is stored on Azure Data Lake (Gen1) and the primary table uses ADLS Gen2 as its storage. Are these the best options? Does this process seem sound, or am I doing something horribly wrong? :)

```python Load hourly CSV
csvdata = (spark.read.format("csv")
    .option("header", "true")
    .option("ignoreLeadingWhiteSpace", "true")
    .option("ignoreTrailingWhiteSpace", "true")
    .option("timestampFormat", "yyyy-MM-dd HH:mm:ss.SSS")
    .option("delimiter", "|")
    .option("inferSchema", "true")
    .option("mode", "FAILFAST")
    .load("adl://pathToCsv"))

# createOrReplaceTempView returns None, so cache the DataFrame first,
# then register the temp view used by the merge below.
csvdata.cache()
csvdata.createOrReplaceTempView("tempdata")
```
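For context, primaryTable is a Delta table on ADLS Gen2 (as I understand it, MERGE INTO with UPDATE SET * / INSERT * needs a Delta target). Roughly, it was created like this; the column list and abfss path below are simplified placeholders, not the exact definition:

```sql Create primary table
-- Simplified sketch: the real table has more columns and the abfss path is a placeholder.
CREATE TABLE IF NOT EXISTS primaryTable (
  UserGuid  STRING,
  OrgGuid   STRING,
  timestamp STRING  -- stored as text in 'yyyy/MM/dd HH:mm:ss.SSS' format, hence the casts in the merge
)
USING DELTA
LOCATION 'abfss://<container>@<storage-account>.dfs.core.windows.net/primaryTable';
```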

```sql Merge
MERGE INTO primaryTable
USING tempdata
ON primaryTable.UserGuid = tempdata.UserGuid AND primaryTable.OrgGuid = tempdata.OrgGuid
WHEN MATCHED AND cast(unix_timestamp(primaryTable.timestamp,'yyyy/MM/dd HH:mm:ss.SSS') AS timestamp) < cast(unix_timestamp(tempdata.timestamp,'yyyy/MM/dd HH:mm:ss.SSS') AS timestamp) THEN
  UPDATE SET *
WHEN NOT MATCHED
  THEN INSERT *
```
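
To keep the primary table at roughly 48 hours of data, I'm planning to run a cleanup like this after each merge (sketch only; it assumes the timestamp column stays in the 'yyyy/MM/dd HH:mm:ss.SSS' text format cast in the merge above):

```sql Prune to ~48 hours
-- Sketch: remove rows older than 48 hours after each hourly merge.
-- Assumes `timestamp` is stored as text in 'yyyy/MM/dd HH:mm:ss.SSS' format.
DELETE FROM primaryTable
WHERE cast(unix_timestamp(timestamp, 'yyyy/MM/dd HH:mm:ss.SSS') AS timestamp)
      < current_timestamp() - INTERVAL 48 HOURS;
```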
- I doubt there is any use of `.cache()`. Did you compare Spark plans with and without? – Sai May 02 '19 at 22:10
