3

I have recently started working on Databricks and I have been trying to find a way to perform a merge statement on a Delta table, though using an R api (preferably sparklyr). The ultimate purpose is to somehow impose a 'duplicate' constraint as described here. The aforementioned documentation describes the Python workflow :

deltaTable.alias("logs").merge(
    newDedupedLogs.alias("newDedupedLogs"),
    "logs.uniqueId = newDedupedLogs.uniqueId") \
  .whenNotMatchedInsertAll() \
  .execute()

however, I was wondering whether there is a straight-forward way to achieve this through R. Any assistance/ideas on the matter will be much appreciated since I am a new user (as mentioned above). Thanks in advance!

Community
  • 1
  • 1
takmers
  • 71
  • 1
  • 5

2 Answers2

4

Providing this answer since the you commented that there is no R Delta Lake API support. There is now a new R package that provides an R API for Delta Lake: dlt. The syntax is very similar to that of the Python API for Delta Lake.

In the case of your example:

# Install and laod the `dlt` package
remotes::install_gitlab("zero323/dlt")
library(dlt)
...

# Use the Delta Lake R API from the dlt package
deltaTable <- dlt_for_path("<path to table>")

deltaTable %>%
  dlt_alias("logs") %>%
  dlt_merge(alias(newDedupedLogs, "newDedupedLogs"), expr("newDedupedLogs.uniqueId = logs.uniqueId")) %>%
  dlt_when_not_matched_insert_all() %>%
  dlt_execute()
Bram
  • 376
  • 1
  • 4
3

There is no official R API for Delta Lake, but you can just use SQL's MERGE INTO command for that (I often do it in Scala/Python, just because it's easier to read, at least for me). You need to register a temp view with data that you want to put into destination table, and just run sql, like this (string is split for readability):

library(SparkR)
updates_df <- ...get updates...
createOrReplaceTempView(updates, "updates")
result <- sql(
  "MERGE INTO <your_table> AS target USING updates 
   ON target.id = updates.id WHEN NOT MATCHED THEN INSERT *")
Alex Ott
  • 80,552
  • 8
  • 87
  • 132
  • That's sad, ref to no official support for R Delta Lake API support. Anyhow, thanks a lot for the solution and feedback provided ; any case I can avoid the TempView? I suppose not, correct? – takmers Nov 23 '21 at 21:00
  • You can theoretically write down to disk and use it, but it would be worse than temp view – Alex Ott Nov 23 '21 at 21:12
  • 3
    Btw, it looks like some wrapper for R is released, although not official – Alex Ott Nov 27 '21 at 18:19