
I couldn't find any reference on accessing Delta data with SparkR, so I tried it myself. First, I created a dummy dataset in Python:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Sample rows: (firstname, middlename, lastname, id, gender, salary)
data2 = [
    ("James", "", "Smith", "36636", "M", 2000),
    ("Robert", "", "Williams", "42114", "M", 5000),
    ("Maria", "Anne", "Jones", "39192", "F", 5000),
    ("Jen", "Mary", "Brown", "", "F", -1),
]

schema = StructType([
    StructField("firstname", StringType(), True),
    StructField("middlename", StringType(), True),
    StructField("lastname", StringType(), True),
    StructField("id", StringType(), True),
    StructField("gender", StringType(), True),
    StructField("salary", IntegerType(), True),
])

# `spark` is the SparkSession provided by the notebook
df = spark.createDataFrame(data=data2, schema=schema)

# Write as a Delta table; userMetadata labels this commit in the table history
df.write \
  .format("delta") \
  .mode("overwrite") \
  .option("userMetadata", "first-version") \
  .save("/temp/customers")

You can modify this code to change the data and run it again to simulate changes over time; each run in overwrite mode adds a new version to the Delta table.
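To choose a timestamp for a time-travel read, it helps to see which versions the writes have produced. The following is a minimal sketch, assuming the same `spark` session and the `/temp/customers` path used above. `history_sql` and `show_history` are hypothetical helper names introduced here for illustration; `DESCRIBE HISTORY` is Delta Lake SQL.

```python
def history_sql(path):
    """Build the Delta Lake SQL statement that lists a table's commit history."""
    # Backticks quote the path inside the delta.`<path>` table identifier.
    return "DESCRIBE HISTORY delta.`{}`".format(path)


def show_history(spark, path):
    """Print version, timestamp and userMetadata for each commit.

    Needs a live Spark session with Delta Lake available.
    """
    history = spark.sql(history_sql(path))
    history.select("version", "timestamp", "userMetadata").show(truncate=False)
    return history
```

In a notebook this would be called as `show_history(spark, "/temp/customers")`; the `userMetadata` column should carry the label set by the `userMetadata` write option.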

I can query an earlier version in Python like this:

df3 = spark.read \
  .format("delta") \
  .option("timestampAsOf", "2020-11-30 22:03:00") \
  .load("/temp/customers")
df3.show(truncate=False)
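Delta Lake also accepts a `versionAsOf` read option as an alternative to `timestampAsOf`. The sketch below wraps both in one function; `read_delta_as_of` is a hypothetical helper name, not part of PySpark or Delta Lake, and it assumes an existing `spark` session with Delta available.

```python
def read_delta_as_of(spark, path, timestamp=None, version=None):
    """Load a Delta table as of a timestamp or a version number.

    Exactly one of `timestamp` or `version` must be given.
    """
    if (timestamp is None) == (version is None):
        raise ValueError("pass exactly one of timestamp or version")
    reader = spark.read.format("delta")
    if timestamp is not None:
        # Same option as in the snippet above
        reader = reader.option("timestampAsOf", timestamp)
    else:
        # Version numbers start at 0 for the first write
        reader = reader.option("versionAsOf", version)
    return reader.load(path)
```

For example, `read_delta_as_of(spark, "/temp/customers", version=0)` would load the table as first written, while passing `timestamp="2020-11-30 22:03:00"` reproduces the read shown above.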

But I don't know how to pass the timestampAsOf option in SparkR. Can you help me?

%r
library(SparkR)
teste_r <- read.df("/temp/customers", source="delta")
head(teste_r)

It works, but it returns only the current version.

marc_s
Alex

2 Answers


timestampAsOf can be passed as a named argument to SparkR::read.df, which forwards it to the data source as an option.

SparkR::read.df("/temp/customers", source = "delta", timestampAsOf = "2020-11-30 22:03:00")

This can also be done with SparkR::sql.

SparkR::sql('
SELECT *
FROM delta.`/temp/customers`
TIMESTAMP AS OF "2020-11-30 22:03:00"
')

Alternatively, to do it in sparklyr, use the timestamp parameter in spark_read_delta.

library(sparklyr)

sc <- spark_connect(method = "databricks")

spark_read_delta(sc, "/temp/customers", timestamp = "2020-11-30 22:03:00")
Paul

If you need to do this on a local machine, this is what I use on Windows:

your_connection = AzureStor::storage_container(AzureStor::storage_endpoint(your_link, key=your_key), "your_container")

readparquetR(pathtoread="blobpath/subdirectory/", filelocation = "azure", format="delta", containerconnection = your_connection) 

The readparquetR function comes from https://github.com/mkparkin/Rinvent

korayp