0

I am trying to get a customer's first_name before it was updated in a Databricks delta table. However, I am getting ParseException while trying SQL below.

select c_first_name from  customer_table where c_customer_sk=1 version as of 0

The workaround I am using is as below :

df = spark.sql("select c_customer_sk, c_first_name from  customer_table version as of 0")
name_before_update = (df
                      .filter("c_customer_sk=1")
                      .select("c_first_name")

The above solution is working perfectly. But this has two steps instead of one. Do we have a better solution?

Related question - Is the "where" clause illegal in the time travel query?

soumya-kole
  • 1,111
  • 7
  • 18
  • I just found an alternative syntax is working - "select c_first_name from customer_table@v0 where c_customer_sk=1". However, would love to know what is wrong with "select c_first_name from customer_table where c_customer_sk=1 version as of 0" – soumya-kole Mar 16 '23 at 00:55
  • Yes, in your case the problem is with the syntax. The `VERSION AS OF 0` associated with the `FROM` clause. – arudsekaberne Mar 16 '23 at 02:02

1 Answers1

0

First lets see how our delta table transaction log looks like.

DESCRIBE DETAIL <delta_table>;

Output:

+-------+-------------------+----------------+--------------------+--------------------+--------------------+----+-----------------+--------------------+-----------+-----------------+-------------+--------------------+------------+--------------------+
|version|          timestamp|          userId|            userName|           operation| operationParameters| job|         notebook|           clusterId|readVersion|   isolationLevel|isBlindAppend|    operationMetrics|userMetadata|          engineInfo|
+-------+-------------------+----------------+--------------------+--------------------+--------------------+----+-----------------+--------------------+-----------+-----------------+-------------+--------------------+------------+--------------------+
|      0|2023-03-16 01:34:22|3281063801330285|abc@g...|CREATE TABLE AS S...|{isManaged -> tru...|null|{432383257327813}|0316-012541-6spdz4dm|       null|WriteSerializable|         true|{numFiles -> 5, n...|        null|Databricks-Runtim...|
+-------+-------------------+----------------+--------------------+--------------------+--------------------+----+-----------------+--------------------+-----------+-----------------+-------------+--------------------+------------+--------------------+

We can see that each transaction has a version number and timestamp associate with it. These are the following ways to query our older data:

  1. Using Timestamp:
SELECT * FROM <delta_table>
TIMESTAMP AS OF "2023-03-16 01:34:22";
  1. Using Version number:
SELECT * FROM <delta_table>
VERSION AS OF <version_number>;

(or)

SELECT * FROM <delta_table>@v<version_number>;

In your case, the query should like:

SELECT c_first_name
FROM customer_table VERSION AS OF 0
WHERE c_customer_sk = 1;
arudsekaberne
  • 830
  • 4
  • 11