
I am using Databricks on Azure, and PySpark reads data that is dumped into Azure Data Lake Storage (ADLS). Every now and then, when I try to read the data from ADLS like so:
spark.read.format('delta').load('/path/to/adls/mounted/interim_data.delta')

it throws the following error

AnalysisException: `/path/to/adls/mounted/interim_data.delta` is not a Delta table.

The data definitely exists; the folder contents and files show up when I run
%fs ls /path/to/adls/mounted/interim_data.delta

Right now the only fix is to re-run the script that populated the interim_data.delta table, which is not viable.
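
To confirm that the path really stops being a Delta table (rather than just disappearing from the listing), a check like the one below can be run from the downstream notebook. This is a rough sketch: DeltaTable.isDeltaTable is the Delta Lake Python API available on Databricks, and the path is the same placeholder as above.

from delta.tables import DeltaTable

path = "/path/to/adls/mounted/interim_data.delta"

# isDeltaTable() is True only if a readable _delta_log exists at the path;
# a plain folder listing (like %fs ls) does not prove the Delta log is accessible.
print(DeltaTable.isDeltaTable(spark, path))
print(dbutils.fs.ls(path))  # lists the files, including _delta_log, if the mount is healthy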

Rony
  • What is the command you used to create the delta table? – Steven Jul 01 '21 at 08:09
  • Hi @Steven ... df_table.write.format("delta").mode("overwrite").save(f"/dynamic/{path}/to/adls/interim_data.delta") If it helps to know: there are multiple scripts that read off the interim_data.delta table created in the above manner; it's unlikely that the downstream scripts access the interim_data.delta table simultaneously – Rony Jul 01 '21 at 08:52
  • OK. So both read and write look the same. Did you print the "path" to check that they are the same? I see you use an f-string, so did you check that they have exactly the same value? (Because currently you are showing example paths, which I guess are not the exact ones you are using.) – Steven Jul 01 '21 at 08:59
  • @Steven It's not a path issue... yes, the read and write paths are the same, although the write happens in upstream_NB and the read happens in downstream_NB. The read can also be done by downstream_NB1, downstream_NB2, etc. Yes, it's an example path, but it captures the gist of it. – Rony Jul 01 '21 at 09:22
  • Is that the first time you try to read a delta table? Or did you already manage to read another delta table? Do you only have the issue with this one? I know I am asking trivial questions, but they help me figure out what could be wrong. – Steven Jul 01 '21 at 09:24
  • First time that I am trying to read the delta table: no, I read the table a few times, and then for no reason it stops being readable, with the only way to fix it being to overwrite the table using the script that created it in the first place. Already managed to read other delta tables: yes. Issue with only this one: no, other tables show this behavior too, intermittently. I am guessing there is a problem with the way we are reading [multiple NBs at various points in time] or writing, but this seems to be the standard way to read and write into the ADLS storage mounted onto DBFS at /mnt/somePoint – Rony Jul 01 '21 at 09:28
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/234406/discussion-between-rony-and-steven). – Rony Jul 01 '21 at 09:41

2 Answers


I am answering my own question...

TL;DR: the root cause of the issue was frequent remounting of ADLS.

There was a section of code that remounts ADLS Gen2 to ADB (Azure Databricks). When other teams ran their scripts, the remounting took 20-45 seconds, and as the number of scripts running on the high-concurrency cluster increased, it was only a matter of time before one of us hit the issue where a script tried to read data from ADLS while it was being mounted...

This is how it turned out to be intermittent...

Why was this remounting hack in place? It was put there because we had faced an issue with data not showing up in ADB even though it was visible in ADLS Gen2, and the only way to fix that back then was to force a remount to make the data visible in ADB.
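
One way to avoid that race, instead of force-remounting on every run, is to mount only when the mount point is actually missing. A minimal sketch of that guard is below (the mount point, storage URL, and configs dict are placeholders, not our actual values):

# Skip the remount if the mount point already exists; dbutils.fs.mounts()
# lists the mounts currently visible to the cluster.
mount_point = "/mnt/somePoint"  # placeholder
source = "abfss://<container>@<storageaccount>.dfs.core.windows.net/"  # placeholder
configs = {}  # the usual ADLS Gen2 OAuth / service-principal configs go here (placeholder)

if not any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=configs)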

Rony

Make sure you have copied the data in delta format correctly.

Below is the standard command to do so:

# file_type and file_location refer to the format and path of the raw source file (e.g. "csv" and a /mnt/... path)
df = spark.read.format(file_type).option("header", "true").option("inferSchema", "true").option("delimiter", "|").load(file_location)

df.write.format("delta").save("/mnt/delta/events")  # save() requires a target path, e.g. the DBFS path used below

You access data in Delta tables either by specifying the path on DBFS ("/mnt/delta/events") or the table name ("events"). Make sure the path or table name is in the correct format. Please refer to the example below:

events = spark.read.format("delta").load("/mnt/delta/events")
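
For the table-name form, assuming a table named events has been registered in the metastore (a hypothetical name here), the read would look like this:

events = spark.read.table("events")  # equivalent to spark.table("events")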

Refer to https://learn.microsoft.com/en-us/azure/databricks/delta/quick-start#read-a-table to know more about Delta Lake.

Feel free to ask if you have any questions.

Utkarsh Pal