
I'm retrieving two files from container1, transforming them, and merging the results before writing to container2 within the same Storage Account in Azure. I mount container1, unmount it, and then mount container2 before writing.
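The mount step looks roughly like this (a minimal sketch of the standard Databricks OAuth mount pattern; every angle-bracketed value and the /mnt/source mount point are placeholders for my actual app registration and storage account):

configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type": "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-id>",
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get(scope="<scope>", key="<secret-key>"),
    "fs.azure.account.oauth2.client.endpoint": "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# Unmount container1, then mount container2 before the write step
dbutils.fs.unmount("/mnt/source")
dbutils.fs.mount(
    source="abfss://container2@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/temp",
    extra_configs=configs,
)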

My code for writing the parquet files:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df_spark.coalesce(1).write.option("header",True) \
        .partitionBy('ZMTART') \
        .mode("overwrite") \
        .parquet('/mnt/temp/')

I'm getting the following error when writing to container2:


---------------------------------------------------------------------------

Py4JJavaError                             Traceback (most recent call last)
<command-3769031361803403> in <cell line: 2>()
      1 spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
----> 2 df_spark.coalesce(1).write.option("header",True) \
      3         .partitionBy('ZMTART') \
      4         .mode("overwrite") \
      5         .parquet('/mnt/temp/')
 
/databricks/spark/python/pyspark/instrumentation_utils.py in wrapper(*args, **kwargs)
     46             start = time.perf_counter()
     47             try:
---> 48                 res = func(*args, **kwargs)
     49                 logger.log_success(
     50                     module_name, class_name, function_name, time.perf_counter() - start, signature
 
/databricks/spark/python/pyspark/sql/readwriter.py in parquet(self, path, mode, partitionBy, compression)
   1138             self.partitionBy(partitionBy)
   1139         self._set_opts(compression=compression)
-> 1140         self._jwrite.parquet(path)
   1141 

The odd thing is that writing the exact same DataFrame to container1 works fine, even using the same write code with a different mount. Generating random data in the script and writing that to container2 also works. Evidently, the problem is specific to that DataFrame in that container.
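To see the underlying JVM-side cause, the write can be wrapped like this (a diagnostic sketch around my existing DataFrame and path):

from py4j.protocol import Py4JJavaError

try:
    df_spark.coalesce(1).write.option("header", True) \
        .partitionBy("ZMTART") \
        .mode("overwrite") \
        .parquet("/mnt/temp/")
except Py4JJavaError as e:
    # The Java exception chain usually names the root cause
    # (e.g. a 403 from storage, a schema problem, etc.)
    print(e.java_exception)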

I'm fairly new to Databricks, so please let me know if any additional information is needed.

I have also tried converting the DataFrame from pandas-on-Spark to a plain Spark DataFrame before writing to ADLS.
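That conversion step looked roughly like this (a sketch with placeholder data; my real pandas-on-Spark DataFrame comes from the transformation step):

import pyspark.pandas as ps

# Placeholder pandas-on-Spark DataFrame standing in for the transformed data
pdf = ps.DataFrame({"ZMTART": ["A", "B"], "value": [1, 2]})

# Convert back to a regular Spark DataFrame before writing
df_spark = pdf.to_spark()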

Fred
  • Check whether your ADLS container (mount point) has write permissions enabled. Since you are using OAuth authentication with an app registration, the app registration may not have write permissions on the container, hence the error. – Saideep Arikontham Nov 23 '22 at 05:44

1 Answer

  • Before writing to container2, make sure you have write permissions on that container. I tried writing to my container without write permissions using similar code and got the same error.
spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")
df.coalesce(1).write.option("header",True) \
        .partitionBy('id') \
        .mode("overwrite") \
        .parquet('/mnt/repro/')

(screenshot: the Py4JJavaError raised by the write)

  • This was because my registered service principal did not have write permissions on the container.

(screenshot: the service principal's permissions, lacking write access)

  • Once you grant the permissions and execute the code again, the operation succeeds (see the sketch after the screenshot below for a quick way to verify access first).

(screenshot: the same write completing successfully after granting permissions)
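For a quick pre-check (a minimal sketch; the path is illustrative), attempt a trivial write to the mount before running the full job:

# Attempt a tiny write to the mount; this fails with a storage 403
# if the principal behind the mount lacks write access
# (e.g. no Storage Blob Data Contributor role on the account/container).
dbutils.fs.put("/mnt/repro/_perm_check.txt", "ok", True)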

Saideep Arikontham