
I have an issue with loading a CSV into a DataFrame, adding some rows, and overwriting the CSV at the source location.

To keep it simple, I first tried to just load the CSV and overwrite it in place; this does not work either. After that I tried something even simpler: load the CSV and write it to an empty location. That is also not working.

Please help

I made a test script:

# Importing the different packages.

from datetime import datetime, timedelta
from delta import *



# Importing Spark Packages.
from pyspark.sql.functions import col, lit, hash, array, expr, when, regexp_replace, sha2, concat_ws, concat
from pyspark.sql.types import BooleanType, IntegerType, TimestampType

df_target = spark.read.options(header='True') \
    .csv('abfss://sdas@da.dfs.core.windows.net/metadata/Test')


output_path = "abfss://sdas@da.dfs.core.windows.net/metadata/Test/"

df_target = df_target.repartition(1)
df_target.write.option("header", True).mode("overwrite").csv(output_path)

It gives the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In [27], line 4
      1 output_path = "abfss://.dfs.core.windows.net/metadata/Test/"
      3 df_target = df_target.repartition(1)
----> 4 df_target.write.option("header", True).mode("overwrite").csv(output_path)
      5 #df_target.write.format("csv").mode("overwrite").save(output_path/test.csv)

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py:1240, in DataFrameWriter.csv(self, path, mode, compression, sep, quote, escape, header, nullValue, escapeQuotes, quoteAll, dateFormat, timestampFormat, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, charToEscapeQuoteEscaping, encoding, emptyValue, lineSep)
   1221 self.mode(mode)
   1222 self._set_opts(
   1223     compression=compression,
   1224     sep=sep,
   (...)
   1238     lineSep=lineSep,
   1239 )
-> 1240 self._jwrite.csv(path)

File ~/cluster-env/env/lib/python3.10/site-packages/py4j/java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py:190, in capture_sql_exception.<locals>.deco(*a, **kw)
    188 def deco(*a: Any, **kw: Any) -> Any:
    189     try:
--> 190         return f(*a, **kw)
    191     except Py4JJavaError as e:
    192         converted = convert_exception(e.java_exception)

File ~/cluster-env/env/lib/python3.10/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
    324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325 if answer[1] == REFERENCE_TYPE:
--> 326     raise Py4JJavaError(
    327         "An error occurred while calling {0}{1}{2}.\n".
    328         format(target_id, ".", name), value)
    329 else:
    330     raise Py4JError(
    331         "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332         format(target_id, ".", name, value))

Py4JJavaError: An error occurred while calling o3902.csv.
: org.apache.spark.SparkException: Job aborted.
    at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:651)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:283)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:187)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:108)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$6(SQLExecution.scala:111)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:183)
    at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:97)
    at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:779)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:66)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:108)
    at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:104)
    at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:176)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:584)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
    at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:31)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:560)
    at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:104)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:88)
    at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:82)
    at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:136)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:901)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:415)
    at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:382)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:241)
    at org.apache.spark.sql.DataFrameWriter.csv(DataFrameWriter.scala:892)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 7.0 failed 4 times, most recent failure: Lost task 0.3 in stage 7.0 (TID 22) (vm-d5418561 executor 1): java.io.FileNotFoundException: 
Operation failed: "The specified path does not exist.", 404, HEAD, https://.dfs.core.windows.net/zmsynapsedlhda/metadata/tmas/Test/test.csv?upn=false&action=getStatus&timeout=90

It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
       
    at org.apache.spark.sql.errors.QueryExecutionErrors$.readCurrentFileNotFoundError(QueryExecutionErrors.scala:661)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:248)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:308)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:149)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2682)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2618)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2617)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2617)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1190)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1190)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1190)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2870)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2812)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2801)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
Caused by: java.io.FileNotFoundException: 
Operation failed: "The specified path does not exist.", 404, HEAD, https://.dfs.core.windows.net/zmsynapsedlhda/metadata/tmas/Test/test.csv?upn=false&action=getStatus&timeout=90

It is possible the underlying files have been updated. You can explicitly invalidate
the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by
recreating the Dataset/DataFrame involved.
       
    at org.apache.spark.sql.errors.QueryExecutionErrors$.readCurrentFileNotFoundError(QueryExecutionErrors.scala:661)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:248)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:308)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:149)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:460)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:140)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
chr1s84
2 Answers


> It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

As per the error message, when you change the data in a few rows of a table or DataFrame and then quickly query that table or DataFrame again, this error is the anticipated behavior.

To resolve this, you can try the solutions below:

  • Refresh all cached records connected to the table with the command below:
REFRESH TABLE [db_name.]table_name
  • Restart your Spark pool.

I tried similar code and it's working fine for me.
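
Since the question reads the files straight into a DataFrame rather than through a registered table or view, the path-based refresh, `spark.catalog.refreshByPath`, may be the closer fit. A minimal sketch, assuming the `spark` session and the source path from the question:

# Invalidate cached file listings and metadata for any DataFrame built on this path.
source_path = 'abfss://sdas@da.dfs.core.windows.net/metadata/Test'
spark.catalog.refreshByPath(source_path)

# Recreate the DataFrame after the refresh so it picks up the current files.
df_target = spark.read.options(header='True').csv(source_path)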

Pratik Lad
  • I'm not using SQL, tables, or views, and if I restart my Spark cluster I also lose my DataFrame, or am I missing something? – chr1s84 May 19 '23 at 11:56

The code you ran and the code you posted are different. Specifically line 1: output_path = "abfss://.dfs.core.windows.net/metadata/Test/"

Posted:

output_path = "abfss://sdas@da.dfs.core.windows.net/metadata/tmas/Test/"

df_target = df_target.repartition(1)
df_target.write.option("header", True).mode("overwrite").csv(output_path)

From the error:

Cell In [27], line 4
      1 output_path = "abfss://.dfs.core.windows.net/metadata/Test/"
      3 df_target = df_target.repartition(1)
----> 4 df_target.write.option("header", True).mode("overwrite").csv(output_path)
      5 #df_target.write.format("csv").mode("overwrite").save(output_path/test.csv)

Change "abfss://.dfs.core..." to "abfss://output_path = "abfss://sdas@da.dfs.core..."


-Edit-

base_path = 'abfss://sdas@da.dfs.core.windows.net/metadata/'
df_target = spark.read.options(header='True').csv(f'{base_path}/Test').repartition(1)
df_target.show()
print(f'writing to: {base_path}/output')
df_target.write.option("header", True).mode("overwrite").csv(f'{base_path}/output')
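
Note that this reads from {base_path}/Test but writes to {base_path}/output, so the write never has to replace the files it is still reading. If the goal is to overwrite the folder that was read, as in the original question, one common workaround is to stage the result somewhere else first. A minimal sketch, assuming the same paths as above and a hypothetical 'staging' folder:

# Step 1: persist the result to a staging folder, so the write does not compete
# with the source files it is reading ('staging' is an illustrative folder name).
base_path = 'abfss://sdas@da.dfs.core.windows.net/metadata'
df = spark.read.options(header='True').csv(f'{base_path}/Test')
df.repartition(1).write.option("header", True).mode("overwrite").csv(f'{base_path}/staging')

# Step 2: re-read from staging and overwrite the original folder; this read no
# longer depends on the files being replaced.
df_staged = spark.read.options(header='True').csv(f'{base_path}/staging')
df_staged.repartition(1).write.option("header", True).mode("overwrite").csv(f'{base_path}/Test')
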
Kashyap
  • That's a mistake on my part, but it's not the issue. Both locations are the same. – chr1s84 May 19 '23 at 15:57
  • A mistake in what? Is the mistake in the error message or in the code you posted? If the error message is correct, then clearly `abfss://.dfs.core.windows.net/metadata/Test/` is an invalid path and likely the problem. – Kashyap May 19 '23 at 16:04
  • @chr1s84 can you run the code I added to the bottom of my post and post its output? – Kashyap May 19 '23 at 16:14
  • OK, the code above works, but it does not solve my issue; I still have the same problem if I do this: `base_path = 'abfss://sdas@da.dfs.core.windows.net/metadata/' df_target = spark.read.options(header='True').csv(f'{base_path}/Output').repartition(1) df_target.show() print(f'writing to: {base_path}/Intput') df_target.write.option("header", True).mode("overwrite").csv(f'{base_path}/Intput') print(f'writing to: {base_path}/Output') df_target.write.option("header", True).mode("overwrite").csv(f'{base_path}/Output')` It fails at overwriting the Output in the last two lines. – chr1s84 May 20 '23 at 19:36
  • To provide an answer as useful as your comment (which just conveys "it doesn't work"): although it fails at the overwriting in the last two lines, the error could be anywhere, not just in the last two lines. – Kashyap May 22 '23 at 15:14