I'm new to Spark and PySpark. I downloaded data from here (a 1.75 GB archive of multiple .csv files) and stored it on my D: drive, separately from my Spark installation and my PySpark script, which are on my C: drive.
When I try to read the files, I get the following error:
---------------------------------------------------------------------------
AnalysisException Traceback (most recent call last)
Cell In[12], line 3
1 df = spark.read.option("header", True) \
2 .option("inferSchema", True) \
----> 3 .csv("\airport_delay")
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\sql\readwriter.py:535, in DataFrameReader.csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)
533 if type(path) == list:
534 assert self._spark._sc._jvm is not None
--> 535 return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
536 elif isinstance(path, RDD):
538 def func(iterator):
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\py4j\java_gateway.py:1321, in JavaMember.__call__(self, *args)
1315 command = proto.CALL_COMMAND_NAME +\
1316 self.command_header +\
1317 args_command +\
1318 proto.END_COMMAND_PART
1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
1322 answer, self.gateway_client, self.target_id, self.name)
1324 for temp_arg in temp_args:
1325 temp_arg._detach()
File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\sql\utils.py:196, in capture_sql_exception.<locals>.deco(*a, **kw)
192 converted = convert_exception(e.java_exception)
193 if not isinstance(converted, UnknownException):
194 # Hide where the exception came from that shows a non-Pythonic
195 # JVM exception message.
--> 196 raise converted from None
197 else:
198 raise
AnalysisException: Path does not exist: file:/C:/Users/Travail/Documents/PySpark/irport_delay
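One detail I notice when re-reading the traceback: the error reports irport_delay, not airport_delay. A quick check in plain Python (no Spark involved, just an illustration) suggests the leading "\a" in my path string is being treated as an escape sequence:

# Plain Python: "\a" is the ASCII bell escape character, which would
# explain why the leading "a" of "airport_delay" disappears from the path.
print(repr("\airport_delay"))  # prints '\x07irport_delay'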
My script is:

from pyspark.sql import SparkSession
spark = SparkSession.builder \
.master("local[1]") \
.appName("Test1") \
.getOrCreate()
df = spark.read.option("header", True) \
.option("inferSchema", True) \
.csv("file:\\\D:\Dataset\airport_delay")
How can I read data from another drive with PySpark? Or does it make no sense to do so?
I tried:
- adding/removing "file:\" (see the sketch below)
- reading the Spark configuration documentation, looking for something similar to "spark.sql.warehouse.dir"
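To make the first point explicit, here are the two path variants I tried (a minimal sketch using the spark session from the script above; D:\Dataset\airport_delay is just where I extracted the archive on my machine). Neither works for me:

# Variant 1: relative path, as in the traceback above.
df = spark.read.option("header", True) \
    .option("inferSchema", True) \
    .csv("\airport_delay")

# Variant 2: absolute path with a "file:" prefix, as in my script.
df = spark.read.option("header", True) \
    .option("inferSchema", True) \
    .csv("file:\\\D:\Dataset\airport_delay")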