I'm new to Spark and PySpark. I downloaded data from here (a 1.75 GB archive of multiple .csv files) and stored it on my D: drive, separately from my Spark installation and my PySpark script, which are on my C: drive.

When I try to read them, I get the following error:

---------------------------------------------------------------------------
AnalysisException                         Traceback (most recent call last)
Cell In[12], line 3
      1 df = spark.read.option("header", True) \
      2                 .option("inferSchema", True) \
----> 3                 .csv("\airport_delay")

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\sql\readwriter.py:535, in DataFrameReader.csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)
    533 if type(path) == list:
    534     assert self._spark._sc._jvm is not None
--> 535     return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
    536 elif isinstance(path, RDD):
    538     def func(iterator):

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\py4j\java_gateway.py:1321, in JavaMember.__call__(self, *args)
   1315 command = proto.CALL_COMMAND_NAME +\
   1316     self.command_header +\
   1317     args_command +\
   1318     proto.END_COMMAND_PART
   1320 answer = self.gateway_client.send_command(command)
-> 1321 return_value = get_return_value(
   1322     answer, self.gateway_client, self.target_id, self.name)
   1324 for temp_arg in temp_args:
   1325     temp_arg._detach()

File ~\AppData\Local\Programs\Python\Python311\Lib\site-packages\pyspark\sql\utils.py:196, in capture_sql_exception.<locals>.deco(*a, **kw)
    192 converted = convert_exception(e.java_exception)
    193 if not isinstance(converted, UnknownException):
    194     # Hide where the exception came from that shows a non-Pythonic
    195     # JVM exception message.
--> 196     raise converted from None
    197 else:
    198     raise

AnalysisException: Path does not exist: file:/C:/Users/Travail/Documents/PySpark/irport_delay
Here is my script:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .master("local[1]") \
                    .appName("Test1") \
                    .getOrCreate()

df = spark.read.option("header", True) \
                .option("inferSchema", True) \
                .csv("file:\\\D:\Dataset\airport_delay")

How can I read data from another disk with PySpark? Or does it make no sense to do so?

I tried:

- adding/removing "file:\"
- reading the Spark configuration documentation and looking for something similar to "spark.sql.warehouse.dir"

1 Answer

I changed all of the "\" to "/" and it worked. In a regular Python string a backslash starts an escape sequence: "\a" is the ASCII bell character, which is why the error message shows the path as irport_delay with the leading "a" swallowed.
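You can see the escaping at work with a quick check (a minimal snippet, using the path literal from the question):

# "\a" is parsed as the BEL control character, not a backslash followed by "a"
print(repr("\airport_delay"))   # '\x07irport_delay'
# a raw string keeps the backslash intact
print(repr(r"\airport_delay"))  # '\\airport_delay'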

df = spark.read.option("header", True) \
                .option("inferSchema", True) \
                .csv("file:///D:/Dataset/airport_delay")