Jupyter Lab
I installed PySpark with !pip two days ago, and it was working fine. Today, however, a job that had to process the entire dataset to produce its results failed, and the same dataset that loaded correctly before now fails to load at all. I have verified that only one version of PySpark is installed. I also tried plain Jupyter Notebook, but the problem persists, and restarting the kernel and the machine several times has not helped.
I am using Ubuntu, and the data file I'm working with is 27 GB with 500 million rows. If anyone has any insights or possible solutions, I would greatly appreciate your help. Thank you.
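For reference, this is roughly how the session is created and the file is read. I have not set any memory options myself, so the session is running with pip-install defaults; the `spark.driver.memory` value below is an illustrative guess at what might need raising, not something I actually configured:

```python
from pyspark.sql import SparkSession

# Local session with the driver heap raised above the default (~1 GB),
# which a 27 GB / 500M-row file can easily exhaust on wide operations.
# The "8g" figure is illustrative, not my current setting.
spark = (
    SparkSession.builder
    .appName("transactions")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)

# The same read that fails with the Py4JJavaError
df_pyspark = spark.read.option("header", "true").csv("transactions14_0710.csv")
```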
The error message I am currently encountering is:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[5], line 1
----> 1 df_pyspark = spark.read.option('header', 'true').csv('transactions14_0710.csv')
File ~/anaconda3/lib/python3.10/site-packages/pyspark/sql/readwriter.py:410, in DataFrameReader.csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)