Jupyter Lab
I installed PySpark with !pip two days ago, and it was working fine. Today, however, a job that had to process the entire dataset to produce its results failed, and the same dataset that loaded correctly before now fails to load at all. I have verified that only one version of PySpark is installed. I also tried plain Jupyter Notebook, but the problem persists, and restarting the kernel and the machine several times has not helped.
I am using Ubuntu, and the data file I'm working with is 27 GB with 500 million rows. If anyone has any insights or possible solutions, I would greatly appreciate your help. Thank you.
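For reference, this is roughly how the session is created and the file is read. I have not set any memory options myself, so the session is running with pip-install defaults; the `spark.driver.memory` value below is an illustrative guess at what might need raising, not something I actually configured:

```python
from pyspark.sql import SparkSession

# Local session with the driver heap raised above the default (~1 GB),
# which a 27 GB / 500M-row file can easily exhaust on wide operations.
# The "8g" figure is illustrative, not my current setting.
spark = (
    SparkSession.builder
    .appName("transactions")
    .config("spark.driver.memory", "8g")
    .getOrCreate()
)

# The same read that fails with the Py4JJavaError
df_pyspark = spark.read.option("header", "true").csv("transactions14_0710.csv")
```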
The error message I am currently encountering is:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
Cell In[5], line 1
----> 1 df_pyspark = spark.read.option('header', 'true').csv('transactions14_0710.csv')
File ~/anaconda3/lib/python3.10/site-packages/pyspark/sql/readwriter.py:410, in DataFrameReader.csv(self, path, schema, sep, encoding, quote, escape, comment, header, inferSchema, ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace, nullValue, nanValue, positiveInf, negativeInf, dateFormat, timestampFormat, maxColumns, maxCharsPerColumn, maxMalformedLogPerPartition, mode, columnNameOfCorruptRecord, multiLine, charToEscapeQuoteEscaping, samplingRatio, enforceSchema, emptyValue, locale, lineSep, pathGlobFilter, recursiveFileLookup, modifiedBefore, modifiedAfter, unescapedQuoteHandling)