I have a DataFrame (df) with 1 million rows and two columns: ID (long int) and description (string). After transforming the descriptions into TF-IDF features (using Tokenizer, HashingTF, and IDF), the DataFrame has two columns: ID and features (a sparse vector).
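For reference, the feature pipeline looks roughly like this (a simplified sketch; the intermediate column names and numFeatures are placeholders, not necessarily my exact values):

```
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

# Sketch of the TF-IDF pipeline described above; intermediate column
# names and numFeatures are placeholders.
tokenizer = Tokenizer(inputCol="description", outputCol="words")
words = tokenizer.transform(df)

hashing_tf = HashingTF(inputCol="words", outputCol="raw_features", numFeatures=1 << 18)
tf = hashing_tf.transform(words)

idf_model = IDF(inputCol="raw_features", outputCol="features").fit(tf)
data = idf_model.transform(tf).select("ID", "features")
```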
I computed the item-item similarity matrix with a UDF that takes the dot product of two feature vectors.
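The UDF is essentially the following (a minimal sketch; my actual implementation may differ slightly):

```
import pyspark.sql.functions as psf
from pyspark.sql.types import DoubleType

# Dot product of two sparse vectors; SparseVector.dot handles the
# sparse-sparse case, and float() unwraps the numpy scalar for Spark.
dot_udf = psf.udf(lambda u, v: float(u.dot(v)), DoubleType())
```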
Computing the similarities itself finishes successfully. However, when I call show() on the result, I get:

`raise EOFError`
I have read many questions on this issue but have not found the right answer yet. Note that if I run the same code on a small dataset (e.g., 100 rows), everything works fine.
Is this related to an out-of-memory issue? I checked the dataset and the description column, and I don't see any records with null or malformed text.
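If it is memory-related, I assume the relevant knobs are the memory options of spark-submit (I run Spark locally via the Homebrew install, so presumably only the driver memory matters); for example (my_script.py is a placeholder):

```
spark-submit --driver-memory 8g --executor-memory 8g my_script.py
```

Here is the code that builds the similarity matrix: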
```
import pyspark.sql.functions as psf

# Self-join on ID ordering so each unordered pair appears once,
# then score every pair with the dot-product UDF.
dist_mat = data.alias("i").join(data.alias("j"), psf.col("i.ID") < psf.col("j.ID")) \
    .select(psf.col("i.ID").alias("i"), psf.col("j.ID").alias("j"),
            dot_udf("i.features", "j.features").alias("score"))
dist_mat = dist_mat.filter(psf.col("score") > 0.05)
dist_mat.show(1)
```
If I remove the last line, dist_mat.show(1), it runs without error. However, with that line I get an error like:
.......
```Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded```
...
Here is part of the error message:
```
[Stage 6:=======================================================> (38 + 1) / 39]Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
  File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
  File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
    raise EOFError
EOFError
```