
I have a dataframe (df) with 1 million rows and two columns: ID (long int) and description (string). After transforming the descriptions into TF-IDF features (using Tokenizer, HashingTF, and IDF), the dataframe df has two columns: ID and features (a sparse vector).
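For context, the TF-IDF step looks roughly like the sketch below (column names and numFeatures are illustrative, not necessarily my exact settings):

```
from pyspark.ml.feature import Tokenizer, HashingTF, IDF

# Tokenize the raw descriptions into words
tokenizer = Tokenizer(inputCol="description", outputCol="words")
words = tokenizer.transform(df)

# Hash the words into term-frequency vectors (numFeatures is illustrative)
hashing_tf = HashingTF(inputCol="words", outputCol="rawFeatures", numFeatures=1 << 18)
tf = hashing_tf.transform(words)

# Rescale term frequencies by inverse document frequency
idf = IDF(inputCol="rawFeatures", outputCol="features")
tfidf = idf.fit(tf).transform(tf)

data = tfidf.select("ID", "features")
```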

I computed the item-item similarity matrix using a UDF that takes the dot product of the feature vectors.
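
Since dot_udf is used in the code below but not shown, here is a minimal sketch of how such a UDF can be defined, assuming the features column holds pyspark.ml sparse vectors (which expose a .dot() method):

```
import pyspark.sql.functions as psf
from pyspark.sql.types import DoubleType

# SparseVector from pyspark.ml has a .dot() method, so the UDF can delegate to it
dot_udf = psf.udf(lambda v1, v2: float(v1.dot(v2)), DoubleType())
```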

Computing the similarities completes successfully.

However, when I call the show() function, I get

"raise EOFError"

I have read many questions on this issue but have not found the right answer yet.

Note that if I apply my solution to a small dataset (say, 100 rows), everything works fine.

Is it related to an out-of-memory issue?

I checked my dataset and the description column; I don't see any records with null or unsupported text.
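
Something along these lines (illustrative, not my exact check):

```
import pyspark.sql.functions as psf

# Count records with a null or empty description
df.filter(psf.col("description").isNull() | (psf.col("description") == "")).count()
```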

```
# Self-join on ID to form all unordered pairs and score each pair with the dot-product UDF
dist_mat = data.alias("i").join(data.alias("j"), psf.col("i.ID") < psf.col("j.ID")) \
    .select(psf.col("i.ID").alias("i"), psf.col("j.ID").alias("j"),
            dot_udf("i.features", "j.features").alias("score"))

# Keep only sufficiently similar pairs
dist_mat = dist_mat.filter(psf.col('score') > 0.05)

dist_mat.show(1)
```


If I remove the last line, dist_mat.show(1), it works without error. However, when I include this line, I get an error like
.......
```Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded```
...
Here is the relevant part of the error message:
```
[Stage 6:=======================================================> (38 + 1) / 39]Traceback (most recent call last):
  File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
  File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
  File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/usr/local/Cellar/apache-spark/2.4.0/libexec/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
    raise EOFError
EOFError
```
  • Are you sure that it is calculated successfully? Keep in mind that Spark is [lazily](https://stackoverflow.com/questions/38027877/spark-transformation-why-its-lazy-and-what-is-the-advantage) evaluated. – cronoik Jul 06 '19 at 11:54
  • Thanks @cronoik for your response. I think it is calculated successfully. I checked it by adding an instruction after the computation, like `dist_mat.printSchema()`. The error appears after the schema is printed. I also tested it without the filter instruction but still get the same error. Since my solution otherwise works without any error and I'm guessing it is probably a memory-related error, I'm planning to run it on an EMR cluster with more memory. I'll let you know what I see after that. – Shariful Islam Jul 08 '19 at 13:25
  • I also think that this is a memory issue and not a logic issue. You are running out of memory as soon as `show()` is called, not because `show()` itself requires much memory but because it triggers all the calculations. `printSchema()` doesn't trigger the calculations. Try it with more memory. – cronoik Jul 08 '19 at 16:16

1 Answer


I increased the cluster size and ran it again. It works without errors, so the error message was accurate: `Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded`.
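
For reference, a sketch of how memory can be raised (the values below are illustrative, not my actual cluster configuration):

```
from pyspark.sql import SparkSession

# Illustrative settings only. Note: spark.driver.memory generally has to be set
# before the driver JVM starts, e.g. via spark-submit --driver-memory or
# spark-defaults.conf, rather than in the builder of an already running app.
spark = (SparkSession.builder
         .appName("item-item-similarity")
         .config("spark.executor.memory", "8g")
         .config("spark.executor.memoryOverhead", "2g")
         .getOrCreate())
```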

However, for computing pairwise similarities on such a large-scale matrix, I found an alternative solution: Large scale matrix multiplication with pyspark.

In fact, it is very efficient and much faster, even better than using BlockMatrix.
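
For comparison, another way to get pairwise similarities at this scale (not necessarily the approach from the linked answer) is RowMatrix.columnSimilarities from pyspark.mllib; note that it computes cosine similarities rather than raw dot products:

```
from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry

# Build a (feature_index, item_ID) coordinate matrix so that each item becomes a COLUMN.
# Assumes IDs are non-negative integers and features are pyspark.ml sparse vectors.
entries = data.rdd.flatMap(
    lambda row: [MatrixEntry(int(i), row.ID, float(v))
                 for i, v in zip(row.features.indices, row.features.values)])

mat = CoordinateMatrix(entries).toRowMatrix()

# DIMSUM-based cosine similarities between item columns; pairs below the
# threshold may be dropped or approximated.
sims = mat.columnSimilarities(0.05)  # CoordinateMatrix of (i, j, similarity)
```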