
I'm writing a MapReduce program in Spark that implements a simple "People You Might Know" social-network friendship recommendation algorithm. The key idea is that if two people have many mutual friends, the system should recommend that they connect with each other.

Input file: https://s29.picofile.com/file/8463242792/inputsocial.txt.html
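
Each line of the file holds one user's adjacency list: the user's ID, a tab, then a comma-separated list of that user's friend IDs. For illustration (made-up IDs, not the actual file contents), lines look like:

0   1,2,3
1   0,2
2   0,1,4

(the gap between the user ID and the friend list is a tab character)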

from pyspark.sql import SparkSession

# Chain all configs on one builder: getOrCreate() returns the existing
# session, so configs passed on later builders would be silently ignored.
spark = SparkSession.builder \
    .config("spark.network.timeout", "600s") \
    .config("spark.driver.memory", "4g") \
    .getOrCreate()

file_path = "D:/que/inputsocial.txt"

# Step 1: Read the input and build (user, friend) pairs.
# Each line is expected to be "<user>\t<friend1>,<friend2>,...", so split
# on the tab first, then split only the friend list on commas.
data = spark.sparkContext.textFile(file_path)

def parse_line(line):
    user, _, friend_list = line.partition("\t")
    return [(user, friend) for friend in friend_list.split(",") if friend]

pairs = data.flatMap(parse_line)

# Step 2: Group the friend list of each user
user_groups = pairs.groupByKey()

# Step 3: Generate friend pairs for each user
def generate_friend_pairs(user):
    friends = list(user[1])
    friend_pairs = []
    for i in range(len(friends)):
        for j in range(i + 1, len(friends)):
            # Sort each pair so ('1', '2') and ('2', '1') count under the
            # same key in the reduce step
            pair = tuple(sorted((friends[i], friends[j])))
            friend_pairs.append((pair, 1))
    return friend_pairs

mutual_friend_pairs = user_groups.flatMap(generate_friend_pairs)
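
For example (made-up IDs), a user whose friend list is ['1', '2', '3'] contributes one count per unordered pair of their friends:

generate_friend_pairs(('0', ['1', '2', '3']))
# [(('1', '2'), 1), (('1', '3'), 1), (('2', '3'), 1)]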

# Step 4: Count the number of mutual friends for each pair
mutual_friend_counts = mutual_friend_pairs.reduceByKey(lambda a, b: a + b)

# Collect the mutual friend counts RDD
mutual_friend_counts_list = mutual_friend_counts.collect()


# Step 5: Filter out already existing friends and sort recommendations
def filter_and_sort(user):
    user_id = user[0]
    friends = set(user[1])
    recommendations = []
    for (pair, count) in mutual_friend_counts_list:
        if user_id == pair[0] and pair[1] not in friends:
            recommendations.append((pair[1], count))
        elif user_id == pair[1] and pair[0] not in friends:
            recommendations.append((pair[0], count))
    recommendations.sort(key=lambda x: x[1], reverse=True)
    # Keep the user id in the result so the output step knows whose
    # recommendations these are
    return (user_id, recommendations[:10])

# map (not flatMap) so each user yields one (user_id, top-10 list) record
recommendations = user_groups.map(filter_and_sort)
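
With made-up values, each record now has the shape ('0', [('4', 3), ('7', 2), ...]): a user ID followed by up to ten (candidate, mutual-friend count) pairs, which is what the save step below expects.

recommendations.take(1)
# e.g. [('0', [('4', 3), ('7', 2)])]   (made-up values)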


# Save the recommendations as "<user>\t<friend1>,<friend2>,..."
# (note: saveAsTextFile creates a directory of part files, not a single file)
output_file = "D:/que/output.txt"
recommendations.map(lambda x: "\t".join([str(x[0]), ",".join([str(friend) for friend, count in x[1]])])) \
               .saveAsTextFile(output_file)

When I run this part:

# Collect the mutual friend counts RDD
mutual_friend_counts_list = mutual_friend_counts.collect()

I get the following error:

Py4JJavaError                             Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_3420\955289775.py in <module>
      1 # Collect the mutual friend counts RDD
----> 2 mutual_friend_counts_list = mutual_friend_counts.collect()

~\anaconda3\lib\site-packages\pyspark\rdd.py in collect(self)
    948         """
    949         with SCCallSiteSync(self.context) as css:
--> 950             sock_info = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
    951         return list(_load_from_socket(sock_info, self._jrdd_deserializer))
    952 

~\anaconda3\lib\site-packages\py4j\java_gateway.py in __call__(self, *args)
   1319 
   1320         answer = self.gateway_client.send_command(command)
-> 1321         return_value = get_return_value(
   1322             answer, self.gateway_client, self.target_id, self.name)
   1323 

~\anaconda3\lib\site-packages\pyspark\sql\utils.py in deco(*a, **kw)
    109     def deco(*a, **kw):
    110         try:
--> 111             return f(*a, **kw)
    112         except py4j.protocol.Py4JJavaError as e:
    113             converted = convert_exception(e.java_exception)

~\anaconda3\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    324             value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
    325             if answer[1] == REFERENCE_TYPE:
--> 326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
    328                     format(target_id, ".", name), value)

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 36.0 failed 1 times, most recent failure: Lost task 0.0 in stage 36.0 (TID 24) (DESKTOP-RE5RN59 executor driver): org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:115)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:750)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
    ... 19 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
    at scala.Option.foreach(Option.scala:407)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2249)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2268)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2293)
    at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1021)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:406)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:1020)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.spark.SparkException: Python worker failed to connect back.
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:189)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:109)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:124)
    at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:164)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.api.python.PairwiseRDD.compute(PythonRDD.scala:115)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:365)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:329)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:136)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:131)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:535)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:189)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:176)
    ... 19 more

I updated py4j, but it didn't help. I'm new to Spark.

  • Hi! Please provide a [minimal, reproducible example](https://stackoverflow.com/help/minimal-reproducible-example) so we can run the code ourselves and see what went awry - that is to say we need some data to run the code – Simon David May 12 '23 at 17:01
  • @SimonDavid Thank you very much for your attention and reminder, the input file has been added : s29.picofile.com/file/8463242792/inputsocial.txt.html – simin May 12 '23 at 18:10
  • What data format are you expecting from `pairs`? There might be a problem with the way you create the pairs, because `pairs.take(10)` (which executes the pair-creating logic) takes a very long time; after 6 minutes it still hadn't finished in my environment. Looking at your code, I see you handle `\t` characters, but the file also contains `\n`, so you might need to revise this step. – Simon David May 12 '23 at 18:48
  • Welcome to SO! As @SimonDavid said, the way the code parses and transforms the data is inefficient. For example, if you don't split each line twice, you can parse the data more efficiently in the given `flatMap`. Generally, when processing data with RDDs, the programmer has to manage memory more carefully. The underlying problem looks related to RAM usage on the executor, which after a while is unable to connect back to the driver. – meysam May 14 '23 at 08:29

0 Answers