
I am using boilerpipe to extract text from HTML, but there is an issue I have not been able to resolve. I have a list of 50k elements. I create an RDD of 1000 elements at a time, process it, and save the resulting RDD to HDFS. The error I encounter is this:

ERROR:root:Exception while sending command.
Traceback (most recent call last):
  File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
    response = connection.send_command(command)
  File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
    "Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
  File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 265, in <module>
    x = get_data(line[:-1],c)
  File "/home/hadoopuser/CommonCrawl_Spark/CommonCrawl_Spark/all.py", line 208, in get_data
    sc.parallelize(warcrecords).repartition(72).map(lambda s: classify(s)).saveAsTextFile(file_name)
  File "/home/hadoopuser/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 1552, in saveAsTextFile
  File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
  File "/home/hadoopuser/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o40.saveAsTextFile
17/09/19 18:11:10 INFO SparkContext: Invoking stop() from shutdown hook
17/09/19 18:11:10 INFO SparkUI: Stopped Spark web UI at http://192.168.0.255:4040
17/09/19 18:11:10 INFO DAGScheduler: Job 0 failed: saveAsTextFile at NativeMethodAccessorImpl.java:0, took 14.746797 s
17/09/19 18:11:10 INFO DAGScheduler: ResultStage 1 (saveAsTextFile at NativeMethodAccessorImpl.java:0) failed in 7.906 s due to Stage cancelled because SparkContext was shut down
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerStageCompleted(org.apache.spark.scheduler.StageInfo@ec3ca3)
17/09/19 18:11:10 ERROR LiveListenerBus: SparkListenerBus has already stopped! Dropping event SparkListenerJobEnd(0,1505824870317,JobFailed(org.apache.spark.SparkException: Job 0 cancelled because SparkContext was shut down))
17/09/19 18:11:10 INFO StandaloneSchedulerBackend: Shutting down all executors
17/09/19 18:11:10 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
17/09/19 18:11:10 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/09/19 18:11:10 INFO MemoryStore: MemoryStore cleared
17/09/19 18:11:10 INFO BlockManager: BlockManager stopped
17/09/19 18:11:10 INFO BlockManagerMaster: BlockManagerMaster stopped
17/09/19 18:11:10 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/09/19 18:11:10 INFO SparkContext: Successfully stopped SparkContext
17/09/19 18:11:10 INFO ShutdownHookManager: Shutdown hook called
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763/pyspark-b73e541b-1182-4449-96bc-26eabca1803d
17/09/19 18:11:10 INFO ShutdownHookManager: Deleting directory /tmp/spark-35ea0cd4-4b78-408b-8c3a-9966c1f84763

In the HDFS output, the results of the first 1000 elements are saved, but every batch after that fails with the above error. What could be causing this, and how do I fix it? A simplified sketch of the driver loop is below for reference.
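Here is roughly the structure of the relevant part of all.py (classify(), warcrecords, and repartition(72) appear in the traceback above; all_records, chunks(), and the output path are simplified placeholders):

    from pyspark import SparkContext
    from boilerpipe.extract import Extractor  # module-level import on the driver

    sc = SparkContext(appName="CommonCrawl_Spark")

    def classify(record):
        # extract the main article text from the raw HTML with boilerpipe
        extractor = Extractor(extractor='ArticleExtractor', html=record)
        return extractor.getText()

    def chunks(lst, n):
        # split the full 50k-element list into batches of n
        for i in range(0, len(lst), n):
            yield lst[i:i + n]

    for i, warcrecords in enumerate(chunks(all_records, 1000)):
        sc.parallelize(warcrecords).repartition(72) \
            .map(lambda s: classify(s)) \
            .saveAsTextFile("hdfs:///output/batch_%d" % i)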

1 Answer


Removing this line from the code did the trick. I still don't know why:

from boilerpipe.extract import Extractor
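My guess at the cause (not verified): the boilerpipe Python wrapper starts a JVM through JPype when boilerpipe.extract is imported, and starting that JVM in the driver process can interfere with the Py4J gateway PySpark uses, which would explain the Py4JNetworkError. If the extraction is still needed inside classify(), a workaround sketch is to import boilerpipe inside the function, so the import only ever runs in the executor processes:

    def classify(record):
        # importing here instead of at module level keeps boilerpipe
        # (and the JVM that JPype starts for it) out of the driver,
        # where it can clash with PySpark's Py4J connection
        from boilerpipe.extract import Extractor
        extractor = Extractor(extractor='ArticleExtractor', html=record)
        return extractor.getText()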