
I'm using Zeppelin Notebooks/Apache Spark and I am frequently getting the following error:

org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:86)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:429)
    at org.apache.thrift.protocol.TBinaryProtocol.readI32(TBinaryProtocol.java:318)
    at org.apache.thrift.protocol.TBinaryProtocol.readMessageBegin(TBinaryProtocol.java:219)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:69)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.recv_interpret(RemoteInterpreterService.java:249)
    at org.apache.zeppelin.interpreter.thrift.RemoteInterpreterService$Client.interpret(RemoteInterpreterService.java:233)
    at org.apache.zeppelin.interpreter.remote.RemoteInterpreter.interpret(RemoteInterpreter.java:269)
    at org.apache.zeppelin.interpreter.LazyOpenInterpreter.interpret(LazyOpenInterpreter.java:94)
    at org.apache.zeppelin.notebook.Paragraph.jobRun(Paragraph.java:279)
    at org.apache.zeppelin.scheduler.Job.run(Job.java:176)
    at org.apache.zeppelin.scheduler.RemoteScheduler$JobRunner.run(RemoteScheduler.java:328)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

If I try to run the same code again (just ignoring the error), I get this (just top line):

java.net.SocketException: Broken pipe (Write failed)

Then if I try to run it a third time (or any time thereafter), I get this error:

java.net.ConnectException: Connection refused (Connection refused)

If I restart the interpreter in Zeppelin Notebooks then it works (initially) but eventually I end up getting this error again.

This error has occurred at various steps in my process (data cleaning, vectorization, etc.), but by far the most frequent point it occurs is when I'm fitting a model. To give you a better idea of what I'm actually doing and when it typically occurs, I'll walk you through my process:

I'm using Apache Spark ML: I've done some standard vectorization and weighting (CountVectorizer, IDF) and am then building a model on that data.
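Roughly, the vectorization/weighting step looks something like this (a simplified sketch, not my exact code; column names like title and title_tokens are placeholders for my real schema):

from pyspark.ml.feature import Tokenizer, CountVectorizer, IDF

# Tokenize the raw text, count term frequencies, then re-weight them with IDF
tokenizer = Tokenizer(inputCol="title", outputCol="title_tokens")
tokens = tokenizer.transform(raw_df)

cv = CountVectorizer(inputCol="title_tokens", outputCol="title_tf")
cv_model = cv.fit(tokens)
tf = cv_model.transform(tokens)

idf = IDF(inputCol="title_tf", outputCol="title_tfidf")
idf_model = idf.fit(tf)
tfidf = idf_model.transform(tf)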

I used VectorAssembler to create my feature vector, translated that into a dense vector, and then converted it to a DataFrame:

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import DenseVector
from pyspark.sql import Row

# Assemble the individual feature columns into a single vector column
assembler = VectorAssembler(inputCols=["fileSize", "hour", "day", "month", "punct_title", "cap_title", "punct_excerpt", "title_tfidf", "ct_tfidf", "excerpt_tfidf", "regex_tfidf"], outputCol="features")

vector_train = assembler.transform(train_raw).select("Target", "features")
vector_test = assembler.transform(test_raw).select("Target", "features")

# Convert each feature vector to a DenseVector and rename the columns to label/features
train_final = vector_train.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))
test_final = vector_test.rdd.map(lambda x: Row(label=x[0], features=DenseVector(x[1].toArray())))

train_final_df = sqlContext.createDataFrame(train_final)
test_final_df = sqlContext.createDataFrame(test_final)

So the training set to be fed into the model looks like this (the actual dataset has ~15k feature columns, and I downsampled to ~5k examples just to try to get it to run):

[Row(features=DenseVector([7016.0, 9.0, 16.0, 2.0, 2.0, 4.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.315, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..................... 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 7.235, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]), label=0)]
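As an aside, since the assembled TF-IDF features are mostly zeros, the DenseVector conversion above blows each row up to the full ~15k values. A sketch of the alternative (not what I'm currently running) would be to keep the vectors as they come out of VectorAssembler, since as far as I understand the Spark ML classifiers accept either sparse or dense vectors in the features column:

# Sketch: skip the DenseVector conversion and keep the assembled (typically sparse) vectors
train_final_df = (assembler.transform(train_raw)
                  .withColumnRenamed("Target", "label")
                  .select("label", "features"))
test_final_df = (assembler.transform(test_raw)
                 .withColumnRenamed("Target", "label")
                 .select("label", "features"))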

The next step is fitting the model; this is where the error typically pops up. I have tried both fitting a single model and running cross-validation (with a ParamGrid):

Single Model:

from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Fit a single gradient-boosted trees model
gbt = GBTClassifier(labelCol="label", featuresCol="features", maxDepth=8, maxBins=16, maxIter=40)
GBT_model = gbt.fit(train_final_df)

# Score the test set and evaluate AUROC / AUPR
predictions_GBT = GBT_model.transform(test_final_df)
predictions_GBT.cache()
evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
auroc = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderROC"})
aupr = evaluator.evaluate(predictions_GBT, {evaluator.metricName: "areaUnderPR"})

With CV/PG:

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import GBTClassifier

GBT_model = GBTClassifier()

paramGrid = ParamGridBuilder() \
    .addGrid(GBT_model.maxDepth, [2,4]) \
    .addGrid(GBT_model.maxBins, [2,4]) \
    .addGrid(GBT_model.maxIter, [10,20]) \
    .build()

evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction", metricName="areaUnderPR")

crossval = CrossValidator(estimator=GBT_model, estimatorParamMaps=paramGrid, evaluator=evaluator, numFolds=5) 

cvModel = crossval.fit(train_final_df)
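One thing I can do before calling fit() is force the training DataFrame to materialize, to confirm whether the interpreter dies during data prep or only once the model starts training (a small diagnostic sketch):

# Diagnostic sketch: materialize and cache the training data before fitting
train_final_df.persist()
print(train_final_df.count())                 # forces the full lineage to execute
print(train_final_df.rdd.getNumPartitions())  # how the data is split across executors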

I know it has something to do with the interpreter, but I can't figure out either (a) what I'm doing wrong or (b) how to get around this glitch.

UPDATE: I was asked for versions and memory configuration in the SO Apache Spark chat, so I figured I would provide an update here.

Versions:

  • Spark: 2.0.1
  • Zeppelin: 0.6.2

Memory Configuration:

  • I am running on an EMR cluster using a c1.xlarge EC2 instance (7 GiB) as my master node and r3.8xlarge instances (244 GiB) as my core nodes
  • In Zeppelin, I went in and changed spark.driver.memory to 4g and spark.executor.memory to 128g

After I went in and set these Zeppelin memory configurations, I ran my code again and still got the same error.

I only started using Spark fairly recently. Are there other memory configurations that need to be set? Are these memory configurations unreasonable?
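For reference, this is roughly what those properties look like in the Spark interpreter settings (or spark-defaults.conf). The first two are what I actually changed; the others are related properties I have not touched, and the values shown for them are placeholders, not numbers I know to be correct:

# Set via the Zeppelin Spark interpreter properties or spark-defaults.conf
spark.driver.memory                 4g
spark.executor.memory               128g
# Related properties I have NOT set (placeholder values):
spark.yarn.executor.memoryOverhead  8192
spark.driver.maxResultSize          2g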
