I have a large dataset, with roughly 250 features, that I would like to use in a gradient-boosted trees classifier. I have millions of observations, but I'm having trouble getting the model to work with even 1% of my data (~300k observations). Below is a snippet of my code. I am unable to share any data with you, but all features are numeric (either numerical variables or dummy variables for various factor levels). I use VectorAssembler to create a features column containing the vector of features for each observation.
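For reference, this is roughly how I build the features column (the column names are placeholders, since I can't share the real ones):

from pyspark.ml.feature import VectorAssembler

# "feature_cols" stands in for my ~250 numeric/dummy columns
feature_cols = [c for c in train.columns if c != target]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train = assembler.transform(train)
test = assembler.transform(test)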
When I reduce the number of features used by the model, say to 5, the model runs without issue. Only when I make the problem more complex by adding a large number of features does it begin to fail. The error I get is a TTransportException, and the model will try to run for hours before it errors out. I am building my model on Qubole. I am new to both Qubole and PySpark, so I'm not sure whether my issue is a Spark memory issue, a Qubole memory issue (my cluster has 4+ TB of memory, and the data is only a few GB), or something else.
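One thing I have considered is explicitly raising the driver and executor memory when the session is created, though I don't know whether that is the right lever here. A rough sketch of what I had in mind (the config keys are standard Spark settings; the values are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gbt-debug")
         .config("spark.executor.memory", "16g")      # placeholder value
         .config("spark.driver.memory", "16g")        # placeholder value
         .config("spark.driver.maxResultSize", "4g")  # placeholder value
         .getOrCreate())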
Any thoughts or ideas for testing/debugging would be helpful. Thanks.
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# GBTClassifier expects the label column to be named "label" by default
train = train.withColumnRenamed(target, "label")
test = test.withColumnRenamed(target, "label")

evaluator = BinaryClassificationEvaluator()
gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)

# Score the held-out data and report area under ROC
gbtPredictions = gbtModel.transform(test)
gbtPredictions.select('label', 'rawPrediction', 'prediction', 'probability').show(10)
print("Test Area Under ROC: " + str(evaluator.evaluate(gbtPredictions, {evaluator.metricName: "areaUnderROC"})))