
I have a large dataset, with roughly 250 features, that I would like to use in a gradient-boosted trees classifier. I have millions of observations, but I'm having trouble getting the model to work with even 1% of my data (~300k observations). Below is a snippet of my code. I am unable to share any of the data, but all features are numeric (either numerical variables or dummy variables for various factor levels). I use VectorAssembler to create a features column containing the vector of features for each observation.

When I reduce the number of features used by the model, say to 5, the model runs without issue. Only when I make the problem more complex by adding a large number of features does it begin to fail. The error I get is a TTransportException. The model will try to run for hours before it errors out. I am building my model using Qubole. I am new to both Qubole and PySpark, so I'm not sure whether my issue is a Spark memory issue, a Qubole memory issue (my cluster has 4+ TB of memory and the data is only a few GB), or something else.

Any thoughts or ideas for testing/debugging would be helpful. Thanks.

from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Rename the target column so the estimator and evaluator pick it up by default
train = train.withColumnRenamed(target, "label")
test = test.withColumnRenamed(target, "label")

evaluator = BinaryClassificationEvaluator()
gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)
gbtPredictions = gbtModel.transform(test)
gbtPredictions.select('label', 'rawPrediction', 'prediction', 'probability').show(10)

print("Test Area Under ROC: " + str(evaluator.evaluate(gbtPredictions, {evaluator.metricName: "areaUnderROC"})))
1 Answer


You would want to try this: https://docs.qubole.com/en/latest/troubleshooting-guide/notebook-ts/troubleshoot-notebook.html#ttexception. If this still doesn't help, feel free to create a support ticket with us and we would be happy to investigate.
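
As a general first thing to check (this is a sketch, not taken from the linked page): a TTransportException in a notebook is often a symptom of the driver or interpreter process running out of memory while the job is still running. The values below are placeholders, and on a notebook cluster these usually have to be set in the cluster or interpreter Spark configuration before the session starts, rather than after the fact:

from pyspark.sql import SparkSession

# Placeholder values; adjust to your cluster and set them in the
# cluster/interpreter Spark config so they take effect at session start.
spark = (
    SparkSession.builder
    .appName("gbt-debug")
    .config("spark.driver.memory", "8g")          # heap for the driver JVM
    .config("spark.driver.maxResultSize", "4g")   # cap on results collected to the driver
    .config("spark.executor.memory", "8g")        # per-executor memory
    .getOrCreate()
)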