I have a large dataset, with roughly 250 features, that I would like to use in a gradient-boosted trees classifier. I have millions of observations, but I'm having trouble getting the model to work with even 1% of my data (~300k observations). Below is a snippet of my code. I am unable to share any data with you, but all features are numeric (either numerical variables or dummy variables for various factor levels). I use VectorAssembler to create a features column containing the vector of features for each observation.
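For reference, this is roughly how I build the features column (the column names are placeholders, since I can't share the real ones):

from pyspark.ml.feature import VectorAssembler

# "feature_cols" stands in for my ~250 numeric/dummy columns
feature_cols = [c for c in train.columns if c != target]

assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
train = assembler.transform(train)
test = assembler.transform(test)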
When I reduce the number of features used by the model, say to 5, the model runs without issue. Only when I make the problem more complex by adding a large number of features does it begin to fail. The error I get is a TTransportException, and the model will try to run for hours before it errors out. I am building my model on Qubole. I am new to both Qubole and PySpark, so I'm not sure whether my issue is a Spark memory issue, a Qubole memory issue (my cluster has 4+ TB of memory, and the data is only a few GB), or something else.
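One thing I have considered is explicitly raising the driver and executor memory when the session is created, though I don't know whether that is the right lever here. A rough sketch of what I had in mind (the config keys are standard Spark settings; the values are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gbt-debug")
         .config("spark.executor.memory", "16g")      # placeholder value
         .config("spark.driver.memory", "16g")        # placeholder value
         .config("spark.driver.maxResultSize", "4g")  # placeholder value
         .getOrCreate())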
Any thoughts or ideas for testing/debugging would be helpful. Thanks.
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# GBTClassifier expects the label column to be named "label" by default
train = train.withColumnRenamed(target, "label")
test = test.withColumnRenamed(target, "label")

evaluator = BinaryClassificationEvaluator()
gbt = GBTClassifier(maxIter=10)
gbtModel = gbt.fit(train)

# Score the held-out data and report area under ROC
gbtPredictions = gbtModel.transform(test)
gbtPredictions.select('label', 'rawPrediction', 'prediction', 'probability').show(10)
print("Test Area Under ROC: " + str(evaluator.evaluate(gbtPredictions, {evaluator.metricName: "areaUnderROC"})))