Spark Process Dataframe with Random Forest

Question

Using the answer to Spark 1.5.1, MLLib Random Forest Probability, I was able to train a random forest using ml.classification.RandomForestClassifier, and to process a holdout dataframe with the trained random forest.

The problem I have is that I would like to save this trained random forest to process any dataframe (with the same features as the training set) in the future.

The classification example on this page uses mllib.tree.model.RandomForestModel. It shows how to save the trained forest, but to the best of my understanding the model can only be trained on (and later applied to) a LabeledPoint RDD. The issue I have with the LabeledPoint RDD is that it can only contain the label (a double) and the features vector, so I would lose all the non-label/non-feature columns that I need for other purposes.

So I guess I need a way to either save the result of ml.classification.RandomForestClassifier, or construct a LabeledPoint RDD that can retain columns other than the label and features required by the forest trained through mllib.tree.model.RandomForestModel.

Does anyone know why both the ML and MLlib libraries exist, rather than just one of them?

Many thanks for reading my question, and thanks in advance for any solutions/suggestions.

score 0 · Accepted Answer · answered Jan 24 '16 at 20:35


I'll just re-use what's been said in the Spark programming guide:

The spark.ml package aims to provide a uniform set of high-level APIs built on top of DataFrames that help users create and tune practical machine learning pipelines.

In Spark, the core feature is its RDDs. There is an excellent paper on that topic if you are interested; I can add the link to it later.

Then came MLlib, which was an independent library at first and was later absorbed into the Spark project. Nevertheless, all the machine learning algorithms in Spark are written on RDDs.

Then the DataFrame abstraction was added to the project, and so a more practical way of building machine learning applications was needed, one that includes transformers, evaluators and, most importantly, pipelines.

Data engineers and data scientists shouldn't need to study the underlying technology. Hence the abstraction.

You can use both, but you need to remember that all the algorithms you use from ML are implemented in MLlib and then abstracted for easier usage.


eliasah


Thanks Eliasah, that clarifies the difference between ml and mllib. My main problem is being able to save a trained random forest, and to predict values for a dataframe and not a labeledpoint rdd. Any thoughts on this? – Benji Kok Jan 25 '16 at 04:55
What is the problem with prediction? Have you read my answer at the link you posted in your question? You are not being very clear here, I'm afraid. Do you want to save an ML model and predict with it? – eliasah Jan 25 '16 at 07:20
ML allows you to predict on a dataframe, but does not allow you to save. MLLIB allows you to save, but does not allow you to predict on a dataframe (only on a labelledpoint RDD). I would like to save the random forest (like MLLIB allows, and ML doesn't allow) and use the saved forest to predict for a dataframe (like MLLIB doesn't allow, and ML does allow) – Benji Kok Jan 25 '16 at 07:35
It's the same algorithm in the end. But unfortunately, the save method is not available in ML for the moment. I believe there is a JIRA issue for that, so a solution for now is to create your model with LabeledPoint and MLlib. Unfortunately, even the PMML export isn't available for RF in ML. – eliasah Jan 25 '16 at 07:49
