How to replace the DataField values with exact column names in Spark-MLlib PMML file?

Question

I use Spark 2.1.0.

I've been trying to export a Spark MLlib linear regression model as a PMML file. The export itself succeeds, but the resulting file doesn't contain any of my column names; the fields only carry generic placeholder names.

Can anyone explain the reason for this, and how to get the actual column names in their place?

score 1 · Answer 1 · answered Jun 08 '17 at 07:35

There are two approaches to exporting Apache Spark models to the PMML data format. First, if you work at the Spark ML abstraction level, you can use the JPMML-SparkML library. Second, if you work at the Spark MLlib abstraction level, which appears to be the case here, you can use the built-in PMMLExportable trait.

JPMML-SparkML retrieves column names from the Spark ML data schema via DataFrame#schema(). Unfortunately, there is no such option for Spark MLlib, so feature names "field_{n}" and the label name "target" are simply dummy hard-coded names.

It is fairly easy to rename fields in the PMML document using the JPMML-Model library:

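// Spark side: write the raw PMML file produced by MLlib
pmmlExportable.toPMML("/tmp/raw-pmml-file")
// Java side (imports: java.io.File, javax.xml.transform.stream.StreamSource, javax.xml.transform.stream.StreamResult,
// org.dmg.pmml.FieldName, org.dmg.pmml.PMML, org.jpmml.model.JAXBUtil, org.jpmml.model.visitors.FieldRenamer)
PMML pmml = JAXBUtil.unmarshalPMML(new StreamSource(new File("/tmp/raw-pmml-file")));
// Rename the field "target" and every reference to it to "y"
FieldRenamer targetRenamer = new FieldRenamer(FieldName.create("target"), FieldName.create("y"));
targetRenamer.applyTo(pmml);
JAXBUtil.marshalPMML(pmml, new StreamResult(new File("/tmp/final-pmml-file")));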

If you marshal this PMML object instance to a PMML file, then you can see that the field "target" (and all its references) has been renamed to "y". Repeat the procedure with features.
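
To repeat the procedure for the features, it helps to first build the full list of renames. The following is only a sketch using the plain JDK: FieldRenames and build are hypothetical names, the column names are made up, and it assumes that the exported features are numbered from 0 ("field_0", "field_1", ...) in the same order as the columns of the training vector.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper; not part of Spark or JPMML-Model.
class FieldRenames {

    // Maps the placeholder names in the exported PMML to real column names.
    // Assumption: features are numbered from 0, in training-vector order.
    static Map<String, String> build(List<String> featureColumns, String labelColumn) {
        Map<String, String> renames = new LinkedHashMap<>();
        for (int i = 0; i < featureColumns.size(); i++) {
            renames.put("field_" + i, featureColumns.get(i));
        }
        renames.put("target", labelColumn);
        return renames;
    }

    public static void main(String[] args) {
        // Example column names, for illustration only.
        Map<String, String> renames = build(Arrays.asList("age", "income"), "price");
        for (Map.Entry<String, String> e : renames.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each entry can then be passed to its own FieldRenamer, exactly as done for "target" above. If one of your real column names happens to look like a placeholder (for example, a column literally called "field_1"), the order of the renames matters and you may need an intermediate name.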
