
Currently, I am working on a simple machine learning program that generates a PMML file. For this experiment, I use PySpark as the machine learning library and pyspark2pmml as the PMML builder.

I have a problem when building the PMML file. Everything from data loading to model training works fine. However, I am unable to generate a PMML file from the data transformed in the previous steps.


... (previous steps)
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, RFormula
from pyspark.ml.classification import DecisionTreeClassifier

filtered_data = filtered_data.filter(filtered_data['year'] <= '2020')

cat_cols = ['CYC_CLASS', 'CYC_DISCHPORT']
num_cols = ['weekofyear']
label_cols = 'dwell_time'

# replace column names
replacements = {c: 'pre_' + c for c in cat_cols}
filtered_df = filtered_data.select([col(c).alias(replacements.get(c, c)) for c in filtered_data.columns])

# prepare string indexer
si = [StringIndexer(inputCol='pre_' + d, outputCol=d, handleInvalid='keep') for d in cat_cols]
feat_cols = cat_cols + num_cols
print(feat_cols)

# create pipeline: indexing data into string index
pipeline_si = Pipeline(stages=si)
pipeline_si_data = pipeline_si.fit(filtered_df)

# transform data using string indexer
indexed_data = pipeline_si_data.transform(filtered_df)
indexed_data = indexed_data.select([label_cols] + feat_cols)
indexed_data.show(5)

# define and concat features columns
formula = RFormula(formula = "dwell_time ~ .")

# prepare classifier: we don't define featuresCol here, we will use Rformula instead !
dt_clf = DecisionTreeClassifier(maxBins=10)

# create pipeline: training classifier using training data
classifier_pipeline = Pipeline(stages=[formula, dt_clf])
dts_model = classifier_pipeline.fit(indexed_data)
print(dts_model)

Code for exporting to PMML file:

from pyspark2pmml import PMMLBuilder
pmmlDTs = PMMLBuilder(spark, indexed_data, dts_model)
pmmlDTs.buildFile("test_dtsi_22121230556.pmml")

and I got an error (the stack trace was posted as a screenshot, not reproduced here).

Interestingly, when I convert indexed_data into a CSV file and then load it back as a DataFrame, I can generate a PMML file.

# prepare data
test_df = spark.read.csv('indexed.csv', header=True, inferSchema = True)

# define and concat features columns
formula = RFormula(formula = "dwell_time ~ .")

# prepare classifier: we don't define featuresCol here, we will use Rformula instead !
dt_clf = DecisionTreeClassifier(maxBins=10)

# create pipeline: training classifier using training data
classifier_pipeline = Pipeline(stages=[formula, dt_clf])
dts_model = classifier_pipeline.fit(test_df)
print(dts_model)

pmmlDTs = PMMLBuilder(spark, test_df, dts_model)
pmmlDTs.buildFile("test_dtsi_22121230556www.pmml")

out: 
'D:\\MYWORLD\\CODE\\pmml\\test_dtsi_22121230556www.pmml'

Can anyone help me figure out what my mistake is in the first approach, and what the difference is with the second approach? I am new to PySpark. Thanks.

furanzup

1 Answer


Can anyone help me to figure out what is my mistake in the first approach?

Your conceptual mistake is that you're applying the RFormula transformation to already-indexed data.

The main value proposition of RFormula is that it inspects the types of incoming columns (e.g. continuous vs. categorical) and performs any essential transformations automatically (e.g. keeping continuous columns as-is vs. applying string indexing to categorical columns).

Right now, the RFormula transformation is probably attempting to apply a "second layer" of string indexing to the categorical columns. The PMML converter analyzes the pipeline model, finds this sequence of operations nonsensical, and bails out with an exception.

What is the difference with second approach?

When you save your string-indexed data into a CSV file and load it back from there, you get a different data schema.

The CSV file likely contains all-numeric columns, and the new CSV data schema probably declares that the pipeline is dealing with all continuous columns now. In particular, schema inference does not restore the ML column metadata that StringIndexer attached, so RFormula treats every column as continuous.

user1808924