Currently, I am working on a simple machine learning program that generates a PMML file. For this experiment, I use PySpark as the machine learning library and pyspark2pmml as the PMML builder.
I have a problem when building the PMML file. Everything from data loading to model training works fine, but I am unable to generate a PMML file from the data transformed in the previous steps.
... (previous steps)
from pyspark.sql.functions import col
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, RFormula
from pyspark.ml.classification import DecisionTreeClassifier

filtered_data = filtered_data.filter(filtered_data['year'] <= '2020')
cat_cols = ['CYC_CLASS', 'CYC_DISCHPORT']
num_cols = ['weekofyear']
label_cols = 'dwell_time'
# replace column names
replacements = {c: 'pre_' + c for c in filtered_data.columns if c in cat_cols}
filtered_df = filtered_data.select([col(c).alias(replacements.get(c, c)) for c in filtered_data.columns])
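The renaming above just maps each categorical column to a `'pre_'`-prefixed alias and leaves the other columns untouched. A plain-Python sketch of the mapping (the column names are taken from this post, except `year`, which is assumed from the filter step):

```python
cat_cols = ['CYC_CLASS', 'CYC_DISCHPORT']
columns = ['dwell_time', 'CYC_CLASS', 'CYC_DISCHPORT', 'weekofyear', 'year']

# Map each categorical column to its 'pre_'-prefixed name.
replacements = {c: 'pre_' + c for c in columns if c in cat_cols}

# Apply the mapping; non-categorical columns fall through unchanged.
renamed = [replacements.get(c, c) for c in columns]
print(renamed)
# ['dwell_time', 'pre_CYC_CLASS', 'pre_CYC_DISCHPORT', 'weekofyear', 'year']
```

This frees up the original names (`CYC_CLASS`, `CYC_DISCHPORT`) so the StringIndexer stages below can write their indexed output under those names.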
# prepare string indexer
si = [StringIndexer(inputCol='pre_' + d, outputCol=d, handleInvalid='keep') for d in cat_cols]
feat_cols = cat_cols + num_cols
print(feat_cols)
# create pipeline: indexing data into string index
pipeline_si = Pipeline(stages=si)
pipeline_si_data = pipeline_si.fit(filtered_df)
# transform data using string indexer
indexed_data = pipeline_si_data.transform(filtered_df)
indexed_data = indexed_data.select([label_cols] + feat_cols)
indexed_data.show(5)
# define and concat features columns
formula = RFormula(formula="dwell_time ~ .")
# prepare classifier: we don't set featuresCol here, we will use RFormula instead
dt_clf = DecisionTreeClassifier(maxBins=10)
# create pipeline: training classifier using training data
classifier_pipeline = Pipeline(stages=[formula, dt_clf])
dts_model = classifier_pipeline.fit(indexed_data)
print(dts_model)
Code for exporting to a PMML file:
from pyspark2pmml import PMMLBuilder
pmmlDTs = PMMLBuilder(spark, indexed_data, dts_model)
pmmlDTs.buildFile("test_dtsi_22121230556.pmml")
Interestingly, when I write indexed_data out to a CSV file and then load it back as a DataFrame, I can generate a PMML file.
# prepare data
test_df = spark.read.csv('indexed.csv', header=True, inferSchema=True)
# define and concat features columns
formula = RFormula(formula="dwell_time ~ .")
# prepare classifier: we don't set featuresCol here, we will use RFormula instead
dt_clf = DecisionTreeClassifier(maxBins=10)
# create pipeline: training classifier using training data
classifier_pipeline = Pipeline(stages=[formula, dt_clf])
dts_model = classifier_pipeline.fit(test_df)
print(dts_model)
pmmlDTs = PMMLBuilder(spark, test_df, dts_model)
pmmlDTs.buildFile("test_dtsi_22121230556www.pmml")
Output:
'D:\\MYWORLD\\CODE\\pmml\\test_dtsi_22121230556www.pmml'
Can anyone help me figure out my mistake in the first approach, and what the difference is with the second approach? I am new to PySpark. Thanks.