
I have a PMML file (below) that my colleague generated from an R linear model; it is meant to predict the cost of an item from 5 features. I am trying to consume this model with Augustus in Python to make these predictions. Augustus loads the PMML file successfully, but I cannot get the predicted values out of it.

I've gone through many examples from Augustus's Model abstraction and searched Stack Overflow and Google, but I have yet to find an example of linear regression being used successfully. A similar question was asked previously, but it was never properly answered. I have also tried other example regression PMML files, with similar results.

How can I run the regression using Augustus (or another library) in Python and obtain the predictions?

PMML Code: linear_model.xml

<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
 <Header copyright="Copyright (c) 2016 root" description="Linear Regression Model">
  <Extension name="user" value="root" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.4"/>
  <Timestamp>2016-02-02 19:20:59</Timestamp>
 </Header>
 <DataDictionary numberOfFields="6">
  <DataField name="cost" optype="continuous" dataType="double"/>
  <DataField name="quantity" optype="continuous" dataType="double"/>
  <DataField name="total_component_weight" optype="continuous" dataType="double"/>
  <DataField name="quantity_cost_mean" optype="continuous" dataType="double"/>
  <DataField name="mat_quantity_cost_mean" optype="continuous" dataType="double"/>
  <DataField name="solid_volume" optype="continuous" dataType="double"/>
 </DataDictionary>
 <RegressionModel modelName="Linear_Regression_Model" functionName="regression" algorithmName="least squares" targetFieldName="cost">
  <MiningSchema>
   <MiningField name="cost" usageType="predicted"/>
   <MiningField name="quantity" usageType="active"/>
   <MiningField name="total_component_weight" usageType="active"/>
   <MiningField name="quantity_cost_mean" usageType="active"/>
   <MiningField name="mat_quantity_cost_mean" usageType="active"/>
   <MiningField name="solid_volume" usageType="active"/>
  </MiningSchema>
  <Output>
   <OutputField name="Predicted_cost" feature="predictedValue"/>
  </Output>
  <RegressionTable intercept="-5.18924891969128">
   <NumericPredictor name="quantity" exponent="1" coefficient="0.0128484453941352"/>
   <NumericPredictor name="total_component_weight" exponent="1" coefficient="12.0357979395919"/>
   <NumericPredictor name="quantity_cost_mean" exponent="1" coefficient="0.500814050845585"/>
   <NumericPredictor name="mat_quantity_cost_mean" exponent="1" coefficient="0.556822746464491"/>
   <NumericPredictor name="solid_volume" exponent="1" coefficient="0.000197314943339284"/>
  </RegressionTable>
 </RegressionModel>
</PMML>

Python Code:

import pandas as pd
from augustus.strict import *

train_full_df = pd.read_csv('train_data.csv', low_memory=False)

model = modelLoader.loadXml('linear_model.xml')
dataTable = model.calc({'quantity': train_full_df.quantity[:10], 
                        'total_component_weight': train_full_df.total_component_weight[:10],
                        'quantity_cost_mean': train_full_df.quantity_cost_mean[:10],
                        'mat_quantity_cost_mean': train_full_df.mat_quantity_cost_mean[:10],
                        'solid_volume': train_full_df.solid_volume[:10],
                       })
dataTable.look()

(output)

#  | quantity   | total_comp | quantity_c | mat_quanti | solid_volu
---+------------+------------+------------+------------+-----------
0  | 1.0        | 0.018      | 32.2903337 | 20.4437141 | 1723.48653
1  | 2.0        | 0.018      | 17.2369194 | 12.0418426 | 1723.48653
2  | 5.0        | 0.018      | 10.8846412 | 7.22744702 | 1723.48653
3  | 10.0       | 0.018      | 6.82802948 | 4.3580642  | 1723.48653
4  | 25.0       | 0.018      | 4.84356482 | 3.09218161 | 1723.48653
5  | 50.0       | 0.018      | 4.43703495 | 2.74377648 | 1723.48653
6  | 100.0      | 0.018      | 4.22259101 | 2.5990824  | 1723.48653
7  | 250.0      | 0.018      | 4.1087198  | 2.53432422 | 1723.48653
8  | 1.0        | 0.018      | 32.2903337 | 20.4437141 | 1723.48653
9  | 2.0        | 0.018      | 17.2369194 | 12.0418426 | 1723.48653

As you can see from the table, only the input values are displayed; there is no "cost" column. How do I get the cost to be predicted?

I am using Python 2.7, Augustus 0.6 (also tried 0.5), OS X 10.11

  • scikit-learn is a widely used Python library for machine-learning models. It's very easy to use and straightforward. http://scikit-learn.org/stable/modules/linear_model.html – Sagar Waghmode Feb 04 '16 at 07:05
  • Yes, scikit-learn is great. I've used it on several projects. The issue I have here is that someone else is producing an R model that I want to consume in Python. The only bridge that I am aware of is for the R model to produce PMML and for my Python code to consume it using Augustus. Does scikit-learn allow me to read PMML? – Kevin Balkoski Feb 04 '16 at 07:14
  • I haven't done something like this. But do you need to pass cost for training data to train a regression model? And then pass some part of training data to cross-validate the model? – Sagar Waghmode Feb 04 '16 at 07:18
  • I've attempted to pass the known cost values for those items when calling model.calc(), but it simply displays the given cost values (as opposed to the predicted values) – Kevin Balkoski Feb 04 '16 at 07:26
  • I'm currently running into the same problem with a model generated by Knime. I will post an answer if I can get it working. I assume it's something simple. The documentation is not of high quality, so it's conceivable that what is needed is simply obscure. – Pete Mancini Mar 01 '16 at 19:18
  • @PeteMancini any success? – C8H10N4O2 Sep 26 '16 at 20:26
  • Not really, @c8h10n4O2. We updated our Augustus environment, which helped, but there were performance issues and complications that weren't easily solved given the documentation. We found other ways to build and deliver our models. Probably not the answer you were hoping for. – Pete Mancini Oct 05 '16 at 18:55
  • @PeteMancini thanks for the update. I tried installing Augustus but I don't think it's Python 3 compatible. I'm looking into PFA instead of PMML but it looks even more complicated. What did you wind up using? – C8H10N4O2 Oct 06 '16 at 16:54
  • We are using Spark and H2O.ai. We are still interested in deployable models. Augustus was just too cranky for us to work with effectively. Some of our team are using Knime to make models. I've moved on to Julia because of CUDA and parallelization needs. @C8H10N4O2 – Pete Mancini Oct 11 '16 at 18:46

1 Answer


You could use PyPMML to score PMML models in Python. Taking your model as an example:

import pandas as pd
from pypmml import Model

model = Model.fromString('''<?xml version="1.0"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_1 http://www.dmg.org/v4-1/pmml-4-1.xsd">
 <Header copyright="Copyright (c) 2016 root" description="Linear Regression Model">
  <Extension name="user" value="root" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.4"/>
  <Timestamp>2016-02-02 19:20:59</Timestamp>
 </Header>
 <DataDictionary numberOfFields="6">
  <DataField name="cost" optype="continuous" dataType="double"/>
  <DataField name="quantity" optype="continuous" dataType="double"/>
  <DataField name="total_component_weight" optype="continuous" dataType="double"/>
  <DataField name="quantity_cost_mean" optype="continuous" dataType="double"/>
  <DataField name="mat_quantity_cost_mean" optype="continuous" dataType="double"/>
  <DataField name="solid_volume" optype="continuous" dataType="double"/>
 </DataDictionary>
 <RegressionModel modelName="Linear_Regression_Model" functionName="regression" algorithmName="least squares" targetFieldName="cost">
  <MiningSchema>
   <MiningField name="cost" usageType="predicted"/>
   <MiningField name="quantity" usageType="active"/>
   <MiningField name="total_component_weight" usageType="active"/>
   <MiningField name="quantity_cost_mean" usageType="active"/>
   <MiningField name="mat_quantity_cost_mean" usageType="active"/>
   <MiningField name="solid_volume" usageType="active"/>
  </MiningSchema>
  <Output>
   <OutputField name="Predicted_cost" feature="predictedValue"/>
  </Output>
  <RegressionTable intercept="-5.18924891969128">
   <NumericPredictor name="quantity" exponent="1" coefficient="0.0128484453941352"/>
   <NumericPredictor name="total_component_weight" exponent="1" coefficient="12.0357979395919"/>
   <NumericPredictor name="quantity_cost_mean" exponent="1" coefficient="0.500814050845585"/>
   <NumericPredictor name="mat_quantity_cost_mean" exponent="1" coefficient="0.556822746464491"/>
   <NumericPredictor name="solid_volume" exponent="1" coefficient="0.000197314943339284"/>
  </RegressionTable>
 </RegressionModel>
</PMML>''')
data = pd.DataFrame({
    'quantity': [1.0,2.0,5.0,10.0,25.0,50.0,100.0,250.0,1.0,2.0],
    'total_component_weight': [0.018, 0.018, 0.018, 0.018, 0.018, 0.018, 0.018, 0.018, 0.018, 0.018],
    'quantity_cost_mean': [32.2903337,17.2369194,10.8846412,6.82802948,4.84356482,4.43703495,4.22259101,4.1087198,32.2903337,17.2369194],
    'mat_quantity_cost_mean': [20.4437141,12.0418426,7.22744702,4.3580642 ,3.09218161,2.74377648,2.5990824 ,2.53432422,20.4437141,12.0418426],
    'solid_volume': [1723.48653,1723.48653,1723.48653,1723.48653,1723.48653,1723.48653,1723.48653,1723.48653,1723.48653,1723.48653]
})
result = model.predict(data)

The result is:

    Predicted_cost
0   22.935291
1   10.730825
2   4.907295
3   1.342192
4   -0.163801
5   -0.240186
6   0.214271
7   2.048450
8   22.935291
9   10.730825
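
As a cross-check that does not depend on any PMML library, the same prediction can be reproduced by hand, since a RegressionTable with only NumericPredictor elements is just an intercept plus a sum of coefficient × value^exponent terms. The sketch below is a minimal illustration using only the Python standard library; the helper names `load_linear_model` and `predict` are made up for this example, and the XML fragment is copied from the question with the PMML namespace left off. When parsing the full file, the element names would need the `{http://www.dmg.org/PMML-4_1}` namespace prefix.

```python
import xml.etree.ElementTree as ET

# RegressionTable copied from the PMML in the question.
# Assumption: the xmlns declaration is omitted here to keep lookups simple.
REGRESSION_TABLE = """
<RegressionTable intercept="-5.18924891969128">
 <NumericPredictor name="quantity" exponent="1" coefficient="0.0128484453941352"/>
 <NumericPredictor name="total_component_weight" exponent="1" coefficient="12.0357979395919"/>
 <NumericPredictor name="quantity_cost_mean" exponent="1" coefficient="0.500814050845585"/>
 <NumericPredictor name="mat_quantity_cost_mean" exponent="1" coefficient="0.556822746464491"/>
 <NumericPredictor name="solid_volume" exponent="1" coefficient="0.000197314943339284"/>
</RegressionTable>
"""


def load_linear_model(xml_text):
    """Hypothetical helper: return (intercept, [(name, coefficient, exponent), ...])."""
    table = ET.fromstring(xml_text)
    intercept = float(table.get("intercept"))
    terms = [
        (p.get("name"), float(p.get("coefficient")), int(p.get("exponent", "1")))
        for p in table.findall("NumericPredictor")
    ]
    return intercept, terms


def predict(row, intercept, terms):
    """Hypothetical helper: intercept + sum(coefficient * value ** exponent)."""
    return intercept + sum(coef * row[name] ** exp for name, coef, exp in terms)


intercept, terms = load_linear_model(REGRESSION_TABLE)

# First input row from the table in the question.
first_row = {
    "quantity": 1.0,
    "total_component_weight": 0.018,
    "quantity_cost_mean": 32.2903337,
    "mat_quantity_cost_mean": 20.4437141,
    "solid_volume": 1723.48653,
}

print(round(predict(first_row, intercept, terms), 6))
```

For the first row this should agree with the Predicted_cost of about 22.935291 shown above. It only covers plain numeric predictors; categorical predictors, interaction terms, and normalization methods that PMML also allows are not handled by this sketch.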
PredictFuture