
I would like to clarify my understanding of the results of a trained M5P model. I trained an M5P model, which gave me a tree followed by 4 linear models.

M5 unpruned model tree:
(using smoothed linear models)

Value12 <= 2.266 : 
|   Value2 <= 1111.5 : LM1 (2/0.01%)
|   Value2 >  1111.5 : LM2 (4/2.268%)
Value12 >  2.266 : 
|   Value3 <= 1544650 : LM3 (2/1.652%)
|   Value3 >  1544650 : LM4 (2/92.017%)

LM num: 1
Value15 = 
    -0.0001 * Value2 
    + 1.8377

LM num: 2
Value15 = 
    -0.0001 * Value2 
    + 1.8181

LM num: 3
Value15 = 
    -0 * Value3 
    + 1.7212

LM num: 4
Value15 = 
    -0 * Value3 
    + 1.7093

Number of Rules : 4

To make sure I understood the working principle, I tried to manually replicate the results using the decision tree and the referenced LM models, but the results were not as expected.

I used the tree to determine which LM to apply and performed the operation stated in that LM, but the values did not match the model's predictions. Is that normal?
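To make the arithmetic concrete, here is a minimal sketch (in Python, rather than the Visual Basic I actually used) of that manual replication for the first three training rows, with the predictions RWeka produced shown alongside. The thresholds and coefficients are taken directly from the printed tree and LMs above:

```python
# Route a row through the printed tree, then apply the matching leaf LM.

def leaf_lm(value2, value3, value12):
    """Return (LM name, prediction) using the printed tree and leaf models."""
    if value12 <= 2.266:
        if value2 <= 1111.5:
            return "LM1", -0.0001 * value2 + 1.8377
        return "LM2", -0.0001 * value2 + 1.8181
    if value3 <= 1544650:
        return "LM3", -0.0 * value3 + 1.7212
    return "LM4", -0.0 * value3 + 1.7093

# First three training rows (Value2, Value3, Value12) and the predictions
# RWeka actually produced for them:
rows = [
    (610, 1544673, 3.27869, 1.56039428073199),
    (1245, 2206981, 0.80321, 1.74959163286097),
    (978, 2512821, 2.04499, 1.77758972532522),
]
for v2, v3, v12, weka_pred in rows:
    name, manual = leaf_lm(v2, v3, v12)
    # The raw leaf-LM value never matches the RWeka prediction:
    print(name, round(manual, 4), "vs", round(weka_pred, 4))
```

For example, the first row falls into LM4 and gives 1.7093, yet RWeka predicts 1.5604, which is exactly the mismatch I am asking about.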

The dataset I used:

Data_train<-structure(list(Value2 = c(610L, 1245L, 978L, 610L, 978L, 610L, 
1727L, 1810L, 1805L, 1805L), Value3 = c(1544673L, 2206981L, 2512821L, 
1544627L, 2512792L, 1524144L, 3415598L, 9205162L, 9182166L, 9182089L
), Value4 = c(12.1260004043579, 17.3250007629395, 19.7259998321533, 
12.125, 19.7250003814697, 11.9650001525879, 26.8120002746582, 
72.2610015869141, 72.0800018310547, 72.0790023803711), Value5 = 
c(0.0817999988794327, 
0.0856000036001205, 0.0828000009059906, 0.0817999988794327, 
0.0828000009059906, 
0.09009999781847, 0.145199999213219, 0.200299993157387, 0.200299993157387, 
0.200200006365776), Value6 = c(2L, 1L, 2L, 2L, 2L, 2L, 4L, 4L, 
4L, 4L), Value7 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    Value8 = c(4L, 4L, 4L, 4L, 4L, 4L, 22L, 36L, 36L, 36L), Value9 = c(1L, 
    1L, 2L, 1L, 2L, 1L, 8L, 6L, 6L, 6L), Value10 = c(0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Value11 = c(0.958189010620117, 
    1, 0.925986051559448, 0.958268105983734, 0.926032960414886, 
    0.971082329750061, 0.471057742834091, 0.476771682500839, 
    0.47670641541481, 0.47671303153038), Value12 = c(3.27869, 
    0.80321, 2.04499, 3.27869, 2.04499, 3.27869, 2.31616, 2.20994, 
    2.21607, 2.21607), Value13 = c(1L, 0L, 1L, 1L, 1L, 1L, 2L, 
    3L, 3L, 3L), Value15 = c(1.33398258686066, 1.90592515468597, 
    2.17005920410156, 1.33387243747711, 2.1699492931366, 1.31627094745636, 
    0.353617042303085, 1.93668437004089, 1.93183350563049, 1.93180668354034
    )), .Names = c("Value2", "Value3", "Value4", "Value5", "Value6", 
"Value7", "Value8", "Value9", "Value10", "Value11", "Value12", 
"Value13", "Value15"), row.names = c(NA, 10L), class = "data.frame") 

Here is the formula I used to train the model:

library(RWeka)
Data_modelUnPruned <- M5P(
    Value15 ~ Value6 + Value3 + Value4 + Value2 + Value7 + Value8 +
        Value9 + Value10 + Value11 + Value12 + Value13,
    data = Data_train,
    control = Weka_control(N = TRUE)
)

Here is the resulting dataset after having added the prediction column:

Data_train_Results<-structure(list(Value2 = c(610L, 1245L, 978L, 610L, 978L, 
610L, 
1727L, 1810L, 1805L, 1805L), Value3 = c(1544673L, 2206981L, 2512821L, 
1544627L, 2512792L, 1524144L, 3415598L, 9205162L, 9182166L, 9182089L
), Value4 = c(12.1260004043579, 17.3250007629395, 19.7259998321533, 
12.125, 19.7250003814697, 11.9650001525879, 26.8120002746582, 
72.2610015869141, 72.0800018310547, 72.0790023803711), Value5 = 
c(0.0817999988794327, 
0.0856000036001205, 0.0828000009059906, 0.0817999988794327, 
0.0828000009059906, 
0.09009999781847, 0.145199999213219, 0.200299993157387, 0.200299993157387, 
0.200200006365776), Value6 = c(2L, 1L, 2L, 2L, 2L, 2L, 4L, 4L, 
4L, 4L), Value7 = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), 
    Value8 = c(4L, 4L, 4L, 4L, 4L, 4L, 22L, 36L, 36L, 36L), Value9 = c(1L, 
    1L, 2L, 1L, 2L, 1L, 8L, 6L, 6L, 6L), Value10 = c(0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L), Value11 = c(0.958189010620117, 
    1, 0.925986051559448, 0.958268105983734, 0.926032960414886, 
    0.971082329750061, 0.471057742834091, 0.476771682500839, 
    0.47670641541481, 0.47671303153038), Value12 = c(3.27869, 
    0.80321, 2.04499, 3.27869, 2.04499, 3.27869, 2.31616, 2.20994, 
    2.21607, 2.21607), Value13 = c(1L, 0L, 1L, 1L, 1L, 1L, 2L, 
    3L, 3L, 3L), Value15 = c(1.33398258686066, 1.90592515468597, 
    2.17005920410156, 1.33387243747711, 2.1699492931366, 1.31627094745636, 
    0.353617042303085, 1.93668437004089, 1.93183350563049, 1.93180668354034
    ), Model_Prediction = c(1.56039428073199, 1.74959163286097, 
    1.77758972532522, 1.57231876013397, 1.77758972532522, 1.57429264935954, 
    1.38009848913172, 1.71850280973615, 1.71877793206469, 1.71877793206469
    )), .Names = c("Value2", "Value3", "Value4", "Value5", "Value6", 
"Value7", "Value8", "Value9", "Value10", "Value11", "Value12", 
"Value13", "Value15", "Model_Prediction"), row.names = c(NA, 
10L), class = "data.frame")

Here is the code I used to try to replicate the model's results; it is essentially a hard-coded version of the M5P model in Visual Basic.

Public Function GetLM(Value2 As Long, Value3 As Long, Value4 As Double, _
                      Value6 As Long, Value7 As Long, Value8 As Long, _
                      Value9 As Long, Value10 As Long, Value11 As Double, _
                      Value12 As Double, Value13 As Long) As Double
Dim lm As String

If Value12 <= 2.266 Then
    If Value2 <= 1111.5 Then
        lm = "LM1" '(2/0.01%)
    Else
        lm = "LM2" '(4/2.268%)
    End If
Else
    If Value3 <= 1544650 Then
        lm = "LM3" '(2/1.652%)
    Else
        lm = "LM4" '(2/92.017%)
    End If
End If

Select Case lm
    Case "LM1"
        GetLM = -0.0001 * Value2 + 1.8377
    Case "LM2"
        GetLM = -0.0001 * Value2 + 1.8181
    Case "LM3"
        GetLM = -0 * Value3 + 1.7212
    Case "LM4"
        GetLM = -0 * Value3 + 1.7093
    Case Else
        GetLM = 0
End Select
End Function

Can someone explain to me how this should work?

Thank you very much.

DavBig
  • Hi, thank you for your fast answer! Would you like the entire dataset on which I did the training (593 lines) and the formula I used? Or just a couple of lines and their predicted values? – DavBig Aug 22 '17 at 18:28
  • Here are four rows plus the header (the last column is the model prediction):

        Value2 Value3  Value4      Value5      Value6 Value7 Value8 Value9 Value10 Value11     Value12 Value13 Value15     Model_predict
        610    1544673 12.1260004  0.081799999 2      0      4      1      0       0.958189011 3.27869 1       1.333982587 1.28486197
        1245   2206981 17.32500076 0.085600004 1      0      4      1      0       1           0.80321 0       1.905925155 1.171047248
        978    2512821 19.72599983 0.082800001 2      0      4      2      0       0.925986052 2.04499 1       2.170059204 1.229475265
        610    1544627 12.125      0.081799999 2      0      4      1      0       0.958268106 3.27869 1       1.333872437 1.284881181

    – DavBig Aug 23 '17 at 17:31
  • Sorry for the formatting... this is my first question. Those are four lines plus the header. The last column is the model prediction. All of those lines would use LM1 as shown in the question. Again, thank you very much for your help. If there is any other way to transfer you the data, please let me know! – DavBig Aug 23 '17 at 17:36
  • Ok thank you I'll check on that. – DavBig Aug 23 '17 at 17:38
  • @Hack-R I updated my question with regard to your comments. I hope it is more helpful. Thank you. – DavBig Aug 23 '17 at 19:25
  • Yes, much better. I am wrapping up at work, but I will try to solve your question this evening if possible. – Hack-R Aug 23 '17 at 19:43
  • OK, so, I've loaded your data and ran your model. Looks good. If you were to follow the description of the decision tree precisely, you should be able to replicate the results. Since that didn't happen in your case, we need to look for differences between what you did (i.e. did you do smoothing in the same way? etc) and what RWeka is doing under the hood. If nothing else, one difference could be that the regression functions you might use in R are different from those in Weka. I use both R and (Java) Weka, so I can help you investigate that if you provide your R attempt to recreate the tree. – Hack-R Aug 24 '17 at 13:54
  • I am not sure what you mean by "provide your R attempt to recreate the tree", but here is what I did. First, I use R Tools (Microsoft R Open 3.3.3.0) in Visual Studio 2015 Pro; that is where I got the first result I showed you. I reran the script with the same data and got the same result. Then I ran the same R script in RGui (R 3.3.3) and again got the same result. Finally, I ran the script in Weka Explorer and got the same result once more. So I still cannot reproduce the predictions by doing the math with the given tree and LMs. Thank you! – DavBig Aug 25 '17 at 12:47
  • Also, in my reading, I came across that phrase "Note that the final models that are output by M5P are the “smoothed” leaf node models (unless you have turned smoothing off). Smoothing produces a linear combination of all linear models along the path from the corresponding leaf node to the root node of the tree." here [link](http://weka.8497.n7.nabble.com/M5P-Model-Tree-Attribute-Selection-td39399.html) I wonder if this is some kind of hidden layer. Did you have the same results as me when you ran the script? – DavBig Aug 25 '17 at 12:59
  • By provide your attempt I mean show the code you ran to try to reproduce the result so that we can find the differences between your code and what RWeka did. – Hack-R Aug 25 '17 at 14:36
  • I basically reproduced the model tree to select the right linear model, and reproduced all the LMs to compute the value. The code is in Visual Basic. – DavBig Aug 25 '17 at 18:17
  • OK thanks. That was all that was missing. I will try to find more time to work on this later today. – Hack-R Aug 25 '17 at 18:49
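For reference, the smoothing quoted in the comments above has a standard form in the M5 literature: at every internal node on the path from leaf to root, the prediction is blended as p' = (n·p + k·q)/(n + k), where p is the value passed up from below, q is that node's own linear model, n is the number of training instances at the node, and k is a smoothing constant (15 in Weka's implementation, to my understanding). A minimal sketch follows; the node predictions and instance counts in the example are made up, since M5P does not print its internal-node models:

```python
# Sketch of M5 smoothing: blend the leaf prediction with each ancestor
# node's own model on the way to the root, using
#   p' = (n*p + k*q) / (n + k), with k = 15 in Weka.
# NOTE: the (n, q) pairs in the example below are hypothetical -- M5P
# only prints the leaf models, not the internal-node ones.

K = 15.0

def smooth(leaf_pred, path):
    """path: list of (n, q) pairs from the leaf's parent up to the root,
    where n = instances at that node and q = the node's own prediction."""
    p = leaf_pred
    for n, q in path:
        p = (n * p + K * q) / (n + K)
    return p

# Hypothetical example: a leaf predicts 1.7766, blended through two
# ancestor nodes covering 6 and 10 instances respectively.
print(smooth(1.7766, [(6, 1.70), (10, 1.65)]))
```

This would explain why applying the printed leaf LM alone does not reproduce the "(using smoothed linear models)" output shown in the question.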
