
I've trained a simple logistic regression model in SSAS, using Gender and NIC as discrete input nodes (NIC is 0 for non-smoker, 1 for smoker) with Score (0-100) as a continuous output node.

I want to predict the score based on a new participant's values for Gender and NIC. Of course, I can run a singleton query in DMX; for example, the following produces a value of 49.51....

  SELECT Predict(Score) 
  FROM [MyModel]
  NATURAL PREDICTION JOIN 
  (SELECT 'M' AS Gender, '1' AS NIC) as t

But instead of using DMX, I want to create a formula from the model in order to calculate scores while "disconnected" from SSAS.

Investigating the model, I have the following information in the NODE_DISTRIBUTION of the output node:

  ATTRIBUTE_NAME   ATTRIBUTE_VALUE   SUPPORT   PROBABILITY   VARIANCE      VALUETYPE
  Gender:F         0.459923854       0         0             0             7 (Coefficient)
  Gender:M         0.273306289       0         0             0             7 (Coefficient)
  Nic:0            -0.282281195      0         0             0             7 (Coefficient)
  Nic:1            -0.802106901      0         0             0             7 (Coefficient)
                   0.013983007       0         0             0.647513829   7 (Coefficient)
  Score            75.03691517       0         0             0             3 (Continuous)

Plugging these coefficients into a logistic regression formula -- that I am being disallowed from uploading as a new user :) -- for the smoking male example above,

  f(...) = 1 / (1 + exp(0 - (0.0139830071136734   -- Constant(?)
    + 0 * 0.459923853918008                       -- Gender:F = 0
    + 1 * 0.273306289390897                       -- Gender:M = 1
    + 1 * -0.802106900621717                      -- Nic:1 = 1
    + 0 * -0.282281195489355)))                   -- Nic:0 = 0

results in a value of 0.374.... But how do I "map" this value back to the score distribution of 0-100? In other words, how do I extend the equation above to produce the same value that the DMX singleton query does? I'm assuming it will require the stdev and mean of my Score distribution, but I'm stuck on exactly how to use those values. I'm also unsure whether I'm using the ATTRIBUTE_VALUE in the fifth row correctly as the constant.
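For anyone who wants to check the arithmetic, here is a minimal Python sketch of the calculation above. It treats the unnamed fifth row as the intercept, which is exactly the assumption I am unsure about, and the function name `logistic_output` is just something I made up for illustration. It only reproduces the 0.374 intermediate value; it does not attempt the mapping back to the 0-100 score.

```python
import math

# Coefficients copied from the NODE_DISTRIBUTION rows above.
COEFFICIENTS = {
    "Gender:F": 0.459923853918008,
    "Gender:M": 0.273306289390897,
    "Nic:0": -0.282281195489355,
    "Nic:1": -0.802106900621717,
}

# ASSUMPTION: the row with no ATTRIBUTE_NAME is the constant term.
INTERCEPT = 0.0139830071136734


def logistic_output(gender: str, nic: str) -> float:
    """Apply the standard logistic function to the summed coefficients.

    Exactly one Gender state and one Nic state are active, so the
    indicator terms for the inactive states drop out of the sum.
    """
    z = INTERCEPT + COEFFICIENTS[f"Gender:{gender}"] + COEFFICIENTS[f"Nic:{nic}"]
    return 1.0 / (1.0 + math.exp(-z))


# Smoking male, the same case as the DMX singleton query.
smoking_male = logistic_output("M", "1")
print(round(smoking_male, 3))  # about 0.374, the value quoted above
```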

Any help you can provide will be appreciated!

hoss
hbeam

2 Answers

1

I'm no expert, but it sounds to me like you don't want to use logistic regression at all. You want to train a linear regression. You currently have a logistic regression model; these are typically used for binary classification, not for continuous values such as 0-100.

How to do linear regression in SAS

Wikipedia: linear regression

More details: the answer really depends, like most data mining/machine learning problems, on your data. If your data is bimodal, say more than 90% of the training set is very close to either 0 or 100, then a logistic regression MIGHT be used. The equation used in logistic regression is specifically designed to render YES/NO answers. It is technically a continuous function, so results such as .34 are possible, but they are statistically very unlikely (in typical usage you would round down to 0).

However, if your data is normally distributed (most of nature is), the better method is linear regression. The only problem is that it CAN predict outside of your 0-100 range if given a particularly bad data point. In that case you would be best off either clipping the result to 0-100 or ignoring the data point as an outlier. For gender, a quick hack would be to map male to 0 and female to 1, then treat gender as a numeric input to the model.
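A rough Python sketch of those two hacks, just to make them concrete. The function names `encode_gender` and `clip_score` are made up for this example, and no actual regression is fitted here; the inputs to `clip_score` stand in for whatever a linear model would return.

```python
def encode_gender(gender: str) -> int:
    """Map male to 0 and female to 1, as suggested above."""
    mapping = {"M": 0, "F": 1}
    return mapping[gender]


def clip_score(prediction: float, low: float = 0.0, high: float = 100.0) -> float:
    """Force a raw linear-regression prediction into the [low, high] range."""
    return max(low, min(high, prediction))


# Hypothetical raw predictions, not values produced by any real model.
print(clip_score(104.2))  # clipped down to 100.0
print(clip_score(-3.5))   # clipped up to 0.0
print(clip_score(49.51))  # already in range, returned unchanged
```

Whether to clip or to drop the point as an outlier is a judgment call that depends on how often the model leaves the range.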

SSAS linear regression

Harry Moreno
  • Thanks Harry, I purposefully made a very simple example just so I could get my head around the problem. I'm actually stuck with the logistic regression model, but at least in SSAS, it does seem to support continuous values as an output, with the advantage that the formula constrains the output within 0-1, if I'm understanding: [Logistic Regression](http://msdn.microsoft.com/en-us/library/cc645904) – hbeam Jun 15 '12 at 03:16
  • yes, but the S-curve used is specifically intended to render a 0 or 1 (technically it is continuous due to mathematical properties, but simply scaling the result of this model is probably NOT what you want). In most cases, if you're not doing YES/NO classification, you probably want a linear regression. The problem then is if it predicts outside the range 1-100. You must address this by either classifying those instances as outliers or rounding (down to 100 or up to 1) in software. – Harry Moreno Jun 15 '12 at 03:25
0

You do not want to be using logistic regression if you are trying to model a score restricted to the interval [0,100]. Logistic regression is used to model either binary data or proportions based on a binomial distribution. Assuming a logit link function, what you are actually modelling with logistic regression is a function of a probability (the log of the odds), and as such the entire process is geared to give you values in the interval [0,1]. Trying to use this to map to a score does not seem to be the right type of analysis at all.
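To illustrate what the logit link means, here is a small Python sketch of the log-odds transform and its inverse. The function names are my own; this is the textbook definition, not anything specific to SSAS.

```python
import math


def logit(p: float) -> float:
    """Log of the odds p / (1 - p); defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z: float) -> float:
    """Map any real-valued linear predictor back to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


# The two functions undo each other, and the inverse stays between 0 and 1
# no matter how large or small the linear predictor is.
print(inverse_logit(logit(0.25)))
print(inverse_logit(-10.0), inverse_logit(0.0), inverse_logit(10.0))
```

The linear part of the model lives on the unbounded log-odds scale, and the inverse transform is what squeezes it into [0,1].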

In addition, I cannot see how regular linear regression will help you either: your fitted model will be capable of generating values well outside your target interval [0,100], and if you have to perform ad hoc truncation of values to this range, can you really be sure that your results have any effective meaning?

I would like to be able to point you to the type of analysis that you require but I have not encountered this type of analysis. My advice to you would be to abandon the logistic regression approach and consider joining the ALLSTAT mailing list used by professional statisticians and mathematicians and asking for advice there. Or something similar.

mathematician1975