
I am trying to use StreamingLogisticRegressionWithSGD to build a CTR prediction model.

The documentation mentions that numFeatures should be constant.

The problem I am facing: since most of my variables are categorical, numFeatures has to be the total number of features after the categorical variables have been encoded and parsed into LabeledPoint format.

Suppose a categorical variable x1 has 10 distinct values in the current window.

But in the next window, some new values are added to x1 and the number of distinct values increases. How should I handle numFeatures in this case, since it will now change?

Basically, my question is: how should I handle new values of categorical variables in a streaming model?

Thanks, Kundan

Kundan Kumar

1 Answer


You should fill the missing columns with zeros and discard any newly encountered values in each window, so that the number of features stays the same as it was during training.

Let's consider a column city that has the values [NewYork, Paris, Tokyo] in the training set. Encoding it would result in three columns.

If during prediction you find the values [NewYork, Paris, Chicago, RioDeJaneiro], you should discard Chicago and RioDeJaneiro, then fill in zero for the column corresponding to Tokyo, so that the result still has three columns (one for each of [NewYork, Paris, Tokyo]).
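The city example above can be sketched in plain Python. This is only a minimal illustration, not Spark code: the names VOCAB, INDEX and encode_city are hypothetical, and it assumes the vocabulary is fixed once from the training data. Unseen values such as Chicago become all-zero rows, so every vector keeps the same width.

```python
# Hypothetical vocabulary, assumed to be fixed once from the training data.
VOCAB = ["NewYork", "Paris", "Tokyo"]
INDEX = {city: i for i, city in enumerate(VOCAB)}


def encode_city(city):
    """One-hot encode a city against the fixed vocabulary.

    Values not in VOCAB are discarded: they produce an all-zero vector,
    so the vector width never changes between windows.
    """
    vec = [0.0] * len(VOCAB)
    i = INDEX.get(city)
    if i is not None:
        vec[i] = 1.0
    return vec


# Values seen in a later window, including two that were never in training.
window = ["NewYork", "Paris", "Chicago", "RioDeJaneiro"]
encoded = [encode_city(c) for c in window]

for city, vec in zip(window, encoded):
    print(city, vec)
```

Each vector has length len(VOCAB), which would play the role of the constant numFeatures. In PySpark, such a vector could then be wrapped as LabeledPoint(label, vec) before being passed to the model.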

shanmuga
  • My main concern was how to handle new values of categorical variables in a new batch (training data). In the previous batch I had, say, [NewYork, Paris, Tokyo], and in the current batch the values are [NewYork, Paris, Chicago, RioDeJaneiro]. Since numFeatures should be constant in the streaming logistic regression model, how should I handle these new values? Thanks! – Kundan Kumar Jul 12 '16 at 11:53