I have a data set of 1k records and my job is to do a decision algorithm based on those records. Here is what I can share:

  1. The target is a continuous value.

  2. Some of the predictors (or attributes) are continuous values, some of them are discrete, and some are arrays of discrete values (a record can have more than one value for that attribute).

My initial thoughts were to separate the arrays of discrete values and make each value an individual feature (predictor). For the continuous predictors I was thinking about randomly picking a few decision boundaries and seeing which one reduces the entropy the most, then building a decision tree (or a random forest) that uses standard deviation reduction when creating the tree.
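The preprocessing described above can be sketched in Python with scikit-learn: `MultiLabelBinarizer` turns each array of discrete values into one binary indicator feature per value, and a regression forest then splits continuous predictors by variance (MSE) reduction, which picks the same splits as standard deviation reduction. The records below are hypothetical toy data, not the actual data set:

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy records: one continuous predictor, one discrete
# predictor, and one array-of-discrete-values predictor.
continuous = np.array([[1.2], [3.4], [0.5], [2.8]])
discrete = np.array([[0], [1], [0], [1]])
tags = [["a", "b"], ["b"], ["a", "c"], ["c"]]
y = np.array([10.0, 20.0, 12.0, 18.0])  # continuous target

# Separate each array of discrete values into individual binary features.
mlb = MultiLabelBinarizer()
tag_features = mlb.fit_transform(tags)  # shape (4, 3): one column per value

X = np.hstack([continuous, discrete, tag_features])

# Regression forests choose splits on continuous predictors by reducing
# the variance of the target, i.e. standard deviation reduction.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
predictions = model.predict(X)
```

Note that scikit-learn already searches all candidate thresholds on a continuous predictor, so there is no need to pick decision boundaries at random.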

My question is: Am I on the right path? Is there a better way to do that?

1 Answer

I know this probably comes a bit late, but what you are searching for are model trees. Model trees are decision trees with continuous rather than categorical values in the leaves. In general these values are predicted by linear regression models. One of the more prominent model trees, and one that more or less suits your needs, is the M5 model tree introduced by Quinlan. Wang and Witten re-implemented M5 and extended its functionality so that it can handle both continuous and categorical attributes. Their version is called M5'; you can find an implementation e.g. in Weka. The only thing left would be to handle the arrays. However, your description is a bit generic in that respect. From what I gather, your choices are either flattening or, as you suggested, separating them.

Note that, since Wang and Witten's work, more sophisticated model trees have been introduced. However, M5' is robust and does not need any parameterization in its original formulation, which makes it easy to use.
