2

Random Forest accepts numerical data. Usually features with text data is converted to numerical categories and continuous numerical data is fed as it is without discretization. How the RF treat the continuous data for creating nodes? Will it bin the continuous numerical data internally? or treat each data as discrete level.

for example: I want to feed a data set(ofcourse after categorizing the text features) to RF. How the continuous data is handled by the RF? Is it advisable to discretize the continuous data(longitudes and latitudes, in this case) before feeding? Or doing so information is lost?

image showing the continuous data in longitude and latitude features

Sachinmm
  • 551
  • 1
  • 5
  • 8

2 Answers2

3

As far as I understand, you are asking how the threshold is chosen for continuous features. The binning occurs at values, where your class is changed. For example, consider the following 1D dataset with x as feature and y as class variable

x = [ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [ 1, 1, 0, 0, 0, 0, 0, 1, 1, 1]

The two possible candidate cuts will be considered: (i) between 2 and 3 (will practically look like as x<2.5) and (ii) between 7 and 8 (as x<7.5). Among these two candidates the second one will be chosen since it provides a better separation. Them the algorithm goes to the next step.

Therefore it is not advisable to discretize the data yourself. Think about this with the data above. If, for example, you discretize the data in 5 bins [1, 2 | 3, 4 | 5, 6 | 7, 8 | 9, 10], you miss the best split (since 7 and 8 will be in one bin).

lanenok
  • 2,699
  • 17
  • 24
1

You are asking about DecisionTrees. Because RandomForest is ensemble model, and by itself it don't know anything about data, it fully relies on decisons from base estimators (In this case DecisionTrees), and aggregates them.

So, how DecisionTree is treating continious features: Look at this official documentation page. DecisionTreeClassifier was fitted on continuous dataset (Fisher irises), if you will look at the picture of tree - it has threshold value in each node over some chosen feature at this node.

Ibraim Ganiev
  • 8,934
  • 3
  • 33
  • 52