I am doing this as part of a university assignment, but I can't find any resources online on how to implement it correctly. I have read a lot of material on the metrics that define an optimal split (entropy, Gini and others), so I understand how to choose the optimal value of a feature for splitting the training set into left and right nodes.
However, what I don't understand is how to keep the implementation efficient. We also have to choose the optimal feature, and for each feature every candidate threshold requires another pass over the data, so finding the best split at a single node takes O(n^2) per feature. That is bad considering that real ML datasets have shapes of about 10^2 x 10^6, so the computational cost becomes very high.
Am I missing some kind of approach that could be used here to help reduce complexity?
I currently have this baseline implementation for choosing the best feature and value to split on, but I really want to improve it:
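    # Inside the split-search method: try every feature and every candidate threshold.
    feature_idx, threshold, tr_G = None, None, float("inf")
    m = self.min_samples_split
    for f_idx in range(X_subset.shape[1]):
        sorted_values = X_subset.iloc[:, f_idx].sort_values()
        # Skip the extremes; intended to leave roughly min_samples_split rows on each side.
        for v in sorted_values.iloc[m - 1 : len(sorted_values) - m + 1]:
            y_left, y_right = self.make_split_only_y(f_idx, v, X_subset, y_subset)
            G = calc_g(y_subset, y_left, y_right)
            # Lower G is treated as better; keep the best split seen so far.
            if G < tr_G:
                threshold = v
                feature_idx = f_idx
                tr_G = G
    return feature_idx, threshold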
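
For comparison, one commonly described way to reduce the cost is to sort each feature once and then sweep through the sorted values, moving one sample at a time from the right child to the left and updating per-class counts incrementally. Each candidate threshold then costs O(k) for k classes instead of O(n), so a feature costs roughly O(n log n + n*k) rather than O(n^2). The sketch below is only an illustration under several assumptions: NumPy arrays instead of pandas, integer class labels 0..k-1, Gini impurity as the criterion, and hypothetical names (`gini_from_counts`, `best_split_sweep`, `min_leaf`) that are not part of the code above.

```python
import numpy as np

def gini_from_counts(counts, total):
    """Gini impurity of a node given its per-class counts (hypothetical helper)."""
    if total == 0:
        return 0.0
    p = counts / total
    return 1.0 - np.sum(p * p)

def best_split_sweep(X, y, n_classes, min_leaf=1):
    """Return (feature_idx, threshold, weighted_gini), or (None, None, inf) if no split.

    Rows with X[:, feature_idx] <= threshold are meant to go to the left child.
    Assumes y holds integer labels in 0..n_classes-1.
    """
    n, d = X.shape
    best = (None, None, np.inf)
    total_counts = np.bincount(y, minlength=n_classes).astype(float)
    for f in range(d):
        order = np.argsort(X[:, f], kind="stable")  # one sort per feature
        xs, ys = X[order, f], y[order]
        left = np.zeros(n_classes)
        right = total_counts.copy()
        for i in range(n - 1):
            # Move sample i from the right child to the left child.
            left[ys[i]] += 1
            right[ys[i]] -= 1
            n_left, n_right = i + 1, n - i - 1
            if xs[i] == xs[i + 1]:
                continue  # no threshold can separate equal values
            if n_left < min_leaf or n_right < min_leaf:
                continue
            # Weighted impurity of the two children, computed from counts only.
            score = (n_left * gini_from_counts(left, n_left)
                     + n_right * gini_from_counts(right, n_right)) / n
            if score < best[2]:
                best = (f, (xs[i] + xs[i + 1]) / 2.0, score)
    return best
```

Usage would look like `best_split_sweep(X_subset.to_numpy(), y_subset.to_numpy(), n_classes)`, assuming the labels are already integer-encoded. The inner Python loop is still slow for 10^6 rows; the point of the sketch is only the change in asymptotic cost, and the sweep could be vectorized with cumulative sums if needed.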