In a GBM, each tree is fit to the 'pseudo-residuals' of the ensemble built so far [1].
I'm not sure exactly how these pseudo-residuals work, and I wonder how this plays out when you have a combination of:
- Binary classification problem
- Low response rate
- A reasonably low signal-to-noise ratio
In the example below, we have all three. I calculate residuals as Actual - Probability,
and since the response is binary, you end up with a highly bimodal distribution that is nearly identical to the response itself.
Decreasing the response rate further exacerbates the bimodality: the predicted probabilities move closer to zero, so the residuals sit even closer to either 0 or 1.
So I have a few questions here:
- How exactly would pseudo-residuals be calculated in this example? (I am fairly sure my calculation is wrong, quite apart from the fact that the initial tree should model the difference from the global mean.)
- Would the second tree be nearly identical to the first as a result?
- Are successive trees in a GBM more similar for problems with lower response rates?
- Does down-sampling the non-responders inherently change the model as a result?
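For reference, under log-loss (binomial deviance) the pseudo-residual is usually defined as the negative gradient of the loss with respect to the raw score. That works out to the label minus the sigmoid of the current score, with the starting score being the log-odds of the base rate. Here is a minimal pure-Python sketch, assuming that loss. The helper names and the toy labels are made up for illustration and are not part of the script in this post.

```python
import math


def sigmoid(z):
    """Map a raw score (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def initial_log_odds(y):
    """Constant starting score F0 = log(p / (1 - p)), where p is the base response rate."""
    p = sum(y) / len(y)
    return math.log(p / (1.0 - p))


def pseudo_residuals(y, scores):
    """Negative gradient of log-loss w.r.t. the raw score: y - sigmoid(F)."""
    return [yi - sigmoid(fi) for yi, fi in zip(y, scores)]


# Toy labels with a 10% response rate (illustrative, not the data from this post)
y = [1] + [0] * 9

# Before any tree is fit, every row gets the same score: the base-rate log-odds
f0 = initial_log_odds(y)
scores = [f0] * len(y)

# These are the targets the first regression tree would be fit to
residuals = pseudo_residuals(y, scores)
print(residuals)
```

With a 10% response rate, the starting residuals come out at roughly 0.9 for the positive row and -0.1 for each negative row, so under this definition they are bimodal from the very first round.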
import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
train_percent = 0.8
num_rows = 10000
remove_rate = 0.1
# Generate data
X, y = make_classification(n_samples=num_rows, flip_y=0.55)
# Drop each positive row whose random draw exceeds remove_rate, so only about
# 10% of the positives are kept and the sample becomes unbalanced
remove = (np.random.random(len(y)) > remove_rate) & (y == 1)
X, y = X[~remove], y[~remove]
print("Response Rate: " + str(y.mean()))
# Get train/test samples (data is pre-shuffled)
train_rows = int(train_percent * len(X))
X_train, y_train = X[:train_rows], y[:train_rows]
X_test, y_test = X[train_rows:], y[train_rows:]
# Fit a simple decision tree
clf = DecisionTreeClassifier(max_depth=4)
clf.fit(X_train, y_train)
# Predicted probability of the positive class (column 1)
pred = clf.predict_proba(X_test)[:, 1]
# Calculate roc auc
roc_auc = roc_auc_score(y_test, pred)
print("ROC AUC: " + str(roc_auc))
# Plot residuals (actual label minus predicted probability)
plt.style.use('ggplot')
plt.hist(y_test - pred)
plt.title('Residuals')
plt.show()