I am trying to use PyGAD to optimize hyper-parameters in ML models. According to the documentation:
The gene_space parameter customizes the space of values of each gene ... list, tuple, numpy.ndarray, or any range like range, numpy.arange(), or numpy.linspace: It holds the space for each individual gene. But this space is usually discrete. That is there is a set of finite values to select from.
As you can see, the first element of gene_space, which corresponds to solution[0] in the Genetic Algorithm definition, is an array of integers. According to the documentation, this should be a discrete space, which it is. However, when a value from this array of integers (built with np.linspace, which is okay to use) reaches the Random Forest Classifier, it arrives as a numpy.float64 (see the error in the third code block).
I don't understand where this change of data type is occurring. Is this a PyGAD problem, and if so, how can I fix it? Or is it a numpy -> sklearn problem?
gene_space = [
    # n_estimators
    np.linspace(50, 200, 25, dtype='int'),
    # min_samples_split
    np.linspace(2, 10, 5, dtype='int'),
    # min_samples_leaf
    np.linspace(1, 10, 5, dtype='int'),
    # min_impurity_decrease
    np.linspace(0, 1, 10, dtype='float')
]
The definition of the Genetic Algorithm:
def fitness_function_factory(data=data, y_name='y', sample_size=100):
    def fitness_function(solution, solution_idx):
        model = RandomForestClassifier(
            n_estimators=solution[0],
            min_samples_split=solution[1],
            min_samples_leaf=solution[2],
            min_impurity_decrease=solution[3]
        )
        X = data.drop(columns=[y_name])
        y = data[y_name]
        X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                            test_size=0.5)
        train_idx = sample_without_replacement(n_population=len(X_train),
                                               n_samples=sample_size)
        test_idx = sample_without_replacement(n_population=len(X_test),
                                              n_samples=sample_size)
        model.fit(X_train.iloc[train_idx], y_train.iloc[train_idx])
        fitness = model.score(X_test.iloc[test_idx], y_test.iloc[test_idx])
        return fitness
    return fitness_function
And the instantiation of the Genetic Algorithm:
cross_validate = pygad.GA(gene_space=gene_space,
                          fitness_func=fitness_function_factory(),
                          num_generations=100,
                          num_parents_mating=2,
                          sol_per_pop=8,
                          num_genes=len(gene_space),
                          parent_selection_type='sss',
                          keep_parents=2,
                          crossover_type="single_point",
                          mutation_type="random",
                          mutation_percent_genes=25)
cross_validate.best_solution()
>>>
ValueError: n_estimators must be an integer, got <class 'numpy.float64'>.
Any recommendations on resolving this error?
EDIT: I tried the following, which works without error:
model = RandomForestClassifier(n_estimators=gene_space[0][0])
model.fit(X,y)
So the issue does not lie with numpy -> sklearn, but with PyGAD.
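One plausible explanation, sketched below with NumPy only: if the GA keeps each solution in a float array (an assumption about PyGAD's internals, not something confirmed above), then an integer drawn from gene_space is silently stored as numpy.float64 by the time the fitness function sees it. The `solution` array here is a hand-made stand-in for what the GA would pass in, and `decode_solution` is a hypothetical helper, not part of PyGAD or sklearn.

```python
import numpy as np

# Same discrete space as the first gene (n_estimators) in the question.
n_estimators_space = np.linspace(50, 200, 25, dtype='int')

# A value read straight from the space is a NumPy integer.
direct_value = n_estimators_space[0]

# Assumption: the GA stores a whole solution in one float array.
# This array is a hand-made stand-in, not produced by PyGAD.
solution = np.array([n_estimators_space[0], 2, 1, 0.0], dtype=float)

# The same value read back from the float array is now numpy.float64.
stored_value = solution[0]


def decode_solution(solution):
    """Hypothetical helper: cast genes back to the types sklearn expects."""
    return {
        'n_estimators': int(round(solution[0])),
        'min_samples_split': int(round(solution[1])),
        'min_samples_leaf': int(round(solution[2])),
        'min_impurity_decrease': float(solution[3]),
    }


params = decode_solution(solution)
print(type(direct_value), type(stored_value), type(params['n_estimators']))
```

If this is indeed the cause, passing `**decode_solution(solution)` to RandomForestClassifier inside the fitness function should avoid the ValueError. PyGAD's constructor also appears to accept a gene_type argument for setting per-gene types, which may be worth checking against the documentation for the installed version.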