I'm trying to implement a genetic programming (GP) algorithm.
This is my first time using numpy/pandas/swifter, so perhaps that explains a lot.
I have a tree class (only relevant methods shown):
import numpy as np
import pandas as pd
import swifter  # registers the .swifter accessor used below

FUNCTIONS = [np.add, np.subtract, np.multiply]
ATTRIBUTES = ["Distance", "Haversine", "Temp", "Wind", "Humid", "Snow", "Dust"]

class GPTree:
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

    def compute_tree(self, row: pd.DataFrame):
        # Internal node: apply the function to both evaluated subtrees.
        if self.data in FUNCTIONS:
            return self.data(self.left.compute_tree(row), self.right.compute_tree(row))
        # Attribute leaf: look the value up by name.
        elif self.data in ATTRIBUTES:
            return row[self.data]
        # Constant leaf.
        else:
            return self.data
For each tree, I need to evaluate every row of the testing set against the tree. So far this is the only version I've got working (in each row, row[0] is the value I'm trying to predict from the attributes in ATTRIBUTES):
def error(individual, dataset):
    return dataset.swifter.apply(lambda row: abs(individual.compute_tree(row) - row[0]), axis=1).mean()
This is no faster than dataset.apply(...) since (by my understanding) compute_tree is not a vectorised function. How can I vectorise compute_tree or otherwise make this part of my program faster?
Alternatively, is there some other representation I can use which is faster? I have a dataset with ~10M rows, and at my current speed it would be impossible to run the GP algorithm on even 10k rows.
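A possible shortcut, sketched under assumptions rather than tested on the real data: compute_tree only indexes its argument by attribute name and applies numpy ufuncs, so passing the whole DataFrame instead of a single row should make every node operate on full columns. The helper name error_columnwise, the sample values, and the assumption that the target sits in the first column are illustrative only.

```python
import numpy as np
import pandas as pd

FUNCTIONS = [np.add, np.subtract, np.multiply]
ATTRIBUTES = ["Distance", "Haversine", "Temp", "Wind", "Humid", "Snow", "Dust"]


class GPTree:
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

    def compute_tree(self, table):
        # `table` may be one row (Series) or a whole DataFrame;
        # for a DataFrame, table[name] is an entire column.
        if self.data in FUNCTIONS:
            return self.data(self.left.compute_tree(table), self.right.compute_tree(table))
        elif self.data in ATTRIBUTES:
            return table[self.data]
        else:
            return self.data


def error_columnwise(individual, dataset):
    # Hypothetical helper: mean absolute error with a single walk of the tree.
    predicted = individual.compute_tree(dataset)
    target = dataset.iloc[:, 0]  # assumption: the target is the first column
    return np.abs(predicted - target).mean()


# Made-up sample data, not the real dataset.
df = pd.DataFrame({
    "Target": [10.0, 20.0, 30.0],
    "Distance": [1.0, 2.0, 3.0],
    "Snow": [0.5, 0.5, 0.5],
    "Temp": [4.0, 5.0, 6.0],
})

# (mul (add Distance Snow) Temp)
tree = GPTree(np.multiply, GPTree(np.add, GPTree("Distance"), GPTree("Snow")), GPTree("Temp"))
print(error_columnwise(tree, df))
```

If this holds for the real data, the tree is walked once per individual instead of once per row, and the per-row work happens inside numpy.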
I tried to write a vectorised version:
def error(individual, dataset):
    return dataset.swifter.apply(lambda row: comp_vec(individual, row), axis=1).mean()

def comp_vec(individual, row):
    return np.abs(np.subtract(individual.compute_tree_vec(row), row[0]))

# Added as a method of GPTree.
def compute_tree_vec(self, row):
    return np.where(
        self.data in FUNCTIONS,
        self.data(self.left.compute_tree_vec(row), self.right.compute_tree_vec(row)),
        np.where(self.data in ATTRIBUTES, row[self.data], self.data)
    )
This did not work, failing with the error AttributeError: 'NoneType' object has no attribute 'compute_tree_vec'.
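A minimal sketch of what seems to cause this error, as far as I can tell: np.where is an ordinary function, so Python evaluates all three arguments before the call happens. The recursive call would therefore run even on leaf nodes, whose left and right are None. The names branch and Leaf below are made up for the demonstration.

```python
import numpy as np

calls = []


def branch(name, value):
    # Record that this branch was evaluated, then return its value.
    calls.append(name)
    return value


# The condition is True, yet both branches are computed before np.where runs.
result = np.where(True, branch("taken", 1.0), branch("not taken", 2.0))
print(calls)
print(float(result))


class Leaf:
    # Stand-in for a leaf node, which has no children.
    left = None


try:
    # The second argument is evaluated first, regardless of the condition.
    np.where(False, Leaf.left.compute_tree_vec(None), 0.0)
    message = None
except AttributeError as exc:
    message = str(exc)

print(message)
```

A plain if/elif/else, as in the original compute_tree, only evaluates the branch that is taken, which is why it does not hit this problem.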
I tried to automatically vectorise compute_tree:
def tree_to_numpy(tree):
    def func(row):
        return tree.compute_tree(row)
    return np.vectorize(func)

def error(individual, dataset):
    v = tree_to_numpy(individual)
    return dataset.swifter.apply(lambda row: np.abs(np.subtract(v(row), row[0])), axis=1).mean()
This did not work either, failing with the error:
  File "/Users/arul/code/gp/main.py", line 100, in compute_tree
    return row[self.data]
IndexError: invalid index to scalar variable.
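A small sketch of what appears to happen here, assuming I am reading the behaviour correctly: np.vectorize calls the wrapped function once per element of its input, so the function receives single scalar values rather than the whole row, and indexing a scalar by column name fails. The row contents and the names record and lookup are made up for the demonstration.

```python
import numpy as np
import pandas as pd

row = pd.Series({"Distance": 1.0, "Snow": 0.5})  # made-up row

received_series = []


def record(value):
    # Note whether the wrapped function ever sees the whole Series.
    received_series.append(isinstance(value, pd.Series))
    return 0.0


np.vectorize(record)(row)
print(any(received_series))  # whether a Series was ever passed in


def lookup(value):
    # Same access pattern as `row[self.data]` in compute_tree.
    return value["Distance"]


try:
    np.vectorize(lookup)(row)
    failure = None
except (IndexError, TypeError) as exc:
    failure = exc

print(type(failure).__name__, failure)
```

As far as I understand, np.vectorize is also essentially a loop in Python, so even where it works it would not be expected to give a real speed-up.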
I've tried searching for both of these errors and implementing what I found, but nothing has worked.
Another idea I had was to somehow "compile" the tree into a numpy vectorised function. For example, the tree (mul (add distance snow) temp) would become np.multiply(np.add(distance, snow), temp), so that it can be run in a vectorised fashion. I have no idea how to do that, though (perhaps by substituting the column names into the function and still running it?).
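One way the "compile" idea could be sketched, under assumptions: walk the tree once and build nested closures, so that the result is a plain function taking a mapping of column name to numpy array. The helpers compile_tree and to_expression, the "Target" column, and the sample values are hypothetical and only illustrate the approach.

```python
import numpy as np
import pandas as pd

FUNCTIONS = [np.add, np.subtract, np.multiply]
ATTRIBUTES = ["Distance", "Haversine", "Temp", "Wind", "Humid", "Snow", "Dust"]


class GPTree:
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right


def compile_tree(node):
    """Turn a tree into a function of a {column name: array} mapping."""
    if node.data in FUNCTIONS:
        op = node.data
        left = compile_tree(node.left)
        right = compile_tree(node.right)
        return lambda cols: op(left(cols), right(cols))
    if node.data in ATTRIBUTES:
        name = node.data
        return lambda cols: cols[name]
    constant = node.data
    return lambda cols: constant


def to_expression(node):
    """Render the tree as the equivalent numpy expression, for inspection only."""
    if node.data in FUNCTIONS:
        return f"np.{node.data.__name__}({to_expression(node.left)}, {to_expression(node.right)})"
    return str(node.data)


# Made-up sample data, not the real dataset.
df = pd.DataFrame({
    "Target": [10.0, 20.0, 30.0],
    "Distance": [1.0, 2.0, 3.0],
    "Snow": [0.5, 0.5, 0.5],
    "Temp": [4.0, 5.0, 6.0],
})

# (mul (add Distance Snow) Temp)
tree = GPTree(np.multiply, GPTree(np.add, GPTree("Distance"), GPTree("Snow")), GPTree("Temp"))

predict = compile_tree(tree)
columns = {name: df[name].to_numpy() for name in df.columns if name in ATTRIBUTES}
predicted = predict(columns)
mae = np.abs(predicted - df["Target"].to_numpy()).mean()

print(to_expression(tree))
print(mae)
```

The column arrays can be extracted once and reused for every individual in the population, so only the compiled function changes between evaluations.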
Code with sample of dataset: https://gist.github.com/arulagrawal/576b722d252e8a9110a73502ecb1718a