
I'm trying to implement a genetic programming (GP) algorithm.

This is my first time using numpy/pandas/swifter, so perhaps that explains a lot.

I have a tree class (only relevant methods shown):

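import numpy as np
import pandas as pd
import swifter  # registers the .swifter accessor on DataFrames

FUNCTIONS = [np.add, np.subtract, np.multiply]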
ATTRIBUTES = ["Distance", "Haversine", "Temp", "Wind", "Humid", "Snow", "Dust"]

class GPTree:
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

    def compute_tree(self, row: pd.Series):  # with axis=1, apply passes each row as a Series
        if self.data in FUNCTIONS:
            return self.data(self.left.compute_tree(row), self.right.compute_tree(row))
        elif self.data in ATTRIBUTES:
            return row[self.data]
        else:
            return self.data

For each tree, I need to evaluate every row of the testing set against the tree. So far, this is the only version I've got working (in each row, row[0] is the value I'm trying to predict from the attributes in ATTRIBUTES):

def error(individual, dataset):
    return dataset.swifter.apply(lambda row: abs(individual.compute_tree(row) - row[0]), axis=1).mean()

This is no faster than dataset.apply(...) because, as I understand it, compute_tree is not a vectorised function. How can I vectorise compute_tree or otherwise make this part of my program faster?

Alternatively, is there some other representation I could use that would be faster? My dataset has ~10M rows; at my current speed it would be impossible to run the GP algorithm even on 10k rows.
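One thing worth noting here, as a minimal sketch rather than a definitive fix: compute_tree only ever indexes its argument by attribute name and applies numpy ufuncs, so it can be handed the whole DataFrame instead of a single row, in which case each attribute lookup returns a full column. The sample data below is made up for illustration, the "Target" column name is an assumption, and the target is assumed to be the first column (as row[0] suggests).

```python
import numpy as np
import pandas as pd

FUNCTIONS = [np.add, np.subtract, np.multiply]
ATTRIBUTES = ["Distance", "Haversine", "Temp", "Wind", "Humid", "Snow", "Dust"]


class GPTree:
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right

    def compute_tree(self, frame):
        # frame can be one row (Series) or the whole DataFrame;
        # with a DataFrame, frame["Temp"] is an entire column.
        if self.data in FUNCTIONS:
            return self.data(self.left.compute_tree(frame),
                             self.right.compute_tree(frame))
        elif self.data in ATTRIBUTES:
            return frame[self.data]
        else:
            return self.data  # numeric constant


def error(individual, dataset):
    # One pass over whole columns instead of one Python call per row.
    predicted = individual.compute_tree(dataset)
    target = dataset.iloc[:, 0]  # assumption: target is the first column
    return np.abs(predicted - target).mean()


# Made-up sample; "Target" is a hypothetical name for the predicted column.
sample = pd.DataFrame({
    "Target": [10.0, 20.0],
    "Distance": [1.0, 2.0],
    "Snow": [3.0, 4.0],
    "Temp": [2.0, 3.0],
})

# (mul (add Distance Snow) Temp)
tree = GPTree(np.multiply,
              GPTree(np.add, GPTree("Distance"), GPTree("Snow")),
              GPTree("Temp"))

print(error(tree, sample))
```

The tree is still walked recursively, but only once per individual rather than once per row, so the per-row Python overhead of apply(axis=1) goes away.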


I tried to write a vectorised version:

def error(individual, dataset):
    return dataset.swifter.apply(lambda row: comp_vec(individual, row), axis=1).mean()


def comp_vec(individual, row):
    return np.abs(np.subtract(individual.compute_tree_vec(row), row[0]))

# method of GPTree
def compute_tree_vec(self, row):
    return np.where(
        self.data in FUNCTIONS,
        self.data(self.left.compute_tree_vec(row), self.right.compute_tree_vec(row)),
        np.where(self.data in ATTRIBUTES, row[self.data], self.data)
    )

This failed with AttributeError: 'NoneType' object has no attribute 'compute_tree_vec'.
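A likely explanation for this AttributeError, sketched with a hypothetical helper named branch: np.where is an ordinary function, so Python evaluates all of its arguments before the call, whatever the condition turns out to be. In compute_tree_vec that means the recursive calls on self.left and self.right run even at a leaf, where both are None.

```python
import numpy as np

calls = []


def branch(name, value):
    # Hypothetical helper: records that this argument was evaluated.
    calls.append(name)
    return value


# Both branch(...) calls run before np.where ever sees the condition.
result = np.where(True, branch("taken", 1.0), branch("not taken", 2.0))

print(calls)
print(float(result))
```

So np.where cannot replace if/elif when one branch is only valid under the condition; the plain if/elif in compute_tree is the right control flow for the recursion.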


I tried to automatically vectorise compute_tree:

def tree_to_numpy(tree):
    def func(row):
        return tree.compute_tree(row)
    return np.vectorize(func)

def error(individual, dataset):
    v = tree_to_numpy(individual)
    return dataset.swifter.apply(lambda row: np.abs(np.subtract(v(row), row[0])), axis=1).mean()

This failed with the error:

File "/Users/arul/code/gp/main.py", line 100, in compute_tree
    return row[self.data]
IndexError: invalid index to scalar variable.

I've searched for both of these errors and tried what I found, but nothing has worked.
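For the IndexError from the np.vectorize attempt, a small sketch of what seems to be going on: np.vectorize converts its argument to an array and calls the wrapped function once per element, so the function receives individual numpy scalars rather than the row. The lookup function and the two-field row below are made up for illustration.

```python
import numpy as np
import pandas as pd

row = pd.Series({"Distance": 1.0, "Snow": 3.0})  # made-up row
seen = []
raised = False


def lookup(r):
    # Record what np.vectorize actually passes in.
    seen.append(type(r).__name__)
    return r["Distance"]  # fails: r is a scalar, not the row


try:
    np.vectorize(lookup)(row)
except IndexError as exc:
    raised = True
    print("IndexError:", exc)

print(seen)
```

np.vectorize is also documented as a convenience wrapper around a Python loop, so even when it applies it would not be expected to make compute_tree faster.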

Another idea was to somehow "compile" the tree into a vectorised numpy function. For example, the tree (mul (add distance snow) temp) would become np.multiply(np.add(distance, snow), temp), so that it could be evaluated on whole columns at once. I have no idea how to do that, though (substitute the names into the function and still run it?).
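One way this "compile" idea could be sketched is with nested closures instead of generated source text. The names compile_tree and to_source are hypothetical, the sample data is made up, and the target is again assumed to be the first column. Each node becomes a small function of the DataFrame, and the FUNCTIONS/ATTRIBUTES membership checks happen once at compile time rather than on every evaluation.

```python
import numpy as np
import pandas as pd

FUNCTIONS = [np.add, np.subtract, np.multiply]
ATTRIBUTES = ["Distance", "Haversine", "Temp", "Wind", "Humid", "Snow", "Dust"]


class GPTree:
    def __init__(self, data=None, left=None, right=None):
        self.data = data
        self.left = left
        self.right = right


def compile_tree(tree):
    """Turn a tree into a function mapping a DataFrame to predictions."""
    if tree.data in FUNCTIONS:
        op = tree.data
        left = compile_tree(tree.left)
        right = compile_tree(tree.right)
        return lambda frame: op(left(frame), right(frame))
    if tree.data in ATTRIBUTES:
        name = tree.data
        return lambda frame: frame[name].to_numpy()  # whole column
    const = tree.data
    return lambda frame: const  # numeric constant, broadcast by numpy


def to_source(tree):
    """Render the tree as the equivalent numpy expression, for inspection."""
    if tree.data in FUNCTIONS:
        return (f"np.{tree.data.__name__}"
                f"({to_source(tree.left)}, {to_source(tree.right)})")
    return str(tree.data)


def error(predict, dataset):
    target = dataset.iloc[:, 0].to_numpy()  # assumption: target is column 0
    return np.abs(predict(dataset) - target).mean()


# Made-up sample; "Target" is a hypothetical column name.
sample = pd.DataFrame({
    "Target": [10.0, 20.0],
    "Distance": [1.0, 2.0],
    "Snow": [3.0, 4.0],
    "Temp": [2.0, 3.0],
})
tree = GPTree(np.multiply,
              GPTree(np.add, GPTree("Distance"), GPTree("Snow")),
              GPTree("Temp"))

compiled = compile_tree(tree)
print(to_source(tree))
print(error(compiled, sample))
```

The compiled function can be reused across datasets without touching the tree again; whether that is measurably faster than calling compute_tree on the whole DataFrame would need profiling on the real data.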


Code with sample of dataset: https://gist.github.com/arulagrawal/576b722d252e8a9110a73502ecb1718a
