
I've tried several different methods, some of which I found on here, including making a Node class and using nested dictionaries, but I can't seem to get them to work.

My code currently takes in several lines of DNA (a, t, g, c) and stores them as a numpy array. It then finds the attribute that gives the most gain and splits the data into 4 new numpy arrays (depending on whether an a, t, g, or c is present at that attribute).

I'm unable to write a recursive function that can build the tree. I'm quite new to Python and to programming in general, so please describe in detail what I should do.

Thanks for any help

user3312146

3 Answers


If you want to implement a decision tree from scratch, I recommend building your tree using classes. A tree is composed of nodes: each node contains further nodes recursively, and leaves are the terminal nodes. For a binary tree, these classes can look something like:

class Node(object):
    def __init__(self):
        self.split_variable = None
        self.left_child = None
        self.right_child = None

    def get_name(self):
        return 'Node'

class Leaf(object):
    def __init__(self):
        self.value = None

    def get_name(self):
        return 'Leaf'

For the Node class: 'split_variable' holds the name of the variable used in the split (here one of a, t, g, c), and 'left_child' and 'right_child' are new instances of Node or Leaf. The True/False presence of that variable maps to the left/right children. (In the case of a regression tree you'll need to add a fourth attribute, 'split_value', to the Node class and map values below/above it to the left/right children.)

For the Leaf class: 'value' holds the value assigned to the tree's class variable at that leaf (i.e. the majority class for a discrete variable, or the mean for a continuous one).
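
The recursive build step the question asks about could be sketched with these two classes as follows. This is only a minimal sketch, not a definitive implementation. It assumes Python 3, that each sample is a dict mapping a variable name to True/False, and that the split is chosen by information gain. The helper names (entropy, information_gain, make_leaf, build_tree) and the variable names such as 'pos0_is_a' are hypothetical.

```python
import math
from collections import Counter


class Node(object):
    def __init__(self):
        self.split_variable = None
        self.left_child = None   # branch taken when the variable is True
        self.right_child = None  # branch taken when the variable is False

    def get_name(self):
        return 'Node'


class Leaf(object):
    def __init__(self):
        self.value = None

    def get_name(self):
        return 'Leaf'


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(rows, labels, variable):
    """Entropy reduction obtained by splitting on a True/False variable."""
    true_labels = [l for r, l in zip(rows, labels) if r[variable]]
    false_labels = [l for r, l in zip(rows, labels) if not r[variable]]
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in (true_labels, false_labels) if part)
    return entropy(labels) - remainder


def make_leaf(labels):
    """Create a Leaf holding the majority label."""
    leaf = Leaf()
    leaf.value = Counter(labels).most_common(1)[0][0]
    return leaf


def build_tree(rows, labels, variables):
    """Recursively build a binary tree from boolean-valued samples."""
    # Base case: all labels agree, or there is nothing left to split on.
    if len(set(labels)) == 1 or not variables:
        return make_leaf(labels)

    best = max(variables, key=lambda v: information_gain(rows, labels, v))
    true_part = [(r, l) for r, l in zip(rows, labels) if r[best]]
    false_part = [(r, l) for r, l in zip(rows, labels) if not r[best]]

    # Base case: the best split does not separate the data at all.
    if not true_part or not false_part:
        return make_leaf(labels)

    remaining = [v for v in variables if v != best]
    node = Node()
    node.split_variable = best
    # Recursive case: build one subtree per side of the split.
    node.left_child = build_tree([r for r, _ in true_part],
                                 [l for _, l in true_part], remaining)
    node.right_child = build_tree([r for r, _ in false_part],
                                  [l for _, l in false_part], remaining)
    return node


# Hypothetical toy data: two boolean variables, two classes.
rows = [
    {'pos0_is_a': True, 'pos1_is_g': True},
    {'pos0_is_a': True, 'pos1_is_g': False},
    {'pos0_is_a': False, 'pos1_is_g': True},
    {'pos0_is_a': False, 'pos1_is_g': False},
]
labels = ['yes', 'yes', 'no', 'no']
tree = build_tree(rows, labels, ['pos0_is_a', 'pos1_is_g'])
```

The key idea is that build_tree returns either a Leaf (the base cases) or a Node whose children are themselves the result of calling build_tree on a subset of the data.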

To complete your implementation you'll need functions that walk your tree to evaluate and/or visualise it. These functions call themselves recursively until they have walked through the tree. This is where you can use the get_name() methods of the classes to tell nodes from leaves. How you implement this part really depends on how you store your data; I suggest using pandas DataFrames, which are like tables. A sample evaluate function could be (pseudocode):

def evaluate_tree(your_data, node):
    # your_data maps each variable name to True/False for one sample
    if your_data[node.split_variable]:
        if node.left_child.get_name() == 'Node':
            # return the result of the recursive call, otherwise it is lost
            return evaluate_tree(your_data, node.left_child)
        elif node.left_child.get_name() == 'Leaf':
            return node.left_child.value
    else:
        if node.right_child.get_name() == 'Node':
            return evaluate_tree(your_data, node.right_child)
        elif node.right_child.get_name() == 'Leaf':
            return node.right_child.value

Good luck!

prl900

A dict is probably what you want.

An example of a node:

{'sex': {'yes': 'send email', 'no': 'not send email'}}
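
A tree stored this way can be walked with a short recursive function. The following is a rough sketch rather than a definitive implementation: it assumes every inner dict has exactly one key (the attribute being tested) whose value maps each attribute value to either a subtree or a final decision. The function name classify and the dna_tree example, including its class labels, are hypothetical.

```python
def classify(tree, sample):
    """Walk a nested-dict tree until a non-dict value (a leaf) is reached."""
    if not isinstance(tree, dict):
        return tree  # leaf: the stored decision
    attribute = next(iter(tree))   # the single key is the attribute tested
    branches = tree[attribute]     # maps attribute value -> subtree or leaf
    return classify(branches[sample[attribute]], sample)


# The node from the example above.
email_tree = {'sex': {'yes': 'send email', 'no': 'not send email'}}
email_result = classify(email_tree, {'sex': 'yes'})

# Hypothetical four-way DNA tree: keys are positions in the sequence,
# and each position branches on the base found there.
dna_tree = {0: {'a': 'positive',
                't': 'negative',
                'g': {1: {'a': 'positive', 't': 'negative',
                          'g': 'negative', 'c': 'positive'}},
                'c': 'negative'}}
dna_result = classify(dna_tree, 'gc')
```

Because each branch dict can hold any number of keys, this layout handles the four-way a/t/g/c split from the question without needing left/right children.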
pinseng

If you just want to use a decision tree in Python, you can use the decision tree module from scikit-learn rather than writing your own decision tree class and logic: http://scikit-learn.org/stable/modules/tree.html. With the scikit-learn decision tree module you can keep the decision tree objects in memory, or write certain attributes of the tree to a file or database.

scikit-learn, along with the other Python libraries that are part of the Anaconda distribution, is pretty much the standard for data exploration and analysis in Python. You can get Anaconda from Continuum here: http://continuum.io/downloads

EDIT 1

I came across this on Hacker News. It's about building a decision tree in Python using PostgreSQL as the database you pull values from. It might be interesting to check out: http://www.garysieling.com/blog/building-decision-tree-python-postgres-data

Chris Clouten
  • This is what I want, but I really would like to learn how to implement the decision tree myself. I've asked some fellow programmers and they suggest using classes. However, I'm still a bit oblivious as to how to implement a "class Node:" in order to get my desired outcome. – user3312146 Feb 16 '14 at 08:45
  • A great place to start is to download the scikit-learn source code and look at how it implements decision trees -- it's something I've done before with NumPy and matrix multiplication. You probably won't write code that is as fast or as optimized as scikit-learn's, but you'll understand how it's implemented. – Chris Clouten Feb 17 '14 at 00:46