I am trying to write a flex+bison scanner and parser for trees in the Newick file format, so that I can do operations on them. The grammar I implemented is based on a simplification of this example (labels and lengths are always of the same type, as returned by flex).
This is essentially a parser for a file format that represents a tree as a series of (recursive) subtrees and/or leaves. The main tree always ends with ;. The main tree and every subtree within it contain a list of nodes between ( and ), followed, to the right of the closing parenthesis, by an optional name and an optional :length. Either one can be omitted, so a node may carry neither, just name, just :length, or both as name:length.
If a node lacks a name or a length, a default value is applied (for example, 'missingName' and '1').
Two examples would be (child1:4, child2:6)root:6; and ((child1Of1:2, child2Of1:9)child1:5, child2:6)root:6;
The implementation of that grammar is the following (NOTE: I translated my own code, as it was written in my language, and removed a lot of side stuff for clarity):
%union {
struct node {
char* name; /* the node's assigned name, either from the file or from default values */
float length; /* the node's length */
float lengthAcum; /* accumulated length, used in the subtree rule below */
} dataOfNode;
}
%start tree
%token<dataOfNode> OP CP COMMA SEMICOLON COLON DISTANCE NAME
%type<dataOfNode> tree subtrees recursive_subtrees subtree leaf
%%
tree: subtrees NAME COLON DISTANCE SEMICOLON {} // with name and distance
| subtrees NAME SEMICOLON {} // without distance
| subtrees COLON DISTANCE SEMICOLON {} // without name
| subtrees SEMICOLON {} // without name nor distance
;
subtrees: OP recursive_subtrees CP {}
;
recursive_subtrees: subtree {} // just one subtree, or the last one of the list
| recursive_subtrees COMMA subtree {} // (subtree, subtree, subtree...)
;
subtree: subtrees NAME COLON DISTANCE { $$.name = $2.name; $$.length = $4.length; $$.lengthAcum = $$.lengthAcum + $4.length; } // group of subtrees, same as the main tree but without ";" at the end, with name and distance
| subtrees NAME { $$.name= $2.name; $$.length = 1.0;} // without distance
| subtrees COLON DISTANCE { $$.name= "missingName"; $$.length = $3.length;} // without name
| subtrees { $$.name= "missingName"; $$.length = 1.0;} // without name nor distance
| leaf { $$.name= $1.name; $$.length = $1.length;} // a leaf
;
leaf: NAME COLON DISTANCE { $$.name= $1.name; $$.length = $3.length;} // with name and distance
| NAME { $$.name= $1.name; $$.length = 1.0;} // without distance
| COLON DISTANCE { $$.name= "missingName"; $$.length = $2.length;} // without name
| { $$.name= "missingName"; $$.length = 1.0;} // without name nor distance
;
%%
Now, let's say that I want to identify the parent of each subtree and leaf, so that I can add the length of the "longest" child to the length of its parent subtree, recursively.
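To make that goal concrete, here is a minimal sketch in plain C (outside of bison) of the computation I am after. It assumes the parser has somehow already built an in-memory tree; the struct tree_node and the function accumulated_length are hypothetical names for illustration only and are not part of my grammar.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical in-memory tree; none of these names come from the grammar. */
struct tree_node {
    const char *name;           /* node name, or "missingName" by default */
    float length;               /* the node's own length, or 1.0 by default */
    struct tree_node *child;    /* first child, NULL for a leaf */
    struct tree_node *sibling;  /* next node under the same parent, or NULL */
};

/* Returns the node's own length plus the largest accumulated
   length found among its direct children (0 for a leaf). */
float accumulated_length(const struct tree_node *n)
{
    assert(n != NULL);
    float longest = 0.0f;
    for (const struct tree_node *c = n->child; c != NULL; c = c->sibling) {
        float len = accumulated_length(c);
        if (len > longest)
            longest = len;
    }
    return n->length + longest;
}
```

For the second example tree above, this would give 5 + 9 = 14 for child1 and 6 + 14 = 20 for root. What I cannot figure out is how to get the equivalent parent/child information inside the bison actions.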
I do not know whether I chose a bad design for this, but I can't get past assigning names and lengths to each subtree (and leaf, which is also treated as a subtree), and I don't think I understand how recursion works well enough to identify the parents during the matching process.