
I want to create a pushdown automaton (PDA) that uses the following "alphabet" (by "alphabet" I mean a set of distinct symbol strings/keys):

+aaa
+bbb
+ccc
+ddd
+eee
-aaa
-bbb
-ccc
-ddd
-eee

The "symbols" (3-letter sequences with a + or - prefix) in this alphabet are used to make trees. Here are some example trees:

+eee-eee
+aaa+bbb-bbb-aaa
+bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb

Visualized as actual trees, they would look more like:

+eee
-eee

+aaa
  +bbb
  -bbb
-aaa

+bbb
  +aaa
    +ccc
      +eee
      -eee
    -ccc
    +ccc
    -ccc
    +ddd
      +ccc
        +eee
        -eee
      -ccc
    -ddd
  -aaa
-bbb

So given this alphabet and these example trees, the question is how you would write a generic pushdown automaton to parse these trees. The rules are:

  1. Any letter pair (open/close pair) can have any number of nested children, and it doesn't matter what letter pairs are nested as children.

How would you write a pushdown automaton in JavaScript to parse a string into an AST?

By this I mean the implementation must literally have a stack, states, and transitions; it should not be an ad-hoc parser, or even a recursive descent parser. It should be an iterative while loop stepping through transitions somehow, not using recursion.

The output should be a very simple "AST" that just looks like this (for the +aaa+bbb-bbb-aaa tree):

{
  "type": "aaa",
  "children": [
    {
      "type": "bbb",
      "children": []
    }
  ]
}

I am wondering how to build a PDA so that, in my specific case (working on a custom programming language), I can take a rather involved and complex AST I am working with and parse it into an object graph. That question is too complicated to write out on SO, and too hard to simplify, which is why I am asking here about a very basic PDA implementation in JavaScript.

I am interested to see how you keep track of the context at each branch point when doing the PDA algorithm and what the transitions look like.

Note: I've heard of / seen "2-state" PDAs occasionally mentioned here and there. It sounds like you have one state to do the computation, and one "final" state. Maybe even just a single state. If it's possible to do it sort of like this, then that would be cool. If not, don't worry about it.

If necessary, you can first write a Context Free Grammar and use that to dynamically construct a PDA, but the key question here is how the PDA is technically implemented. I think it's probably possible / straightforward to just write each transition function by hand, but I'm not entirely sure. Either way works for me.

Lance
  • Are you after the JS code for a general deterministic PDA? Or are you after the PDA definition for your grammar? Or both? The former is likely readily available (it's just a transition map and a stack). The latter is just fancier parenthesis matching, also readily available. What specifically are you after and what have you tried and where are you stuck? – Welbog May 28 '21 at 10:40
  • I am after the JS code for a general deterministic PDA, using my grammar as an example for demonstration purposes. I get that it's fancy parentheses matching but I'd like to see how it's done. I have scanned basically every resource on PDAs and I don't understand how to construct the PDA in JS, and haven't found one in JS. I am stuck at the beginning. What does the transition map and stack look like implementation-wise? – Lance May 28 '21 at 11:51
  • Any constraints on how you want to build the AST? Like immutable data structures? – Bergi May 28 '21 at 12:29
  • @Bergi No constraints, I haven't considered that. Whatever is easiest to understand I guess. – Lance May 28 '21 at 12:34
  • "*How would you write a pushdown automaton in JavaScript to parse a string into an AST?*" - why are you trying to write a PDA? Why not have a look at parsers instead? You'd need to change the theoretical formalism of the PDA if you want to output something. – Bergi May 28 '21 at 12:43
  • @Bergi I want to build a suite of _grammars_ for parsing various kinds of markup formats, programming languages, and other file formats. I have a way of _defining_ all of these grammars, a way of encoding all the patterns in a nice succinct way in a custom programming language. Now the problem is how do I build a generic parser generator sort of thing to take my grammar DSL/AST ("JSON") and parse some string into a sourcecode/markup AST? I do not want to write a custom parser for each markup/language/format, I want it to be generic from the grammars. – Lance May 28 '21 at 12:51
  • My DSL for grammars is close to [this](https://stackoverflow.com/questions/67715007/how-to-convert-nested-function-call-tree-to-state-machine). I am looking for some algorithm to convert the DSL/grammar JSON into some parser algorithm, and I thought after that question (and the XML literature) that this would be a PDA. But if not, what else is a better direction to look in, what would be a better approach (other than manually building an ad-hoc parser for each)? – Lance May 28 '21 at 12:54
  • Since we're talking, in my ideal world I have this grammar DSL, and the parser that gets generated from it (or even just a recursive descent parser that works straight from the grammar itself) should have a great user experience when it comes to error messages and parsing while you type (like if it was used in a text editor). It should be able to figure out where errors are yet know you are in the middle of typing, but at the same time be able to parse fast a final string. Maybe those two are separate algorithms, I'm not that far yet. – Lance May 28 '21 at 12:57
  • So basically you're trying to reinvent XSLT? Or a parser generator? I don't see how this question of "*How to parse a (tagged) parenthesis tree?*" does help with that. – Bergi May 28 '21 at 13:00
  • Exactly, yes. But I haven't found any XSLT code that would help learn how to do this. I want to know intimately how it works by implementing it myself. Well, I am having trouble getting started though haha, hence asking here. [I wrote a recursive descent parser](https://github.com/grammarjs/recursive-parser/blob/master/index.js) a while back, with [grammars](https://github.com/grammarjs/javascript/blob/master/index.js) (of a different style), but I would like to do better. – Lance May 28 '21 at 13:00
  • It sounds like you're basically trying to design a compiler. This is a huge undertaking. Look into high-level articles on compiler design, specifically the main steps: tokenization, parsing and creating an execution tree. – Welbog May 28 '21 at 13:03
  • @Welbog yes I have been trying to create a compiler for about 3 years now, the first couple years focusing on parsing where I ended up just hacking something together and doing ad-hoc, but learned a lot about grammar design. I want to be able to use grammars now to make it more scalable and fun and nice. Once I can parse trees (functions are also implemented as trees in the language), I can compile the functions into code (JavaScript to start, as it's way easier than x86 compilation). I am just in the stage of ad-hoc compiling, but I need to not ad-hoc parse. – Lance May 28 '21 at 13:06
  • I got nowhere asking [how to do this](https://stackoverflow.com/questions/67441018/how-to-construct-tree-pattern-matching-algorithm-in-javascript) in a more ad-hoc way (parse function trees from simple AST (like XML AST) into an object graph (like a "code" AST) using a pattern tree / grammar). So now I thought of trying to approach it differently and do it in a more formal automaton sort of way. Now you're saying I'm completely off track (again) haha. Hard to figure out what is the right approach (where not reinventing the wheel is not an option) as a non CS person. – Lance May 28 '21 at 13:11
  • So, you *can* do this with something PDA-like, with an arbitrary transition function instead of a simple map. I'll show how in my answer. – Welbog May 28 '21 at 13:39
  • Updated. Note the transition function is still complete, using cascading ifs to catch the different possibilities instead of listing them all out. – Welbog May 28 '21 at 13:49
  • You may find [Reg Braithwaite's related article](https://raganwald.com/2019/02/14/i-love-programming-and-programmers.html) interesting reading. – Scott Sauyet May 28 '21 at 19:34
  • I wonder how you imagine a PDA to yield the AST? If it could, it would be in its state, but there is a contradiction then: your grammar can produce an infinite number of trees, yet a PDA has a finite number of states. Although a PDA has a stack that can grow without limit, it can only access the top element, and the stack elements should again be elements of a finite set. – trincot May 29 '21 at 12:41

5 Answers


It is not possible to create a PDA for producing this tree data structure because:

  • a PDA has a finite number of states and stack symbols, and no output tape. Yet the number of trees that this PDA should be able to somehow represent (and make available) is infinite. If we consider the nodes of a tree in an object-oriented structure, then there are an infinite number of different nodes possible, as a node's identity is also determined by the references it holds. This is in conflict with the finite set of symbols that a PDA has. If, as an alternative, we did not go for an object-oriented output, we would gain nothing: the input already is a serialized representation of the tree.
  • Even if you added an output tape to this automaton, making it a rewrite system, you wouldn't gain much, since the input already represents the tree in a serialized format. If I understood correctly, you are not interested in serialized output (like JSON, which would be trivial as the output order is the same as the input order), but in structured output (JavaScript-like objects).

So, as others have also noted, we need to loosen the rules somehow. We can think of several (combinations of) ways to do that:

  1. Allow for an infinite number of states, or
  2. Allow for an infinite stack alphabet, or
  3. Allow the PDA to have access to more of the stack than just its top element
  4. Let the transition table refer to a particular attribute of the stack's top element, not the element as a whole -- allowing infinite possibilities for other attributes of the stacked elements, while ensuring that the values of this particular attribute belong to a finite set.
  5. Keep the tree building outside of the PDA
  6. ...Any other measures for solving the incompatibility of PDA with the requirement to produce a structured tree.

Some of these options would imply an infinite transition table, in which case it cannot really be a table, but it could be a function.

Implementation choices

1. For the generic engine

Considering your focus on tree-building, and the requirement to have "a stack, states, and transitions", I went for the following implementation choices, and sometimes simplifications:

  • I take strategy 4 from the above list
  • The generic engine will itself create the tree. So it is only generic in a limited sense. It can only validate and generate trees.
  • It will (only) take a transition table as configuration input
  • It has only two states (boolean), which indicate whether all is OK so far, or not.
  • The OK-state can only go from true to false when there is no matching transition in the transition table
  • Once the OK-state is false, the engine will not even attempt to find a transition, but will just ignore any further input.
  • As a consequence, the transition table does not include the current-state and next-state parts, as they are both implied to be true (i.e. "OK").
  • The tree will be encoded in the stack. The data property of a stack element will be the special attribute that will serve for identifying transitions. The children attribute of each stack element will serve to define the tree and is not considered part of the stack alphabet that the transition table targets.
  • The stack starts out with one element. Its children attribute will be the output of the engine. At every step this output can be consulted, since this element is never removed from the stack.
  • The input symbols cannot include the empty string, which is reserved for signalling the end of the input stream. This allows the engine to give feedback whether that is a valid moment to end the input stream.
  • The engine requires that its stack is "empty" (has just 1 entry) when the end of the input is indicated.
  • I assumed that input symbols do not contain spaces: the space is used as a delimiter in the transition lookup algorithm
  • The given transition table is converted to a (hash)map to allow for fast lookup given the current state and the data at the top of the stack. The keys in this map are concatenations of input and stack data values, and it is here that the space was chosen as delimiter. If a space is a valid character in an input symbol, then another encoding mechanism should be selected for serializing this key-pair (e.g. JSON), but I wanted to keep this simple.
  • The engine does not need to get the set of input symbols, nor the set of stack symbols as configuration input, as these will be implied from the transition table.
  • The engine's initial state is set to OK (it could be set differently by simple property assignment, but that would be useless as it would then ignore input)

2. For the specific +/- input format

The more specific part of the solution deals with the actual input format given in the question. One function will convert the set of types to a transition table, and another will tokenize the input based on the "+" and "-" characters, but without any validation (no symbol check, nor symbol length check, ...), as any problem will surface as an error anyway when the engine is called.

Any white space in the input is ignored.

Implementation

// To peek the top element of a stack:
Array.prototype.peek = function () { return this[this.length - 1] };

function createParser(transitions) {
    function createNode(data) {
        return { 
            data, 
            children: [] 
        };
    }

    function addChild(parentNode, data) {
        const childNode = createNode(data);
        parentNode.children.push(childNode);
        return childNode;
    }

    let isOK = true; // It's a 2-state (boolean) engine. Transitions implicitly only apply to OK-state
    const stack = [createNode("")]; // Stack is private, and always has Sentinel node with value ""
    // Create a map for the transitions table, for faster lookup
    // We use space as a delimiter for the key pair, assuming an input symbol does not include spaces.
    const transMap = new Map(transitions.map(({whenInput, whenStack, thenPushValue}) =>
        [whenInput + " " + whenStack, thenPushValue]
    ));
    const parser = {
        read(input) { // Returns true when parser can accept more input after this one
            // Once the engine is in an error state, it will not process any further inputs
            if (!isOK) {
                return false;
            }
            // Consider the empty string as the end-of-input marker
            if (input === "") { 
                // Even when state is OK, the stack should now also be "empty"
                isOK &&= stack.length === 1;
                return false; // Not an error condition, but indication that no more input is expected
            }
            // Transitions do not include state requirements, nor new state definitions.
            // It is assumed that a transition can only occur in an OK state, and that all 
            //    included transitions lead to an OK state.
            const pushValue = transMap.get(input + " " + stack.peek().data);
            if (pushValue === undefined) { // No matching transition in the table implies that state is not OK
                isOK = false;
            } else {
                // As this is a two-state engine, with defined transitions only between OK states,
                // each defined transition will affect the stack: so it's either a push or pop.
                // A push is identified by the (non-empty) value to be pushed. An empty string denotes a pop.
                if (pushValue) {
                    stack.push(addChild(stack.peek(), pushValue));
                } else {
                    stack.pop();
                }
            }
            
            return isOK;
        },
        isOK, // Expose the (boolean) state
        output: stack[0].children // Don't expose the stack, but just the forest encoded in it
    };
    return parser;
}

function createTransition(whenInput, whenStack, thenPushValue) {
    return {whenInput, whenStack, thenPushValue}; // First two values imply the third
}

// Functions specific for the input format in the question:

function createTransitionTable(types) {
    // Specific to the input structure (with + and - tags) given in the question
    // An empty string in the second column represents an empty stack
    return [
        // Transitions for opening tags: O(n²)
        ...["", ...types].flatMap(stack => 
            types.map(type => createTransition("+" + type, stack, type))
        ),
        // Transitions for closing tags
        ...types.map(type => createTransition("-" + type, type, ""))
    ];
}

function tokenizer(str) { // Could be a generator, but I went for a function-returning function
    str = str.replace(/\s+/g, ""); // remove white space from input string

    let current = 0; // Maintain state between `getNextToken` function calls
    
    function getNextToken() {
        const start = current++;
        while (current < str.length && str[current] !== "+" && str[current] !== "-") {
            current++;
        }
        const token = str.slice(start, current);
        
        console.log("read", token); // For debugging
        return token;
    }
    return getNextToken;
}

// Example input language, as defined in the question:
const types = ["aaa", "bbb", "ccc", "ddd", "eee"];
const transitionTable = createTransitionTable(types);
const parser = createParser(transitionTable);

// Example input for it:
const rawInput = `
+eee-eee
+aaa+bbb-bbb-aaa
+bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb`;
const getNextToken = tokenizer(rawInput);

// Process tokens
while (parser.read(getNextToken())) {}

console.log("Parser state is OK?: ", parser.isOK);
console.log("Parser output:\n");
console.log(JSON.stringify(parser.output, null, 3));
trincot
  • "Yet the number of trees that this PDA should be able to somehow represent (and make available) is infinite." That argument does not follow. PDAs *can* represent infinite grammars. A common example is `n` a's followed by `n` b's, for an arbitrary `n`. I'm not sure whether this sort of balanced multi-bracket configuration is context-free and thus representable by a PDA; my attempt, as noted, expanded to peek at the second element of the stack. But that the language is infinite doesn't demonstrate anything. – Scott Sauyet Jun 01 '21 at 12:04
  • I agree I could have expressed it better. Indeed, as there is an infinite stack, there can be infinite representations by that stack. My point is that the alphabet of those trees is infinite, since a node's identity is determined by its payload *and* the subtree it roots. – trincot Jun 01 '21 at 12:08
  • But balanced brackets with two different bracket types [is context-free](https://en.wikipedia.org/wiki/Context-free_grammar#Well-formed_nested_parentheses_and_square_brackets). By the same sort of grammar production rules, presumably we can also show that this language is context-free. I don't know how to immediately convert these production rules to the transition relation, but it seems that it should be possible. – Scott Sauyet Jun 01 '21 at 12:28
  • Do you speak of the input? I am speaking of the structured (object-oriented) output. (1) a PDA only validates (2) even if we consider the stack as an output tape, we would need an infinite number of "symbols" (i.e. objects) to get object-oriented output. Of course, if the desired output would be just a string (like JSON), then it is possible, but that is not how I understood the question. – trincot Jun 01 '21 at 12:30
  • Yes, perhaps we're talking past each other. The PDA in question, as I understood it, was meant to parse, e.g., `+aaa+bbb-bbb+eee+ccc-ccc-eee-aaa`. The goal was to use the states/transitions of a PDA representing this to *generate* an AST for some `a [b, e [c]]` structure. But maybe I've read wrong. @LancePollard, care to weigh in here? – Scott Sauyet Jun 01 '21 at 12:38
  • That is the same understanding I have, where the core problem in the question is *"write a PDA in JS to parse a string into an AST"*. We can see that a PDA does not generate anything. It parses the input and delivers a state; not an AST, not even a string. We need to bend the rules to make it "parse a string into an AST". – trincot Jun 01 '21 at 12:50
  • My solution to that was to add a side-effect to the parse, generating events that could be used to capture a tree. I guess that's #5 in your list (here in combination with #3.) If your point was simply that a PDA is designed only to report whether a sequence of symbols is part of a grammar, a boolean decision, then certainly I agree, although I took it as part of the premise of the question itself. – Scott Sauyet Jun 01 '21 at 12:59
  • @ScottSauyet you are right _"The goal was to use the states/transitions of a PDA representing this to generate an AST for some `a [b, e [c]]` structure"_, and @trincot is right too _"write a PDA in JS to parse a string into an AST"_. I'm not sure what I can add / help with. – Lance Jun 01 '21 at 21:31
  • I like this one so far (still need to study the implementation more) because it explains how the theory is used to implement this. I thought that PDAs could only _recognize_ (and not _generate_), but others said the opposite so I got confused. This one also has the 2 states I was looking for in the original question which I'm excited about. But @ScottSauyet your answer was also enlightening as well, it is closer to how I thought of the problem initially (emitting events), though I never could figure out how to do it cleanly / at all. – Lance Jun 01 '21 at 21:34
  • @trincot so is the `stack` here different than the PDA stack concept? It appears to be different, since the stack here is used for output instead of the PDA "stack", how does it relate? Why don't we need the PDA stack concept? – Lance Jun 01 '21 at 22:29
  • @LancePollard: Yes, we realized that we were basically on the same page. Those who told you that a PDA could only *recognize* items in a grammar were correct. All the working examples in answers here somehow extend the notion of PDAs to do something extra. Mine does so in one manner, trincot's in another. – Scott Sauyet Jun 01 '21 at 22:33
  • Also, what is the name of this sort of system you ended up implementing, is it close to something named in computer science which I could read more about? Or if not, what is it closest to, feature-wise? – Lance Jun 01 '21 at 22:46
  • Yes, in this solution the stack is different from the PDA stack concept, because a stack element is an object with references. Therefore the stack "alphabet" is not finite. I deviated from (i.e. violated) that PDA-requirement by stating that only a specific *property* of those objects would serve as the PDA alphabet. Indeed, I used the stack for output. A pure PDA has no output other than its (atomic) state. To me the request to *"write a PDA in JS to parse a string into AST"* is self-contradicting. A PDA does not parse **into**. It can only validate. So we all came up with creative ideas ;-) – trincot Jun 02 '21 at 06:11

I should note at the outset that I'm no computer scientist and have no real experience writing compiler code. So there may be glaring holes in the implementation or even the basic ideas. But if you want the thoughts of a work-a-day programmer who found this an interesting problem, here they are.


We can write a pda function that simply recognizes our grammar, one that we can use like this. (Here we go only from aaa to ccc, but you could easily extend it to eee or whatever.)

const {push: PUSH, pop: POP} = pda

const myParser = pda ('S', ['S'], [
//                     ^     ^
//                     |     `----------------- accepting states
//                     +----------------------- initial state
//   +----------------------------------------- current state
//   |        +-------------------------------- token
//   |        |    +--------------------------- top of stack
//   |        |    |      +-------------------- new state
//   |        |    |      |         +---------- stack action
//   V        V    V      V         V
  [ 'S',   '+aaa', '',   'A',   PUSH ('A') ],
  [ 'S',   '+bbb', '',   'B',   PUSH ('B') ],
  [ 'S',   '+ccc', '',   'C',   PUSH ('C') ],
  [ 'A',   '+aaa', '',   'A',   PUSH ('A') ],
  [ 'A',   '-aaa', 'AA', 'A',   POP        ],
  [ 'A',   '-aaa', 'BA', 'B',   POP        ],
  [ 'A',   '-aaa', 'CA', 'C',   POP        ],
  [ 'A',   '-aaa', '',   'S',   POP        ],
  [ 'A',   '+bbb', '',   'B',   PUSH ('B') ],
  [ 'A',   '+ccc', '',   'C',   PUSH ('C') ],
  [ 'B',   '+aaa', '',   'A',   PUSH ('A') ],
  [ 'B',   '+bbb', '',   'B',   PUSH ('B') ],
  [ 'B',   '-bbb', 'AB', 'A',   POP        ],
  [ 'B',   '-bbb', 'BB', 'B',   POP        ],
  [ 'B',   '-bbb', 'CB', 'C',   POP        ],
  [ 'B',   '-bbb', '',   'S',   POP        ],
  [ 'B',   '+ccc', '',   'C',   PUSH ('C') ],
  [ 'C',   '+aaa', '',   'A',   PUSH ('A') ],
  [ 'C',   '+bbb', '',   'B',   PUSH ('B') ],
  [ 'C',   '+ccc', '',   'C',   PUSH ('C') ],
  [ 'C',   '-ccc', 'AC', 'A',   POP        ],
  [ 'C',   '-ccc', 'BC', 'B',   POP        ],
  [ 'C',   '-ccc', 'CC', 'C',   POP        ],
  [ 'C',   '-ccc', '',   'S',   POP        ],
])

And we would use it to test a series of tokens, like this:

myParser (['+aaa', '-aaa']) //=> true
myParser (['+aaa', '-bbb']) //=> false
myParser (['+aaa', '+bbb', '+ccc', '-ccc', '+aaa', '-aaa', '-bbb', '-aaa']) //=> true

This does not exactly match the mathematical definition of a PDA. We don't have a symbol to delineate the beginning of the stack, and we test the top two values of the stack, not just the top one. But it's reasonably close.

However, this just reports whether a string of tokens is in the grammar. You want something more than that. You need to use this to build a syntax tree. It's very difficult to see how to do this in the abstract. But it's easy enough to generate a sequence of events from that parsing that you could use. One approach would be just to capture the new node value at every push to the stack and capture every pop from the stack.

With that, we might tie the push and pop events to something like this:

const forestBuilder = () => {  // multiple-rooted, so a forest not a tree
  const top = (xs) => xs [ xs .length - 1 ]
  const forest = {children: []}
  let stack = [forest]
  return {
    push: (name) => {
      const node = {name: name, children: []}
      top (stack) .children .push (node)
      stack.push (node)
     },
    pop: () => stack.pop(),
    end: () => forest.children
  }
}

const {push, pop, end} = forestBuilder ()


push ('aaa')
push ('bbb')
pop ()
push ('ccc')
push ('aaa')
pop()
pop()
pop()
push ('bbb')
push ('aaa')
end()

which would yield something like this:

[
    {
        "name": "aaa",
        "children": [
            {
                "name": "bbb",
                "children": []
            },
            {
                "name": "ccc",
                "children": [
                    {
                        "name": "aaa",
                        "children": []
                    }
                ]
            }
        ]
    },
    {
        "name": "bbb",
        "children": [
            {
                "name": "aaa",
                "children": []
            }
        ]
    }
]

So if we supply our pda function with some event listeners for the pushes and pops (also for completion and errors), we might be able to build your tree from a series of tokens.

Here is one attempt to do this:

console .clear ()

const pda = (() => {
  const PUSH = Symbol(), POP = Symbol()
  const id = (x) => x
  return Object .assign (
    (start, accepting, transitions) => 
      (tokens = [], onPush = id, onPop = id, onComplete = id, onError = () => false) => {
        let stack = []
        let state = start
        for (let token of tokens) {
          const transition = transitions .find (([st, tk, top]) => 
            st == state && 
            tk == token &&
            (top .length == 0 || stack .slice (-top.length) .join ('') == top)
          )
          if (!transition) {
            return onError (token, stack)
          }
          const [, , , nst, action] = transition
          state = nst
          action (stack)
          if (action [PUSH]) {onPush (token)}
          if (action [POP]) {onPop ()}
        }
      return onComplete (!!accepting .includes (state))
    },{
      push: (token) => Object.assign((stack) => stack .push (token), {[PUSH]: true}),
      pop: Object.assign ((stack) => stack .pop (), {[POP]: true}),
    }
  )
})()

const {push: PUSH, pop: POP} = pda

const myParser = pda ('S', ['S'], [
//                     ^     ^
//                     |     `----------------- accepting states
//                     +----------------------- initial state
//   +----------------------------------------- current state
//   |        +-------------------------------- token
//   |        |    +--------------------------- top of stack
//   |        |    |      +-------------------- new state
//   |        |    |      |         +---------- stack action
//   V        V    V      V         V
  [ 'S',   '+aaa', '',   'A',   PUSH ('A') ],
  [ 'S',   '+bbb', '',   'B',   PUSH ('B') ],
  [ 'S',   '+ccc', '',   'C',   PUSH ('C') ],
  [ 'A',   '+aaa', '',   'A',   PUSH ('A') ],
  [ 'A',   '-aaa', 'AA', 'A',   POP        ],
  [ 'A',   '-aaa', 'BA', 'B',   POP        ],
  [ 'A',   '-aaa', 'CA', 'C',   POP        ],
  [ 'A',   '-aaa', '',   'S',   POP        ],
  [ 'A',   '+bbb', '',   'B',   PUSH ('B') ],
  [ 'A',   '+ccc', '',   'C',   PUSH ('C') ],
  [ 'B',   '+aaa', '',   'A',   PUSH ('A') ],
  [ 'B',   '+bbb', '',   'B',   PUSH ('B') ],
  [ 'B',   '-bbb', 'AB', 'A',   POP        ],
  [ 'B',   '-bbb', 'BB', 'B',   POP        ],
  [ 'B',   '-bbb', 'CB', 'C',   POP        ],
  [ 'B',   '-bbb', '',   'S',   POP        ],
  [ 'B',   '+ccc', '',   'C',   PUSH ('C') ],
  [ 'C',   '+aaa', '',   'A',   PUSH ('A') ],
  [ 'C',   '+bbb', '',   'B',   PUSH ('B') ],
  [ 'C',   '+ccc', '',   'C',   PUSH ('C') ],
  [ 'C',   '-ccc', 'AC', 'A',   POP        ],
  [ 'C',   '-ccc', 'BC', 'B',   POP        ],
  [ 'C',   '-ccc', 'CC', 'C',   POP        ],
  [ 'C',   '-ccc', '',   'S',   POP        ],
])


const forestBuilder = () => {
  const top = (xs) => xs [ xs .length - 1 ]
  const forest = {children: []}
  let stack = [forest]
  return {
    push: (name) => {
      const node = {name: name .slice (1), children: []}
      top (stack) .children .push (node)
      stack.push (node)
     },
    pop: () => stack.pop(),
    end: () => forest.children
  }
}

const {push, pop, end} = forestBuilder ()

console .log (myParser (
  ["+ccc", "-ccc", "+aaa", "+bbb", "-bbb", "-aaa", "+bbb", "+aaa", "+ccc", "+bbb", "-bbb", "-ccc", "+ccc", "-ccc", "+aaa", "+ccc", "+ccc", "-ccc", "-ccc", "-aaa", "-aaa", "-bbb"],
  push, 
  pop, 
  (accepted) => accepted ? end () : 'Error: ill-formed',
  (token, stack) => `Error: token = '${token}', stack = ${JSON.stringify(stack)}`
))

There are lots of ways this could go. Perhaps the opening event should contain not only the token but also the value pushed on the stack. There might be a good way to generate that table of transitions from a more declarative syntax. We might want a different version of the stack action column, one that takes strings instead of functions. And so on. But it still might be a decent start.
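
For instance, the transition table above could plausibly be generated from a plain list of type names. Here is a sketch of that idea; the helper name makeTransitions and the first-letter-uppercase stack symbols are my own conventions, and PUSH and POP are the helpers destructured from pda above:

const makeTransitions = (types) => {
  const sym = (t) => t[0].toUpperCase()          // 'aaa' -> 'A', used as both state and stack symbol
  const states = types.map(sym)
  return types.flatMap((t) => {
    const T = sym(t)
    return [
      // opening tags are allowed in every state and always push
      ...['S', ...states].map((st) => [st, '+' + t, '', T, PUSH(T)]),
      // closing tags pop, returning to the state of whatever sits underneath
      ...states.map((under) => [T, '-' + t, under + T, under, POP]),
      [T, '-' + t, '', 'S', POP],                // nothing underneath: back to the start state
    ]
  })
}

const myGeneratedParser = pda('S', ['S'], makeTransitions(['aaa', 'bbb', 'ccc']))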

Scott Sauyet
  • For not being a computer scientist (I also have no formal training in computer science), this seems extremely close to the mathematical model and uses a lot of interesting technical concepts. It's a tough choice, I wish I could mark more than one answer as correct haha. – Lance Jun 01 '21 at 22:45
  • This was built with reference to the Wikipedia page. If I'd wanted to spend more time, I would have followed [some steps](https://www.javatpoint.com/automata-cfg-to-pda-conversion) to convert a simple grammar into a PDA. But we would still have needed some mechanism to extend from reporting whether a sentence is grammatical to building an output structure from it. – Scott Sauyet Jun 02 '21 at 01:22

In pseudocode, a DPDA can be implemented like this:

transitions <- map from (state, char, char) to (state, char[0-2])
stack <- stack initialized with a sentinel value 'Z'
input <- string of tokens to parse
currentState <- initial state

for each inputCharacter in input: 
  stackCharacter <- stack.pop()
  currentState, charsToPush <- transitions[(currentState, inputCharacter, stackCharacter)]
  if currentState is not valid:
    return false
  for charToPush in charsToPush.reverse():
    stack.push(charToPush)

return (currentState in acceptingStates) and (stack contains only 'Z')

A PDA for a language such as parenthesis matching is specified like this:

transitions <- {
  (0, '(', 'Z') -> (0, "(Z"),
  (0, '(', '(') -> (0, "(("),
  (0, ')', 'Z') -> nope,
  (0, ')', '(') -> (0, ""),
}
acceptingStates <- [0]
initialState <- 0
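
For concreteness, here is a rough JavaScript rendering of that pseudocode, using the parenthesis-matching transitions just given. The space-separated string keys for the Map and the function name are my own choices for this sketch, not part of the formal definition:

const transitions = new Map([
  ['0 ( Z', [0, '(Z']],   // push: put '(' on top of the sentinel
  ['0 ( (', [0, '((']],   // push: another '(' on top of the existing one
  ['0 ) (', [0, '']],     // pop: a ')' cancels the '(' on top
  // '0 ) Z' is deliberately absent: no transition means "nope"
]);
const acceptingStates = [0];

function runDPDA(input, initialState = 0) {
  const stack = ['Z'];                         // sentinel value at the bottom
  let currentState = initialState;
  for (const inputCharacter of input) {
    const stackCharacter = stack.pop();
    const t = transitions.get(`${currentState} ${inputCharacter} ${stackCharacter}`);
    if (t === undefined) return false;         // no valid transition: reject
    const [nextState, charsToPush] = t;
    currentState = nextState;
    for (const c of [...charsToPush].reverse()) stack.push(c);
  }
  return acceptingStates.includes(currentState) && stack.length === 1 && stack[0] === 'Z';
}

console.log(runDPDA('(())'));  // true
console.log(runDPDA('(()'));   // false
console.log(runDPDA('())'));   // false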

Note that the above is deterministic. General PDAs are nondeterministic, and not all context-free languages can be decided by DPDAs. Yours can be, but care has to be taken in how you specify the transitions.

To make it more general (nondeterministic) the transition map needs to map to a list of (state, char[]) tuples instead of just one; each step in the loop needs to consider all matching tuples instead of just one.
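
Sketched in JavaScript, one nondeterministic step might track a set of configurations (state plus stack), with the transition map now yielding an array of options. This is only an illustration of the idea, not a full implementation:

// One step of a nondeterministic PDA: expand every current configuration.
// `transitions` is assumed to map "state input stackTop" keys to arrays of [nextState, charsToPush].
function stepAll(configs, inputCharacter, transitions) {
  const next = [];
  for (const { state, stack } of configs) {
    const top = stack[stack.length - 1];
    const options = transitions.get(`${state} ${inputCharacter} ${top}`) || [];
    for (const [nextState, charsToPush] of options) {
      const newStack = stack.slice(0, -1);                          // pop the top symbol...
      for (const c of [...charsToPush].reverse()) newStack.push(c); // ...then push its replacements
      next.push({ state: nextState, stack: newStack });
    }
  }
  return next; // every configuration reachable after consuming inputCharacter
}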

Does that help?


For your grammar specifically, your tokens are these "+aaa", "-aaa" things. Your alphabet is finite but very large, so you don't want to have to specify everything in your transition map. So here's a decision you have to make: do you want a pure PDA (fully specified map) or do you want to code a PDA-like thing that isn't quite a PDA to avoid that?

If the latter, you want to add a check in your loop that matches the identity of the + and - tokens. You might as well just write your own code at this point, because creating a generic parser that can handle everything is a ton of work. It's easier to just write a parser for your specific needs.

And this is why people have created libraries like flap.js: this stuff is complicated.


To elaborate on a comment I made, if you make the transition function arbitrary instead of a map, you can express your language that way.

transition <- arbitrary function taking input (state, token, token) and output (state, token[0-2])
stack <- stack initialized with a sentinel token 'Z'
input <- string of tokens to parse
currentState <- initial state

for each inputToken in input: 
  stackToken <- stack.pop()
  currentState, tokensToPush <- transition(currentState, inputToken, stackToken)
  if currentState is not valid:
    return false
  for tokenToPush in tokensToPush.reverse():
    stack.push(tokenToPush)

return (currentState in acceptingStates) and (stack contains only 'Z')

Define transition like this:

function transition(state, input, stack):
  if (input.type == '+')
    return (0, [input, stack])
  else if (input.id == stack.id)
    return (0, [])
  else
    return nope

Where your tokens have a "type" (+ or -) and an "id" ("aaa", "bbb", etc).
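
In JavaScript, that transition function and the surrounding loop might look roughly like this; the token shape and the names here are my own for the sketch:

// Tokens are objects like {type: '+', id: 'aaa'}; the stack holds tokens, with a sentinel at the bottom.
const SENTINEL = { type: 'Z', id: 'Z' };

function transition(state, input, stackTop) {
  if (input.type === '+') return [0, [input, stackTop]];              // opener: re-push what was there, then the opener
  if (input.type === '-' && input.id === stackTop.id) return [0, []]; // matching closer: net pop
  return null;                                                        // "nope"
}

function run(tokens) {
  const stack = [SENTINEL];
  let state = 0;
  for (const tok of tokens) {
    const top = stack.pop();
    const result = transition(state, tok, top);
    if (result === null) return false;
    const [nextState, toPush] = result;
    state = nextState;
    for (const t of [...toPush].reverse()) stack.push(t);
  }
  return state === 0 && stack.length === 1; // accepting state and only the sentinel left
}

console.log(run([{type: '+', id: 'aaa'}, {type: '+', id: 'bbb'},
                 {type: '-', id: 'bbb'}, {type: '-', id: 'aaa'}])); // true
console.log(run([{type: '+', id: 'aaa'}, {type: '-', id: 'bbb'}])); // false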

You have to be careful with an arbitrary transition function, as the solution is now less constrained and more likely to accidentally not be context-free.

Welbog
  • Where in [flap.js](https://github.com/flapjs) is the code implementing the PDA? It seems to be a mess. – Lance May 28 '21 at 12:28
  • This is too close to the academic definition of a PDA, I will have to read between the lines _a lot_ to try and understand how to implement this in JS and apply it to situations beyond your example. I was hoping to see it in a less academic way (a real implementation of a simple example) so I can get beyond just the theory. I appreciate the start though, this is starting to offer a glimpse into how it might be done :) – Lance May 28 '21 at 12:31
  • If you can point to a generic implementation of a PDA then, ideally in JS or C, that would be amazing, that might help, I couldn't find any in a few hours of search, but then again I don't really know what a "good" example looks like haha. – Lance May 28 '21 at 12:33
  • It's a mess because the academic version is intentionally simple. Making something more useful gets extremely complex extremely quickly. I'm not sure what to say other than that the above should be very easy to implement in any language. I'm not sure I understand the specific difficulty you're facing? – Welbog May 28 '21 at 12:35
  • In practice I'm not sure how many libraries use PDAs, if any. Building parse trees is generally done from grammars, not PDA definitions, and uses a different style of parsing (such as LL(k) or LR(k)). – Welbog May 28 '21 at 12:39
  • I meant flap.js is a mess. A few questions, but I have a lot. (1) how do you know what to push on the stack? (2) Is your example complete for parsing nested parentheses, it looks incomplete. (3) Would it be clearer to ask for an algorithm that converts a CFG to PDA instead? The `transitions[...]` has 3 keys, so in JS I am guessing this would be a nested hash (but what if the # of input symbols is way larger, like the size of unicode, a simple hashmap won't seem ideal I don't think). – Lance May 28 '21 at 12:41
  • In your opinion, what would be the best way to parse a simplified XML document generically from a grammar for maximum performance and easy debuggability? LL(k), LR(k), others? – Lance May 28 '21 at 12:44
  • I am still learning but the [XML parsing literature](https://www.sciencedirect.com/science/article/pii/S0022000006001085) suggests pushdown automata and/or tree automata ([I still don't know the difference](https://www.quora.com/unanswered/What-are-the-differences-between-tree-automata-and-pushdown-automata-Pushdown-automata-match-context-free-languages-and-XML-is-one-of-those-and-XML-is-a-tree-So-how-are-tree-automata-different-than-pushdown-automata)). – Lance May 28 '21 at 12:47
  • (1) What you push on the stack is based on the definition of the PDA. Specifically, the transition function. (2) Yes, it is complete. (3) You are correct that a hashmap would not be ideal for a very wide alphabet. This is why we tend to use LL(k) or LR(k) parsing instead, with tokenization. For parsing, I would generally suggest LL or LR parsing, not a PDA. They are the industry standard, and have the most tools already made for them. I'm unfamiliar with tree automata, sorry. – Welbog May 28 '21 at 13:01
  • @LancePollard: "This is too close to the academic definition of a PDA". But also "The implementation must [...] closely follow the mathematical definition of a PDA." Hmmm. – Scott Sauyet May 28 '21 at 13:58
  • @ScottSauyet yeah I caught that too, by too close I mean it's literally the same as the mathematical examples, whereas by following the mathematical models I was coming from the perspective of I don't want a recursive descent parser or ad-hoc parser. So something that uses states and transitions is what I meant. – Lance May 28 '21 at 18:40

The solution below works in three stages:

  1. The input string is tokenized into sequences and prefixes
  2. The tokenized result, as an array, is converted to an AST via transition rules
  3. The AST is converted to the desired JSON output.

First, defining the tokens, transitions, and tokenization function:

class Token{
    constructor(vals, t_type){
       this.vals = vals;
       this.t_type = t_type
    }
    type_eq(t_type){
       return this.t_type === t_type
    }
    static val_eq(t1, t2){
       //check if two tokens objects t1 and t2 represent a +tag and a -tag
       return t1.t_type === 'o_tag' && t2.t_type === 'c_tag' && t1.vals[1].vals[0] === t2.vals[1].vals[0]
    }
}
var lexer = [[/^\-/, 'minus'], [/^\+/, 'plus'], [/^[a-z]+/, 'label']];
var transitions = [
    {'pattern':['plus', 'label'], 'result':'o_tag', 't_eq':false},
    {'pattern':['minus', 'label'], 'result':'c_tag', 't_eq':false},
    {'pattern':['o_tag', 'c_tag'], 'result':'block', 't_eq':true},
    {'pattern':['o_tag', 'block', 'c_tag'], 'result':'block', 't_eq':true},
    {'pattern':['block', 'block'], 'result':'block', 't_eq':false}
]
function* tokenize(s){
    //tokenize an input string `s`
    //@yield Token object
    while (s.length){
       for (var [p, t] of lexer){
          var m = s.match(p)
          if (m){
             yield (new Token([m[0]], t))
             s = s.substring(m[0].length)
             break
          }
       }
    }
}

Next, defining a function that takes in the tokenized string and runs an iterative shift-reduce on a stack to build the AST from the bottom up:

function pattern_match(stack, pattern){
    //takes in the stack from `shift_reduce` and attempts to match `pattern` from the back
    if (pattern.length > stack.length){
       return false
    }
    return Array.from(Array(pattern.length).keys()).every(x => stack[stack.length-1-x].type_eq(pattern[pattern.length - 1-x]))
}
function shift_reduce(tokens){
    //consumes `tokens` until empty and returns the resulting tree if in a valid state
    var stack = []
    while (true){
        //per your comment, the line below displays the contents of the stack at each iteration
        console.log(stack.map(x => x.t_type))
        if (!stack.length){
           //stack is empty, push a token on to it
           stack.push(tokens.shift())
        }
        var f = false;
        for (var {pattern:p, result:r, t_eq:tq} of transitions){
            //try to match patterns from `transitions`
            if (pattern_match(stack, p)){
                var new_vals = p.map(_ => stack.pop()).reverse();
                if (!tq || Token.val_eq(new_vals[0], new_vals[new_vals.length-1])){
                    //match found
                    f = true
                    stack.push((new Token(new_vals, r)))
                    break
                }
                else{
                    while (new_vals.length){
                       stack.push(new_vals.shift())
                    }
                }
            }
        }
        if (!f){
           if (!tokens.length){
              //no more tokens to consume, return root of the token tree.                       
              if (stack.length > 1){ 
                 //stack was not reduced to a single token, thus an invalid state
                 throw new Error('invalid state')
              }
              return stack[0]
            }
            //no match found, push another token from `tokens` onto the stack
            stack.push(tokens.shift())
        }
    }
}

Lastly, a function to convert the AST to JSON:

function* to_json(tree){
   if (tree.vals.every(x => x.t_type === 'block')){
      for (var i of tree.vals){
         yield* to_json(i)
      }
   }
   else{
       yield {'type':tree.vals[0].vals[1].vals[0], ...(tree.vals.length === 2 ? {} : {'children':[...to_json(tree.vals[1])]})}
   }
}

Putting it all together:

function to_tree(s){
   var tokens = [...tokenize(s)] //get the tokenized string
   var tree = shift_reduce(tokens) //build the AST from the tokens
   var json_tree = [...to_json(tree)] //convert AST to JSON
   return json_tree
}
console.log(to_tree('+eee-eee'))
console.log(to_tree('+aaa+bbb-bbb-aaa'))
console.log(to_tree('+bbb+aaa+ccc+eee-eee-ccc+ccc-ccc+ddd+ccc+eee-eee-ccc-ddd-aaa-bbb'))

Output:

[
    {
        "type": "eee"
    }
]
[
    {
        "type": "aaa",
        "children": [
            {
                "type": "bbb"
            }
        ]
    }
]
[
    {
        "type": "bbb",
        "children": [
            {
                "type": "aaa",
                "children": [
                    {
                        "type": "ccc",
                        "children": [
                            {
                                "type": "eee"
                            }
                        ]
                    },
                    {
                        "type": "ccc"
                    },
                    {
                        "type": "ddd",
                        "children": [
                            {
                                "type": "ccc",
                                "children": [
                                    {
                                        "type": "eee"
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
]
Ajax1234
  • This is closer to what I was imagining :) I think. _"Next, defining a function that takes in the tokenized string and runs an iterative shift-reduce on a stack to build the AST from the bottom up"_ Can you explain that a little more, what does the stack look like in the steps of the algorithm, and how does it generally work. I am looking at the code but knowing why you made certain decisions would help. Also just to double check, is this considered a PDA? – Lance May 28 '21 at 18:46
  • Also this line is quite dense, I need to unpack it some more. `return Array.from(Array(pattern.length).keys()).every(x => stack[stack.length-1-x].type_eq(pattern[pattern.length - 1-x]))` Can you explain that one? – Lance May 28 '21 at 18:48
  • @LancePollard That line determines whether or not a given pattern (an array of tokens) matches a subarray of tokens in the stack starting from the end of the stack. For instance, if the pattern is `['o_tag', 'c_tag']` (which reduces to `block`), and if the current state of `stack` is `['x', 'y', 'z', 'o_tag', 'c_tag']`, a match exists for the pattern, so `pattern_match` will return `true`. Then, in `shift_reduce`, `'o_tag', 'c_tag'` (the matched contents of the stack with the pattern) get popped off stack and are replaced with `block`: `['x', 'y', 'z', 'block']`. – Ajax1234 May 28 '21 at 19:00
  • @LancePollard Regarding your first question, my solution is a *very* simple implementation of an [LR parser](https://en.wikipedia.org/wiki/LR_parser), which at its core uses a stack to store consumed input. In short, an LR parser is derived from a PDA, as it utilizes the concept of a stack (the pushdown store) along with transitions that are based on the stack itself and not just individual input (the tokens produced when `shift`ing from `tokens` at each iteration of the `while` loop in `shift_reduce`) – Ajax1234 May 28 '21 at 19:26
  • @LancePollard I added a `console.log` call in `shift_reduce` that displays the current state of the stack at each iteration of the while. When you run the code, you will see the array with all of the token types that are currently being stored. – Ajax1234 May 28 '21 at 19:34

I agree with @trincot's answer (except for his claim that it isn't a PDA).

I'm not sure about the complicated pattern, but the simple one you have is nearly trivial to build a machine for. It is described by a DCFG (deterministic context-free grammar), i.e. it is the intersection of a regular expression and a Dyck (parenthesis matching) machine. All DCFGs encode a PDA, hence my disagreement where he says it isn't a PDA.

The regular expression part arises because your "tokens" (the parentheses) are more than a single character long, so you need to write a regular expression that turns those character sequences into single-symbol tokens: +aaa -> one token, e.g. '(', -aaa -> another token ')', +bbb -> still another '[', ... Note the characters I picked for the tokens aren't arbitrary (although they could be); they were chosen to help you visualize this as parenthesis matching. Note how they pair up.

Your list of tokens will be finite (your strings, while unbounded, are still finite). And there will be two (or three) types of tokens: left parens (brackets, etc.), right parens, and things that are neither (i.e. that don't need to match). On a left paren, you push something onto the stack. On a right paren, you pop the stack. On a token that is neither, you either ignore the stack or push and pop both; both models work.

The FSM that runs the machine needs one state for each pair. On a push you enter that state and that tells you what kind of token you need to see to pop it. If you see a different popping token, you have an error.

Now, as long as your token types are easily divided into these three kinds, the problem is trivial. If they aren't, for example if you are looking for palindromes without a mid-point token (i.e. some token can be both a left paren and a right paren and you cannot tell from left context which it is), the problem becomes non-deterministic and you will need to implement a GLR-type parser that keeps a parse forest of alternatives that are still candidates (and, if the input is ambiguous, ends up with more than one possible tree).

However, I think if you are trying to parse ASTs, you won't have that issue. You really have a very simplified version of SLR (the most basic LR parsing algorithm) at the parens level. And the conversion of sequences to regular expressions is also likely to be trivial, because they will just be a set of fixed strings.
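
To make that concrete, here is a minimal sketch of the tokenize-then-match idea described above. The stack plays the role of the pushdown store, and its top symbol effectively encodes the "state for each pair"; the function name and token handling are mine, not prescribed by the answer:

function recognize(input, types /* e.g. ['aaa', 'bbb', 'ccc'] */) {
  const tokens = input.match(/[+-][a-z]+/g) || [];  // the "regular expression" step: fixed strings -> single tokens
  const stack = [];
  for (const tok of tokens) {
    const kind = tok[0];                            // '+' acts as a left paren, '-' as a right paren
    const name = tok.slice(1);
    if (!types.includes(name)) return false;        // not in the alphabet
    if (kind === '+') {
      stack.push(name);                             // left paren: push, "entering the state" for this pair
    } else if (stack.pop() !== name) {
      return false;                                 // a right paren must close the pair we are currently in
    }
  }
  return stack.length === 0;                        // accept only if everything was matched
}

console.log(recognize('+aaa+bbb-bbb-aaa', ['aaa', 'bbb']));  // true
console.log(recognize('+aaa-bbb', ['aaa', 'bbb']));          // false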

intel_chris