I am trying to build a compiler in JavaScript, and so far I've managed to build a lexer that creates tokens from the input:

= Test Input (with optional semicolon):

data myVariable = 4 
data myVariable2 = "myName";
task Eat {
    receives (whatToEat : String, howMuchTime : Float)
    print whatToEat
    returns (Nothing : Void)
}

= Actual Lexer Result (From Console - The Array Of Tokens):

[0: {content: "data", denominal: "keyword"}
 1: {content: "myVariable", denominal: "identifier"}
 2: {content: "=", denominal: "operator"}
 3: {content: "4", denominal: "number"}

 4: {content: "data", denominal: "keyword"}
 5: {content: "myVariable2", denominal: "identifier"}
 6: {content: "=", denominal: "operator"}
 7: {content: ""myName"", denominal: "string"}
 8: {content: ";", denominal: "punctuator"}

 9: {content: "task", denominal: "keyword"}
 10: {content: "Eat", denominal: "identifier"}
 11: {content: "{", denominal: "punctuator"}
 12: {content: "receives", denominal: "identifier"}
 13: {content: "(", denominal: "punctuator"}
 14: {content: "whatToEat", denominal: "identifier"}
 15: {content: ":", denominal: "punctuator"}
 16: {content: "String", denominal: "identifier"}
 17: {content: ",", denominal: "punctuator"}
 18: {content: "howMuchTime", denominal: "identifier"}
 19: {content: ":", denominal: "punctuator"}
 20: {content: "Float", denominal: "identifier"}
 21: {content: ")", denominal: "punctuator"}
 22: {content: "print", denominal: "keyword"}
 23: {content: "whatToEat", denominal: "identifier"}
 24: {content: "returns", denominal: "keyword"}
 25: {content: "(", denominal: "punctuator"}
 26: {content: "Nothing", denominal: "identifier"}
 27: {content: ":", denominal: "punctuator"}
 28: {content: "Void", denominal: "identifier"}
 29: {content: ")", denominal: "punctuator"}
 30: {content: "}", denominal: "punctuator"}]

The lexer is working fine (data and task are the keywords for variable and function declarations), but I would like to create something regex-like that captures a function declaration, a variable declaration, etc., using only this array of token objects as input. If the input were text, I would have captured the function declaration with the following regex:

task\s+[a-zA-Z][a-zA-Z0-9]*\s*\{\s*(1)*\s*\}

(1) being the regex for an instruction block, including keyword instructions such as receives, etc.

Is there a way to match variable / function declarations, in this case starting from an index that changes during a for loop?
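One way to get regex-like matching over tokens instead of text is to describe a pattern as a list of per-token predicates and test it from a given start index. The sketch below is only an illustration under that assumption; the helper names `is` and `matchSequence` are hypothetical and not part of any library. It handles fixed sequences only, so it cannot express repetition or nesting on its own.

```javascript
// Hypothetical helper: builds a predicate that checks one token's
// category and, optionally, its exact content.
const is = (denominal, content) => (tok) =>
  tok !== undefined &&
  tok.denominal === denominal &&
  (content === undefined || tok.content === content);

// Try to match `pattern` against `tokens` beginning at `start`.
// Returns the index just past the match, or -1 if it does not match.
function matchSequence(tokens, start, pattern) {
  for (let k = 0; k < pattern.length; k++) {
    if (!pattern[k](tokens[start + k])) return -1;
  }
  return start + pattern.length;
}

// A shortened token list in the same shape as the lexer output.
const tokens = [
  { content: "data", denominal: "keyword" },
  { content: "myVariable", denominal: "identifier" },
  { content: "=", denominal: "operator" },
  { content: "4", denominal: "number" },
  { content: "task", denominal: "keyword" },
  { content: "Eat", denominal: "identifier" },
  { content: "{", denominal: "punctuator" },
];

// Token-level equivalent of "task <identifier> {".
const taskHeader = [is("keyword", "task"), is("identifier"), is("punctuator", "{")];

console.log(matchSequence(tokens, 4, taskHeader)); // 7
console.log(matchSequence(tokens, 0, taskHeader)); // -1
```

The returned index is what lets an outer loop jump past a matched construct instead of advancing one token at a time.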

For example:

= I've passed through my token list using a for loop and at index 9 I've found this object for the first time:

9: {content: "task", denominal: "keyword"}

= Now, I want to start searching for a function declaration from this object. This implies:

1) - whether the function declaration is well-formed (parentheses, etc.)

2) - how many objects the function spans - for example, from index 9 to index 30, all these objects form a function called 'Eat', which has 3 instruction blocks:

  • 1 special receives instruction block, mandatory at the start of the function (even if empty), containing arguments in the correct format [variableName : variableType]

  • 1 special print instruction block with its parameters given correctly

  • 1 special returns instruction block, mandatory at the end of the function (if empty, returning Nothing : Void), containing arguments in the correct format [variableName : variableType]

3) - where to stop, so that I know the function definition is over and I can continue searching from the final index + 1 (31 in this case) for other things (e.g. variable declarations, EOF, etc.)
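The three points above (validity, extent, and where to resume) can be sketched with one function per construct that takes a start index and returns both a node and the index just past it. This is a minimal sketch, not a definitive implementation. All function names are hypothetical, `toTokens` is only a stand-in for the real lexer, and the sketch assumes print takes a single identifier.

```javascript
// Checks that token i has the given content; returns the next index.
function expect(tokens, i, content) {
  if (!tokens[i] || tokens[i].content !== content) {
    const found = tokens[i] ? tokens[i].content : "end of input";
    throw new SyntaxError(`Expected "${content}" at token ${i}, found "${found}"`);
  }
  return i + 1;
}

function expectIdentifier(tokens, i) {
  if (!tokens[i] || tokens[i].denominal !== "identifier") {
    throw new SyntaxError(`Expected an identifier at token ${i}`);
  }
  return tokens[i].content;
}

// Parses "( name : Type , name : Type ... )" starting at index i.
function parseTypedList(tokens, i) {
  const args = [];
  i = expect(tokens, i, "(");
  while (tokens[i] && tokens[i].content !== ")") {
    const argument_name = expectIdentifier(tokens, i);
    i = expect(tokens, i + 1, ":");
    const argument_type = expectIdentifier(tokens, i);
    args.push({ argument_name, argument_type });
    i += 1;
    if (tokens[i] && tokens[i].content === ",") i += 1;
  }
  return { args, next: expect(tokens, i, ")") };
}

// Parses "task Name { receives (...) print x ... returns (...) }".
function parseTask(tokens, start) {
  let i = expect(tokens, start, "task");
  const function_name = expectIdentifier(tokens, i);
  i = expect(tokens, i + 1, "{");
  const body = [];

  const received = parseTypedList(tokens, expect(tokens, i, "receives"));
  body.push({ instruction: "receives_instruction", arguments: received.args });
  i = received.next;

  // Assumption: zero or more "print <identifier>" instructions.
  while (tokens[i] && tokens[i].content === "print") {
    const argument_name = expectIdentifier(tokens, i + 1);
    body.push({ instruction: "print_instruction", arguments: [{ argument_name }] });
    i += 2;
  }

  const returned = parseTypedList(tokens, expect(tokens, i, "returns"));
  body.push({ instruction: "returns_instruction", arguments: returned.args });
  i = expect(tokens, returned.next, "}");

  return {
    node: { instruction: "function_declaration", function_name, body_instructions: body },
    next: i, // index where the caller should resume
  };
}

// Stand-in for the real lexer, only to build sample tokens (hypothetical).
const KEYWORDS = new Set(["data", "task", "print", "returns"]);
const toTokens = (src) =>
  src.split(/\s+/).filter(Boolean).map((content) => ({
    content,
    denominal: KEYWORDS.has(content) ? "keyword"
      : /^[A-Za-z]/.test(content) ? "identifier" : "punctuator",
  }));

const sample = toTokens(
  "task Eat { receives ( whatToEat : String , howMuchTime : Float ) " +
  "print whatToEat returns ( Nothing : Void ) }"
);
const result = parseTask(sample, 0);
console.log(result.node.function_name); // Eat
console.log(result.next); // 22
```

A thrown `SyntaxError` answers point 1, the difference between `next` and the start index answers point 2, and `next` itself answers point 3. Note that this sketch tolerates a trailing comma inside the parentheses, which may or may not be desirable.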

If you could kindly tell me the method(s) I can use to establish the existence of a specific instruction block and build the function description from the example above, it would be awesome!

The ideal result (for this problem) would be an array like this:

Instruction Object:

[0: {
     "instruction": "variable_declaration",
     "variable_name": "myVariable",
     "variable_value": "4",
     "variable_type": "Integer"
    }
 1: {
     "instruction": "variable_declaration",
     "variable_name": "myVariable2",
     "variable_value": ""myName"",
     "variable_type": "String"
    }
 2: {
     "instruction": "function_declaration",
     "function_name": "Eat",
     "body_instructions": [0: {
                               "instruction": "receives_instruction",
                               "arguments": [0: {
                                                 "argument_name": "whatToEat",
                                                 "argument_type": "String"
                                                }
                                             1: {
                                                 "argument_name": "howMuchTime",
                                                 "argument_type": "Float"
                                                }]
                              }
                           1: {
                               "instruction": "print_instruction",
                               "arguments": [0: {
                                                 "argument_name": "myVariable",
                                                 "argument_value": "4",
                                                 "argument_type": "Integer"
                                                }]
                              }
                           2: {
                               "instruction": "returns_instruction",
                               "arguments": [0: {
                                                 "argument_name": "Nothing",
                                                 "argument_value": "",
                                                 "argument_type": "Void"
                                                }]
                              }]
    }] // EOF object optional
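For the variable declarations in this target structure, a rough sketch of the top-level loop could look like the following. The names `inferType`, `parseData`, and `parseProgram` are hypothetical, and the rule that a number containing a "." is a Float is an assumption, since the question only shows an Integer example. A task branch would be added to the loop in the same way.

```javascript
// Assumed mapping from lexer categories to the type names used above.
function inferType(token) {
  if (token.denominal === "string") return "String";
  if (token.denominal === "number") {
    return token.content.includes(".") ? "Float" : "Integer";
  }
  throw new SyntaxError(`Cannot infer a type for "${token.content}"`);
}

// Parses "data name = value" with an optional trailing ";".
function parseData(tokens, start) {
  const [kw, name, eq, value] = tokens.slice(start, start + 4);
  const ok =
    kw && kw.content === "data" &&
    name && name.denominal === "identifier" &&
    eq && eq.content === "=" &&
    value !== undefined;
  if (!ok) throw new SyntaxError(`Malformed variable declaration at token ${start}`);
  let next = start + 4;
  if (tokens[next] && tokens[next].content === ";") next += 1;
  return {
    node: {
      instruction: "variable_declaration",
      variable_name: name.content,
      variable_value: value.content,
      variable_type: inferType(value),
    },
    next,
  };
}

// Top-level loop: dispatch on the current token, then jump to `next`.
function parseProgram(tokens) {
  const instructions = [];
  let i = 0;
  while (i < tokens.length) {
    if (tokens[i].content === "data") {
      const { node, next } = parseData(tokens, i);
      instructions.push(node);
      i = next;
    } else {
      throw new SyntaxError(`Unexpected "${tokens[i].content}" at token ${i}`);
    }
  }
  return instructions;
}

// Tokens 0 to 8 from the lexer output.
const tokens = [
  { content: "data", denominal: "keyword" },
  { content: "myVariable", denominal: "identifier" },
  { content: "=", denominal: "operator" },
  { content: "4", denominal: "number" },
  { content: "data", denominal: "keyword" },
  { content: "myVariable2", denominal: "identifier" },
  { content: "=", denominal: "operator" },
  { content: '"myName"', denominal: "string" },
  { content: ";", denominal: "punctuator" },
];

const program = parseProgram(tokens);
console.log(program.length); // 2
console.log(program[1].variable_type); // String
```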

I appreciate all your help!

Thanks a lot in advance!

Alex Tudor
  • I don't know what the underlying grammar of your language is, but most programming languages are based on context-free grammars and require something more powerful than a finite automaton (a regex engine, e.g.) to parse them; they need a pushdown automaton. Notwithstanding the fact that *some* regex engines have been enhanced to emulate pushdown automata, I can't help but feel that for your next phase, i.e. parsing, you need to (or should) switch to a different tool. Maybe check out *recursive descent parsing*. Just my two cents. – Booboo Nov 12 '19 at 17:42
  • @RonaldAaronson I appreciate your advice, but for this compiler (at least in its case), I would like to keep it as basic as possible - for example, these custom variable / function declarations are meant to be "transformed" immediately afterwards into JS code - not too far from a .replace() but a little bit smarter :) The thing is that in the case of recursion I have to know the exact location for "pushing" into the object, which is the hardest part for me. – Alex Tudor Nov 12 '19 at 19:48
  • @RonaldAaronson For example, I would start with location "null", then after creating function_declaration I would not know if I should replace location with "AST[0]" or "AST[1]". Initially, this was my first option, but I found out I don't know how to set this location in code (by "entering" inside a new child / "exiting" a full child to go back to its parent object/array). If you could attach an example with this kind of recursivity, you would be a big big genius! Thanks a lot! :) :) – Alex Tudor Nov 12 '19 at 19:52
  • One question is whether the punctuator `}` can appear inside the body of your function. If it can, then you almost certainly cannot do this with regex-like tools. You will need something more sophisticated. I haven't done a great deal of parsing, but I have managed to write some parsers with Peg.js and with just plain state machines simulating pushdown automata. Either could work here, I would expect. – Scott Sauyet Nov 19 '19 at 18:43
  • @ScottSauyet please give me an informative example of how it would look assuming no } inside the function (just like Python ifs :)). The primary purpose of building this AST for my compiler is **checking for any syntax errors** caused by bad input & **converting this object somehow back to JavaScript code**, maybe using an eval afterwards. So, the main need is the actual transformation of the object, not an exceptional case, and if you could attach any code snippet that would help me with that (I initially thought of a recursion-based method), it would be magnificent. Thanks a lot! – Alex Tudor Nov 20 '19 at 08:32
  • Well, you say a recursion-based method, but just as regular expressions can (generally) only parse regular languages, any technique patterned off them (without some helper stack) will not be able to parse recursive structures. Note that Python `if` statements can be nested to arbitrary degrees too, so you can't use regex to parse Python. Think about trying to write a regex to parse arbitrarily nested parentheses, finding which ones are legal, e.g. `((()((()())()))())` is legal but `((())((()())())()))` is illegal. – Scott Sauyet Nov 20 '19 at 16:03
  • But also note that "please give me an informative example about..." is not how StackOverflow works. Please visit the [help center](https://stackoverflow.com/help), and read up on [asking good questions](https://stackoverflow.com/help/asking). After doing some research and [searching](https://stackoverflow.com/help/searching) for related topics on SO, try it yourself. If you're stuck, post a [Minimal, Complete, and Verifiable example](https://stackoverflow.com/help/mcve) of your attempt and note exactly where you're stuck. People will be glad to help. – Scott Sauyet Nov 20 '19 at 16:04
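The nested-parentheses point and the earlier question about where to "push" a new child can both be illustrated with a small recursive sketch. It is only an illustration, with hypothetical names `parseGroup` and `isBalanced`. Each call builds and returns its own node, and the caller pushes that node into its own children, so no global "location" such as AST[0] or AST[1] needs to be tracked. The call stack plays the role of the pushdown automaton's stack.

```javascript
// Parses one "( ... )" group starting at index i of a string of parentheses.
// Returns the nested node plus the index just past the closing ")".
function parseGroup(src, i) {
  if (src[i] !== "(") throw new SyntaxError(`Expected "(" at position ${i}`);
  const node = { children: [] };
  i += 1;
  while (src[i] === "(") {
    const child = parseGroup(src, i); // "entering" a child is a recursive call
    node.children.push(child.node);   // the parent decides where the child goes
    i = child.next;                   // "exiting" a child is just its return
  }
  if (src[i] !== ")") throw new SyntaxError(`Expected ")" at position ${i}`);
  return { node, next: i + 1 };
}

// A string is legal if it is a sequence of complete groups.
function isBalanced(src) {
  try {
    let i = 0;
    while (i < src.length) i = parseGroup(src, i).next;
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

console.log(isBalanced("((()((()())()))())"));  // true
console.log(isBalanced("((())((()())())()))")); // false
```

The same shape carries over to token arrays: replace the character comparisons with checks on `tokens[i].content`.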

0 Answers