
I developed a lexical analyzer function which takes a string and separates the items in the string into an array, like this:

const lexer = (str) =>
  str
    .split(" ")
    .map((s) => s.trim())
    .filter((s) => s.length);

console.log(lexer("John Doe")); // outputs ["John", "Doe"]

Now I want to develop a lexical analyzer with JavaScript that analyzes types, something like this:

if (foo) {
  bar();
}

and returns the output like this:

[
  {
    lexeme: 'if',
    type: 'keyword',
    position: {
      row: 0,
      col: 0
    }
  },
  {
    lexeme: '(',
    type: 'open_paren',
    position: {
      row: 0,
      col: 3
    }
  },
  {
    lexeme: 'foo',
    type: 'identifier',
    position: {
      row: 0,
      col: 4
    }
  },
  ...
]

How can I develop a lexical analyzer with JavaScript to identify types?

Thanks in advance.

Mehdi Faraji
    Same way as in any other language (if you're really set on doing this from scratch). Just searching will provide both how-tos and existing lexers. – Dave Newton Jul 22 '21 at 18:17
    As asked the question is quite broad--lexing and parsing are the subjects of entire books and semester-long projects. If you have *specific* questions they're welcome, but as it stands, your best bet would be to do some research first. – Dave Newton Jul 22 '21 at 18:31

1 Answer


The most common pattern I've seen for lexing in JavaScript (in e.g. KaTeX and CoffeeScript) is to define a regular expression that encompasses all the tokens you might see, and somehow iterate through the matches of that regular expression.

Here's a simple lexer that covers your JavaScript example (but also skips over invalid content):

const input = `if (foo) {
  bar();
}`;

const tokenRegExp = /[(){}\n]|(\w+)/g;
const tokenMap = {
  '(': 'open_paren',
  ')': 'close_paren',
  '{': 'open_brace',
  '}': 'close_brace',
};
let row = 0, lineStart = 0; // index in input where the current line begins
const tokens = [];
let match;
while ((match = tokenRegExp.exec(input)) !== null) {
  let type;
  if (match[1]) { // use groups to identify which part of the RegExp is matching
    type = 'identifier';
  } else if (tokenMap[match[0]]) { // use lookup table for simple tokens
    type = tokenMap[match[0]];
  }
  if (type) {
    tokens.push({
      lexeme: match[0],
      type,
      // compute col from the match's index so skipped whitespace is counted
      position: {row, col: match.index - lineStart},
    });
  }
  // Update row number and line start on newlines
  if (match[0] === '\n') {
    row++;
    lineStart = tokenRegExp.lastIndex;
  }
}

Other parsers will use the regular expression to match a prefix of the string, then discard that part of the string, and continue matching from where it left off. (This avoids skipping over invalid content.)
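A minimal sketch of that prefix-matching style uses the RegExp sticky (`y`) flag, which forces each `exec` to match exactly at `lastIndex`, so invalid input is detected rather than skipped. The `lex` function and its token types here are my own illustration, not taken from any particular library:

```javascript
// Sticky flag: each exec must match at lastIndex or fail.
const tokenRegExp = /(\s+)|(\w+)|[(){}]/y;

function lex(input) {
  const tokens = [];
  tokenRegExp.lastIndex = 0;
  while (tokenRegExp.lastIndex < input.length) {
    const pos = tokenRegExp.lastIndex;
    const match = tokenRegExp.exec(input);
    if (match === null) {
      // Nothing matched at this position: invalid input, not silently skipped
      throw new Error(`Unexpected character at index ${pos}`);
    }
    if (!match[1]) { // group 1 is whitespace; discard it, keep everything else
      tokens.push({
        lexeme: match[0],
        type: match[2] ? 'identifier' : 'punctuation',
      });
    }
  }
  return tokens;
}

console.log(lex('if (foo)'));
// lexemes: 'if', '(', 'foo', ')'
```

On failure, a sticky `exec` returns `null` instead of scanning ahead, which is exactly the "match a prefix, then continue where it left off" behavior described above; extending it to rows/columns works the same way as in the first example.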

I wouldn't recommend writing your own JavaScript lexer though, except for educational purposes; there are many out there that will probably catch more edge cases than you can without a lot of effort.

edemaine