0

I need to analyze Elisp (Emacs Lisp) code so I wrote a parser for it using Instaparse. I expected it to be slow but doing 1k lines per second is way too slow to be right even on a calculator (or my pretty old i7). Can it be that bad or do I do something extremely wrong?

It's unambiguous and I tried to keep look ahead/behinds at minimum, unfortunately Elisp is very liberal with what constitutes as a symbol so I had to add some ahead/behinds there to differentiate numbers and symbols. Also I tried to deffer this by parsing symbols, numbers and keywords as "ident" it only gave me back like 30% of time. From my tests, it looks like Instaparse struggles a lot with recursive rules and lisps have highly recursive nature so maybe I didn't mess it up - it's just that slow...

The parser:

(ns slowparse
  (:require [clojure.string :as str]
            [instaparse.combinators :as c]
            [instaparse.core :as insta]))

(def grammar
  "Elisp grammar."
  "<root> = any +

  <any> = sexp | keyword | number | symbol | prefix | string | vector |
          comment | whitespace | char | Epsilon

  comment = comment-tok #'(?:[^\\n]*|$)'

  string = <str-l-tok> #'(?:(?:\\\\\\\\)|(?:\\\\\")|[^\"])*' <str-r-tok>

  char = <char-tok> #'(?:(?:\\\\(?:C|M)-)|(?:\\\\))?(?:.|\\s)'

  <whitespace> = <#'\\s+'>

  sexp   = sexp-l-tok any + sexp-r-tok

  vector = vec-l-tok any + vec-r-tok

  <prefix>   = quote | template | spread | hole

  <prfxbl> = sexp | symbol | keyword | number | prefix | vector

  quote    = quote-tok prfxbl
  template = tmpl-tok prfxbl
  hole     = hole-tok ! spread-tok prfxbl
  spread   = hole-tok spread-tok prfxbl

  <sexp-l-tok>      = <'('>
  <sexp-r-tok>      = <')'>

  <vec-l-tok>       = <'['>
  <vec-r-tok>       = <']'>

  <str-l-tok>       = <'\"'>
  <str-r-tok>       = <'\"'>

  <quote-tok>       = '#' ? <\"'\">

  <tmpl-tok>        = <'`'>

  <num-b-x-tok>     = '#'

  <hole-tok>        = <','>

  <spread-tok>      = <'@'>

  <comment-tok>     = <';'>

  <char-tok>        = '?'

  <kv-tok>          = <':'>

  symbol    = ! ( number | kv-tok | comment-tok | num-b-x-tok | char-tok )
               ident

  keyword = kv-tok ident

  number    = num-b10 | num-bx
  <num-b10> = #'[-+]?(?:(?:[\\d]*\\.[\\d]+)|(?:[\\d]+\\.[\\d]*)|(?:[\\d]+))' &
              ( ! ident )
  <num-bx>  = #'(?i)#(?:b|o|x|(?:\\d+r))[-+]?[a-z0-9]+'")

(def ident
  {:ident
   (let [esc-ch (str/join ["\\[" "\\]" "\\(" "\\)" "\"" "\\s" "'" "," "`" ";"])
         tmpl "(?:(?:\\\\[{{ec}}])|[^{{ec}}])+"]
     (->> esc-ch (str/replace tmpl "{{ec}}") c/regexp c/hide-tag))})

(insta/defparser ^{:doc "Elisp parser."} elisp-parser
  (merge ident (c/ebnf grammar))
  :start :root)

(def test-text (slurp "/tmp/foo.el"))

(time (insta/parse elisp-parser test-text))
JAre
  • 4,666
  • 3
  • 27
  • 45
  • 1K lines/sec doesn't sound too slow to me, esp. considering it is a high-level, functional, dynamic language. What would be the cost in labor hours (times your hourly rate!) if you recoded everything in lexx/yacc using C? – Alan Thompson Oct 02 '20 at 04:01
  • Yes, instaparse is slow. When I discovered this I immediately switched to ANTLR. – akond Oct 02 '20 at 04:03
  • @akond Hmm. Looks like syntax for the rules is similar, except regex fragments look limited. So maybe it is worth a while to port the parser. How much faster ANTLR is ? – JAre Oct 02 '20 at 04:17
  • 1
    I don't remember the exact numbers, but at least 10x faster. To put it another way, it was good enough for me. – akond Oct 02 '20 at 04:25
  • 1
    The "any" definition is an unordered choice that always matches at least 2 branches, one of which is Epsilon. If you can find a way to reduce the search space, you might speed up the parse without having to back off to a more primitive tool. – Biped Phill Oct 02 '20 at 09:29
  • Maybe a dumb question: what does Instaparse give you that a simple read can't do? I've always regarded LISP's "program as readable data structure" concept one of its biggest strengths. – Ralf Kleberhoff Oct 02 '20 at 12:29
  • @BipedPhill Yeah I tried ordering - it doesn't improve things substantially. Instaparse seems to be able to do more optimizations on unordered choices. – JAre Oct 02 '20 at 16:04
  • @RalfKleberhoff It's somewhat complicated with prefixes and the fact that symbols can contain pretty much anything. Sure I can do it with read + bracket counting and a bunch or regex. The man thing Instaparse gave me here is readable and recognizable format to put all the bits and pieces together. Also it can use state under the hood and be ugly without me needing to touch it :) Besides, I don't think it will be particularly fast since you have to combine rules in a clever way. – JAre Oct 02 '20 at 16:11

2 Answers2

1

As @akond suggested I ported the grammar to ANTLR (using https://github.com/aphyr/clj-antlr). It parses 1k lines in 20ms or less... Yeah looks like Instaparse is really slow.

Didn't have to change much, but Instaparse definitely feels a lot nicer to write rules for. It has simple ordering and look ahead/behind, standard regex, easy way to hide junk.

ANTLR grammar:

(ns fastparse
  (:require [clj-antlr.core :as antlr]))

(def grammar
  "Elisp grammar."
  "grammar EmacsLisp ;

   source: any* EOF ;

   any: list | keyword | number | symbol | prefix | string | vector | char |
        whitespace | comment;

   vector: '[' any* ']' ;

   list: '(' any* ')' ;

   prefix: quote | template | spread | hole ;

   quote: '#' ? '\\'' any ;

   template: '`' any ;

   spread: ',@' any ;

   hole: ',' any ;

   number: NUMB10 | NUMBX ;

   char: CHAR ;

   string: STRING ;

   keyword: KEYWORD ;

   symbol: IDENT ;

   whitespace: WS ;

   comment: COMLINE ;

   CHAR: '?' ( ( '\\\\' ( 'C' | 'M' ) '-' ) | '\\\\' )? . ;

   STRING: '\"' ( '\\\\\\\\' | '\\\\\"' | . )*? '\"' ;

   NUMB10: [+-] ? ( ( D* '.' D+ ) | ( D+ '.' D* ) | D+ ) ;

   NUMBX: '#' ( 'b' | 'o' | 'x' | ( D+ 'r' ) ) [-+]? ( A | D )+ ;

   fragment
   D: '0'..'9' ;

   fragment
   A: 'a'..'z' ;

   KEYWORD: ':' IDENT ;

   IDENT: ( ( '\\\\' [\\\\[\\]() \\n\\t\\r\"',`;] )+? |
            ( ~[[\\]() \\n\\t\\r\"',`;] )+? )+ ;

   COMLINE: ';' ~[\\n\\r]* ;

   WS: [ \\n\\t\\r]+ ;")

(def elisp-str->edn (antlr/parser grammar))

(def text (slurp "/tmp/foo.el"))

(time (elisp-str->edn text))
JAre
  • 4,666
  • 3
  • 27
  • 45
0

If you are interested in speed, and you do not want to worry for stack overflow occurrences, you may try Tunnel Grammar Studio, a parser generator I work on. The generated parsers from it are iterative during lexing, parsing, the tree construction, tree iteration, tree to string conversion and tree release. The accepted grammars are in ABNF (RFC 5234) with case sensitivity per token (RFC 7405).

Its a good idea to have a deterministic grammar with any parser you use. TGS does check for LL(1) conflicts at compile time, and will help you to create a deterministic grammar by visualizing the conflict places.

There is a demo of the tool, you can test the speed yourself. There is an option in the tool to generate fully ready test case project that will log into the console at runtime the time taken to parse, iterate the tree and free it, by only supplying the input data. Meaning that no development (except compiling the generated code) from you is expected if you want to test the speed for a grammar.

In my tests with a JSON grammar (RFC 8259) with removed ambiguity, that only emits syntax tree build events (like SAX) the iterative parser runs with around 8 megabytes per second, that are many lines per second and takes memory only proportional to the depth of the parsing, because only one token is technically needed at runtime for LL(1) grammars, i.e. it is practically "streaming" the input.

You can have also statically typed or dynamically typed concrete syntax tree, or dynamically typed abstract syntax tree with different levels of abstraction (i.e. automatic node pruning). The syntax trees builders (if chosen) for this trees are using the build events to create the relevant trees. You will need an ABNF grammar and C++ as a language target however.

The tool supports token ranges inside the parser grammar (additionally to the character ranges inside the lexer grammar). This means that you may develop your grammar without the extra care of the lexical rules order.

Nikolay Handzhiyski
  • 1,360
  • 1
  • 6
  • 20