I need to analyze Elisp (Emacs Lisp) code, so I wrote a parser for it using Instaparse. I expected it to be slow, but 1k lines per second is far too slow to be right, even on a calculator (or my pretty old i7). Can it really be that bad, or am I doing something badly wrong?
The grammar is unambiguous, and I tried to keep lookaheads/lookbehinds to a minimum. Unfortunately, Elisp is very liberal about what constitutes a symbol, so I had to add some there to distinguish numbers from symbols. I also tried to defer that decision by parsing symbols, numbers and keywords as a single "ident", but that only won back about 30% of the time. From my tests, it looks like Instaparse struggles a lot with recursive rules, and Lisps are highly recursive by nature, so maybe I didn't mess it up and it's just that slow...
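A minimal sketch of that "ident" deferral, assuming the grammar emits one raw token for anything that could be a symbol, number or keyword, and that the decision is made after parsing. The names `b10-re`, `bx-re` and `classify-ident` are hypothetical and are not part of the parser below. The regexes mirror the `num-b10` and `num-bx` rules.

```clojure
(require '[clojure.string :as str])

;; Base-10 numbers: optional sign, then "1.5", ".5", "1." or "15".
(def b10-re #"[-+]?(?:\d*\.\d+|\d+\.\d*|\d+)")

;; Radix numbers: #b, #o, #x or #NNr followed by digits/letters.
(def bx-re #"(?i)#(?:b|o|x|\d+r)[-+]?[a-z0-9]+")

(defn classify-ident
  "Turns a raw token string into a [:number ..], [:keyword ..] or
  [:symbol ..] node. re-matches only succeeds on a full match, so a
  token like \"1+\" falls through to :symbol without any lookahead."
  [token]
  (cond
    (or (re-matches b10-re token)
        (re-matches bx-re token)) [:number token]
    (str/starts-with? token ":")  [:keyword (subs token 1)]
    :else                         [:symbol token]))
```

If the ident rule were left visible as `[:ident "tok"]` nodes (an assumption, since the grammar below hides that tag), something like `(insta/transform {:ident classify-ident} tree)` could apply the classification to the whole tree.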
The parser:
(ns slowparse
  (:require [clojure.string :as str]
            [instaparse.combinators :as c]
            [instaparse.core :as insta]))

(def grammar
  "Elisp grammar."
  "<root> = any +
   <any> = sexp | keyword | number | symbol | prefix | string | vector |
           comment | whitespace | char | Epsilon
   comment = comment-tok #'(?:[^\\n]*|$)'
   string = <str-l-tok> #'(?:(?:\\\\\\\\)|(?:\\\\\")|[^\"])*' <str-r-tok>
   char = <char-tok> #'(?:(?:\\\\(?:C|M)-)|(?:\\\\))?(?:.|\\s)'
   <whitespace> = <#'\\s+'>
   sexp = sexp-l-tok any + sexp-r-tok
   vector = vec-l-tok any + vec-r-tok
   <prefix> = quote | template | spread | hole
   <prfxbl> = sexp | symbol | keyword | number | prefix | vector
   quote = quote-tok prfxbl
   template = tmpl-tok prfxbl
   hole = hole-tok ! spread-tok prfxbl
   spread = hole-tok spread-tok prfxbl
   <sexp-l-tok> = <'('>
   <sexp-r-tok> = <')'>
   <vec-l-tok> = <'['>
   <vec-r-tok> = <']'>
   <str-l-tok> = <'\"'>
   <str-r-tok> = <'\"'>
   <quote-tok> = '#' ? <\"'\">
   <tmpl-tok> = <'`'>
   <num-b-x-tok> = '#'
   <hole-tok> = <','>
   <spread-tok> = <'@'>
   <comment-tok> = <';'>
   <char-tok> = '?'
   <kv-tok> = <':'>
   symbol = ! ( number | kv-tok | comment-tok | num-b-x-tok | char-tok )
            ident
   keyword = kv-tok ident
   number = num-b10 | num-bx
   <num-b10> = #'[-+]?(?:(?:[\\d]*\\.[\\d]+)|(?:[\\d]+\\.[\\d]*)|(?:[\\d]+))' &
               ( ! ident )
   <num-bx> = #'(?i)#(?:b|o|x|(?:\\d+r))[-+]?[a-z0-9]+'")

(def ident
  {:ident
   (let [esc-ch (str/join ["\\[" "\\]" "\\(" "\\)" "\"" "\\s" "'" "," "`" ";"])
         tmpl   "(?:(?:\\\\[{{ec}}])|[^{{ec}}])+"]
     (->> esc-ch (str/replace tmpl "{{ec}}") c/regexp c/hide-tag))})

(insta/defparser ^{:doc "Elisp parser."} elisp-parser
  (merge ident (c/ebnf grammar))
  :start :root)

(def test-text (slurp "/tmp/foo.el"))

(time (insta/parse elisp-parser test-text))
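
A single `time` call on a cold JVM also includes JIT warm-up, so the lines-per-second figure may be somewhat pessimistic. Below is a rough sketch of a steadier measurement. The helper `lines-per-second` and its `:warmup`/`:runs` defaults are hypothetical choices for illustration, not anything provided by Instaparse.

```clojure
(require '[clojure.string :as str])

(defn lines-per-second
  "Calls (parse-fn text) `warmup` times untimed, then `runs` times timed,
  and returns the average number of input lines handled per second."
  [parse-fn text & {:keys [warmup runs] :or {warmup 3 runs 5}}]
  (let [n-lines (count (str/split-lines text))]
    ;; Untimed passes so the JIT has a chance to compile the hot paths.
    (dotimes [_ warmup] (parse-fn text))
    (let [start   (System/nanoTime)
          _       (dotimes [_ runs] (parse-fn text))
          elapsed (/ (- (System/nanoTime) start) 1e9)] ; seconds, as a double
      (/ (* runs n-lines) elapsed))))
```

With the parser above it could be called as `(lines-per-second #(insta/parse elisp-parser %) test-text)`.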