Tokenizing a string in Clojure

Question

I am trying to tokenize a string using Clojure. The basic tokenization rules require the string to be split into separate tokens as follows:

1. String literals of the form "hello world" are a single token
2. Every word that is not part of a string literal is a single token
3. Every non-word character is a separate token

For example, given the string: length=Keyboard.readInt("HOW MANY NUMBERS? ");

I would like it to be tokenized as:

["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]

I have been able to write a function that splits a string according to rules 2 and 3 above, but I am having trouble with the first rule. Currently the above string is split as follows:

["length" "=" "Keyboard" "." "readInt" "(" "\"HOW" "MANY" "NUMBERS?" "\"" ")" ";"]

Here is my function:

(require '[clojure.string :as string])

(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
    (-> (string/trim LineOfJackFile)
        ;; get rid of all comments
        (string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
        ;; split into tokens using zero-width look-behind and look-ahead
        (string/split #"\s+|(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))

How can I write a function that will split a string into tokens following all three of the above rules? Alternatively, what other approach should I take to achieve the desired tokenization? Thank you.

Instead of regexes (wrong tool for the job!) I recommend instaparse. https://github.com/Engelberg/instaparse — mike3996, Jun 05 '14 at 12:43
@progo Thanks, but would be nice to find a simple built-in way to do this. — Yechiel Labunskiy, Jun 05 '14 at 12:50
@noisesmith I am not trying to parse the string at this point, rather to split it into tokens. — Yechiel Labunskiy, Jun 05 '14 at 14:07
lexing is an implementation detail, some parsers (like yacc) have separate lexing stages, some (like clojure's instaparse) don't separate those stages — noisesmith, Jun 05 '14 at 14:11
Thanks for the suggestion, I ended up using Instaparse which got the job done nicely. Would still be nice to know a way to do what I wanted in regex (seems like there should be a way). — Yechiel Labunskiy, Jun 07 '14 at 19:35

score 1 · Answer 1 · answered Jun 05 '14 at 17:03

Removing the initial \s+| alternative from your split pattern makes it work the way you want for this input. That alternative is what splits the string on whitespace characters, including the ones inside the string literal.

(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
    (-> (clojure.string/trim LineOfJackFile)
        ;; get rid of all comments
        (clojure.string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
        ;; split into tokens using zero-width look-behind and look-ahead
        (clojure.string/split #"(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))

(def input "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
(TokenizeJackLine input)

Produces this output:

("length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";")

Yes, but it will fail on other strings. Ex: for the string: `"function void main() {"` the result is incorrect: `["function void main" "(" ")" " " "{"]` — Yechiel Labunskiy, Jun 05 '14 at 17:08
Ah... ok, you are right. You need a way to pick off the quoted string, and that's a bit beyond my regex skills. — Danny, Jun 05 '14 at 17:50
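
For what it's worth, one regex-only way to pick off the quoted string is to match the tokens themselves with re-seq rather than splitting between them. The pattern tries a whole string literal first, then a word, then any single non-word, non-whitespace character, so whitespace outside a literal is simply skipped. This is a minimal sketch, not a full Jack tokenizer: it assumes string literals never contain escaped quotes and that comments have already been removed, and tokenize-line is a hypothetical name that does not appear in the question.

```clojure
;; Hypothetical helper (name not from the question).
;; Assumes: no escaped quotes inside string literals, comments already stripped.
(defn tokenize-line
  "Returns the tokens of one line as a vector of strings."
  [line]
  ;; Alternatives are tried left to right at each position:
  ;;   \"[^\"]*\"  a complete string literal, quotes included
  ;;   \w+         a word (letters, digits, underscore)
  ;;   [^\w\s]     any single non-word, non-whitespace character
  (vec (re-seq #"\"[^\"]*\"|\w+|[^\w\s]" line)))

(tokenize-line "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
;; => ["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]

(tokenize-line "function void main() {")
;; => ["function" "void" "main" "(" ")" "{"]
```

Because the literal alternative comes first, the spaces inside "HOW MANY NUMBERS? " are consumed as part of that one token. An unterminated quote falls through to the last alternative and comes out as a token of its own.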
