I am trying to tokenize a string in Clojure. The basic tokenization rules require the string to be split into tokens as follows:

  1. String literals of the form "hello world" are a single token
  2. Every word that is not part of a string literal is a single token
  3. Every non-word character is a separate token

For example, given the string: length=Keyboard.readInt("HOW MANY NUMBERS? ");

I would like it to be tokenized as:

["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]

I have been able to write a function that splits a string according to rules 2 and 3 above, but I am having trouble fulfilling the first rule. Currently, the string above is split as follows:

["length" "=" "Keyboard" "." "readInt" "(" "\"HOW" "MANY" "NUMBERS?" "\"" ")" ";"]

Here is my function:

(require '[clojure.string :as string])

(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
    (-> (string/trim LineOfJackFile)
        ; get rid of all comments
        (string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
        ; split on whitespace, and at the zero-width positions just after or just before a symbol character
        (string/split #"\s+|(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))

How can I write a function that will split a string into tokens following all three of the above rules? Alternatively, what other approach should I take to achieve the desired tokenization? Thank you.

Infinite Recursion

1 Answer

Removing the initial \s+| alternative from your split pattern makes it work the way you want for your example input. That alternative is what causes the string to be split on whitespace characters.

(require 'clojure.string)

(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
    (-> (clojure.string/trim LineOfJackFile)
        ; get rid of all comments
        (clojure.string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
        ; split at the zero-width positions just after or just before a symbol character
        (clojure.string/split #"(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))

(def input "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
(TokenizeJackLine input)

This produces the following output:

("length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";")
Danny
  • Yes, but it fails on other strings. For example, for the string `"function void main() {"` the result is incorrect: `["function void main" "(" ")" " " "{"]` – Yechiel Labunskiy Jun 05 '14 at 17:08
  • Ah... ok, you are right. You need a way to pick off the quoted string, and that's a bit beyond my regex skills. – Danny Jun 05 '14 at 17:50
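
One possible way to pick off the quoted string is to stop splitting and match the tokens themselves with re-seq. The pattern lists the string-literal alternative first, so a quoted string is consumed whole before the word and single-symbol alternatives get a chance. This is only a sketch under some assumptions: the names token-pattern and tokenize-line are made up for illustration, string literals are assumed to contain no escaped quotes, and comment stripping is left out (the comment regex from the question would also eat a * inside a string literal).

```clojure
;; Hypothetical names; a sketch, not a complete tokenizer.
;; Alternatives are tried left to right at each position:
;;   \"[^\"]*\"  a string literal (assumes no escaped quotes inside)
;;   \w+         a run of word characters
;;   [^\w\s]     any single character that is neither word nor whitespace
(def token-pattern
  #"\"[^\"]*\"|\w+|[^\w\s]")

(defn tokenize-line
  "Returns a vector of the tokens found in line. Whitespace between
  tokens is skipped because no alternative matches it."
  [line]
  (vec (re-seq token-pattern line)))

(tokenize-line "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
;; => ["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]

(tokenize-line "function void main() {")
;; => ["function" "void" "main" "(" ")" "{"]
```

Because the pattern has no capturing groups, re-seq returns plain strings rather than vectors of groups, so no post-processing with filter is needed.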