I am trying to tokenize a string using Clojure. The tokenization rules require the string to be split into separate tokens as follows:
1. String literals of the form "hello world" are a single token.
2. Every word that is not part of a string literal is a single token.
3. Every non-word character is a separate token.
For example, given the string:
length=Keyboard.readInt("HOW MANY NUMBERS? ");
I would like it to be tokenized as:
["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]
I have been able to write a function that splits a string according to rules 2 and 3 above, but I am having trouble fulfilling the first rule. Currently, the above string is split as follows:
["length" "=" "Keyboard" "." "readInt" "(" "\"HOW" "MANY" "NUMBERS?" "\"" ")" ";"]
Here is my function:
(require '[clojure.string :as string])

(defn TokenizeJackLine [LineOfJackFile]
  (filter not-empty
          (-> (string/trim LineOfJackFile)
              ; get rid of all comments
              (string/replace #"(//.*)|(\s*/?\*.*?($|\*/))|([^/\*]*\*/)" "")
              ; split on whitespace and, via zero-width look-behind/look-ahead, around each symbol character
              (string/split #"\s+|(?<=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])|(?=[\{\}\(\)\[\]\.,;+\-\*/&\|<>=~])"))))
How can I write a function that will split a string into tokens following all three of the above rules? Alternatively, what other approach should I take to achieve the desired tokenization? Thank you.
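One direction that might satisfy all three rules is to match tokens with re-seq rather than split on delimiters. The following is only a minimal sketch, not a tested solution for real Jack files: the names token-pattern and tokenize-line are hypothetical, it assumes string literals contain no escaped quotes, and it assumes comments have already been removed by a separate step.

```clojure
(require '[clojure.string :as string])

;; Hypothetical pattern (an assumption, not part of the original function).
;; Alternatives are tried left to right, so a quoted literal is matched
;; before the single-character rule can consume its opening quote.
(def token-pattern
  #"\"[^\"]*\"|\w+|[^\w\s]")
;;  ^ literal   ^ word ^ any single non-word, non-space character

(defn tokenize-line
  "Returns a vector of the tokens found in line. Assumes comments were
  stripped beforehand and that string literals have no escaped quotes."
  [line]
  (vec (re-seq token-pattern (string/trim line))))

(tokenize-line "length=Keyboard.readInt(\"HOW MANY NUMBERS? \");")
;; => ["length" "=" "Keyboard" "." "readInt" "(" "\"HOW MANY NUMBERS? \"" ")" ";"]
```

Because the pattern has no capturing groups, re-seq returns plain strings, and whitespace between tokens is skipped since no alternative matches it. An unterminated quote would fall through to the single-character rule and come out as its own token.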