Functionally split a string by whitespace, group by quotes!

Question

Writing idiomatic functional code in Clojure[1], how would one write a function that splits a string by whitespace but keeps quoted phrases intact? A quick solution is of course to use regular expressions, but this should be possible without them. At a quick glance it seems pretty hard! I've written something similar in imperative languages, but I'd like to see how a functional, recursive approach works.

A quick rundown of what our function should do:

"Hello there!"  -> ["Hello", "there!"]
"'A quoted phrase'" -> ["A quoted phrase"]
"'a' 'b' c d" -> ["a", "b", "c", "d"]
"'a b' 'c d'" -> ["a b", "c d"]
"Mid'dle 'quotes do not concern me'" -> ["Mid'dle", "quotes do not concern me"]

I don't mind if the spacing changes between the quotes (so that one can use simple splitting by whitespace first).

"'lots    of   spacing' there" -> ["lots of spacing", "there"] ;is ok to me
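
The examples above can be turned into a small table of checks. This is only a sketch: `examples`, `normalize` and `failures` are names made up here, not part of the question. Inner spacing is collapsed before comparing, because either spacing is acceptable.

```clojure
(require '[clojure.string :as str])

;; Hypothetical test table: input string -> expected items.
(def examples
  {"Hello there!"                       ["Hello" "there!"]
   "'A quoted phrase'"                  ["A quoted phrase"]
   "'a' 'b' c d"                        ["a" "b" "c" "d"]
   "'a b' 'c d'"                        ["a b" "c d"]
   "Mid'dle 'quotes do not concern me'" ["Mid'dle" "quotes do not concern me"]
   "'lots    of   spacing' there"       ["lots of spacing" "there"]})

(defn normalize
  "Collapse runs of whitespace, since the spacing inside quotes may change."
  [s]
  (str/replace s #"\s+" " "))

(defn failures
  "Return one map per example where split-fn disagrees with the table."
  [split-fn]
  (for [[in expected] examples
        :let [actual (map normalize (split-fn in))]
        :when (not= expected actual)]
    {:input in :expected expected :actual actual}))
```

Calling `(failures some-splitter)` should give an empty sequence when every example passes, and otherwise lists the inputs that went wrong.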

[1] This question could be answered at a general level, but I guess that a functional approach in Clojure can be translated to Haskell, ML, etc. with ease.

In the example with "middle quotes" I noticed that a single quote got left out completely. Was this intentional? — Deniz Dogan, Dec 02 '10 at 12:26
Intentional. My view to this problem is that only the beginning and end of a word matters. Don't know if it's practical though... — mike3996, Dec 02 '10 at 12:27
I guess an okay approach to this would be to first just split the strings at spaces similar to Python's `split`. This should be nearly trivial. Then you could probably look through the list for any word that begins with an apostrophe and if one is found, continue looking until you find a word that ends with one, then merge the elements you moved over. Kind of... — Deniz Dogan, Dec 02 '10 at 12:35
Deniz: my imperative approach used that method. I'm sketching a recursive solution but don't know if that is going to work... — mike3996, Dec 02 '10 at 12:38
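
The split-then-merge idea from the comments might look roughly like this. It is a sketch under a few assumptions: the name `merge-quoted` is invented here, the initial split uses a regex via `clojure.string/split`, and `starts-with?` / `ends-with?` need Clojure 1.8 or later. Spacing inside quotes is collapsed to single spaces, which the question allows.

```clojure
(require '[clojure.string :as str])

(defn merge-quoted
  "Split on whitespace, then glue together the words between a word that
  starts with ' and the next word that ends with '."
  [s]
  (loop [words (remove empty? (str/split s #"\s+"))
         open  nil   ; words of the phrase being collected, nil when outside quotes
         out   []]
    (if-let [[w & more] (seq words)]
      (let [opens?  (and (nil? open) (str/starts-with? w "'"))
            body    (if opens? (subs w 1) w)
            closes? (and (or open opens?) (str/ends-with? body "'"))
            body    (if closes? (subs body 0 (dec (count body))) body)
            phrase  (conj (or open []) body)]
        (cond
          closes?          (recur more nil (conj out (str/join " " phrase)))
          (or open opens?) (recur more phrase out)
          :else            (recur more nil (conj out w))))
      ;; an unterminated phrase is emitted as a single item
      (if open (conj out (str/join " " open)) out))))
```

A quote in the middle of a word, as in `Mid'dle`, is left alone because only the first and last character of each word are inspected.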

Michał Marczyk · Accepted Answer · 2010-12-05T00:28:15.253

Here's a version returning a lazy seq of words / quoted strings:

(defn splitter [s]
  (lazy-seq
   (when-let [c (first s)]
     (cond
      (Character/isSpace c)
      (splitter (rest s))
      (= \' c)
      (let [[w* r*] (split-with #(not= \' %) (rest s))]
        (if (= \' (first r*))
          (cons (apply str w*) (splitter (rest r*)))
          (cons (apply str w*) nil)))
      :else
      (let [[w r] (split-with #(not (Character/isSpace %)) s)]
        (cons (apply str w) (splitter r)))))))

A test run:

user> (doseq [x ["Hello there!"
                 "'A quoted phrase'"
                 "'a' 'b' c d"
                 "'a b' 'c d'"
                 "Mid'dle 'quotes do not concern me'"
                 "'lots    of   spacing' there"]]
        (prn (splitter x)))
("Hello" "there!")
("A quoted phrase")
("a" "b" "c" "d")
("a b" "c d")
("Mid'dle" "quotes do not concern me")
("lots    of   spacing" "there")
nil

If single quotes in the input don't match up properly, everything from the final opening single quote is taken to constitute one "word":

user> (splitter "'asdf")
("asdf")

Update: Another version in answer to edbond's comment, with better handling of quote characters inside words:

(defn splitter [s]
  ((fn step [xys]
     (lazy-seq
      (when-let [c (ffirst xys)]
        (cond
         (Character/isSpace c)
         (step (rest xys))
         (= \' c)
         (let [[w* r*]
               (split-with (fn [[x y]]
                             (or (not= \' x)
                                 (not (or (nil? y)
                                          (Character/isSpace y)))))
                           (rest xys))]
           (if (= \' (ffirst r*))
             (cons (apply str (map first w*)) (step (rest r*)))
             (cons (apply str (map first w*)) nil)))
         :else
         (let [[w r] (split-with (fn [[x y]] (not (Character/isSpace x))) xys)]
           (cons (apply str (map first w)) (step r)))))))
   (partition 2 1 (lazy-cat s [nil]))))

A test run:

user> (doseq [x ["Hello there!"
                 "'A quoted phrase'"
                 "'a' 'b' c d"
                 "'a b' 'c d'"
                 "Mid'dle 'quotes do not concern me'"
                 "'lots    of   spacing' there"
                 "Mid'dle 'quotes do no't concern me'"
                 "'asdf"]]
        (prn (splitter x)))
("Hello" "there!")
("A quoted phrase")
("a" "b" "c" "d")
("a b" "c d")
("Mid'dle" "quotes do not concern me")
("lots    of   spacing" "there")
("Mid'dle" "quotes do no't concern me")
("asdf")
nil
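
The second version works on pairs of (current character, next character), which is what lets it decide whether a quote really closes a phrase. A small sketch of that look-ahead, with the helper names `with-lookahead` and `closing-quote?` invented here for illustration (and `Character/isWhitespace` used in place of `Character/isSpace`):

```clojure
(defn with-lookahead
  "Pair each character with its successor; the last one is paired with nil."
  [s]
  (partition 2 1 (concat s [nil])))

(defn closing-quote?
  "A quote closes a phrase only when followed by whitespace or end of input."
  [[x y]]
  (and (= \' x)
       (or (nil? y) (Character/isWhitespace (char y)))))

(with-lookahead "a'b")
;; => ((\a \') (\' \b) (\b nil))

;; Without the nil padding the final character never shows up in first position:
(partition 2 1 "a'b")
;; => ((\a \') (\' \b))
```

The `nil` padding matters: `partition` with a step of 1 only emits full pairs, so an unpadded input loses its last character.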

So good... I'm not very good at Clojure's lazy sequences... should that splitter go with `recur` or so? But the execution looks very idiomatic and it saves the spacing! Excellent :) — mike3996, Dec 03 '10 at 09:30
`(splitter "Mid'dle 'quotes do no't concern me'")` returns `("Mid'dle" "quotes do no" "t" "concern" "me'")`. I would leave a quote in place if it is surrounded by two characters. — edbond, Dec 03 '10 at 16:20
@progo: Happy to hear that. :-) As for `recur`, no, lazy seqs are not to be mixed with tail recursion. See [this SO question](http://stackoverflow.com/questions/3247045) for more details (the answers might be useful as a general intro to lazy seqs, and as for lazy seqs vs. tail recursion, I tried to address that point in my answer). @edbond: Yeah, I was too lazy with that. The version edited in just now should handle that sort of cases better. — Michał Marczyk, Dec 03 '10 at 20:51
There's something going on... the last character may be eaten, in `"Hello there!"` and `"'lots of spacing' there"`. — mike3996, Dec 04 '10 at 08:24
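
On the `recur` question above, a toy contrast may help. This is a hypothetical example unrelated to string splitting; `evens-lazy` and `evens-eager` are names made up for the sketch.

```clojure
(defn evens-lazy [n]
  ;; The recursive call sits inside cons, so it is not in tail position and
  ;; recur cannot be used. lazy-seq defers the call until a consumer asks
  ;; for the next element, so the stack does not grow with the sequence.
  (lazy-seq (cons n (evens-lazy (+ n 2)))))

(defn evens-eager [n limit]
  ;; loop/recur runs in constant stack space, but builds the whole result
  ;; before returning anything.
  (loop [n n, acc []]
    (if (>= n limit)
      acc
      (recur (+ n 2) (conj acc n)))))

(take 3 (evens-lazy 0))   ; only the first three elements are computed
(evens-eager 0 6)         ; the full vector is built up front
```

Both styles avoid deep recursion; they differ in when the work happens, which is why the lazy `splitter` has no use for `recur`.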

max taldykin · Answer 2 · 2015-12-23T08:07:24.840

This solution is in Haskell, but the main idea should be applicable in Clojure as well.
The two states of the parser (inside or outside of quotes) are represented by two mutually recursive functions.

splitq = outside [] . (' ':)

add c res = if null res then [[c]] else map (++[c]) res

outside res xs = case xs of
    ' '  : ' '  : ys -> outside res $ ' ' : ys
    ' '  : '\'' : ys -> res ++ inside [] ys
    ' '  : ys        -> res ++ outside [] ys
    c    : ys        -> outside (add c res) ys
    _                -> res

inside res xs = case xs of
    ' '  : ' ' : ys -> inside res $ ' ' : ys
    '\'' : ' ' : ys -> res ++ outside [] (' ' : ys)
    '\'' : []       -> res
    c    : ys       -> inside (add c res) ys
    _               -> res
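
For comparison, the same two-state idea might be sketched in Clojure as below. This is not a line-by-line translation of the Haskell: the name `split-states` and its helpers are invented here, only `\space` counts as whitespace, spacing inside quotes is preserved rather than collapsed, and `trampoline` is used so that the mutual recursion does not grow the stack.

```clojure
(defn split-states
  "Two-state splitter: `outside` handles unquoted text, `inside` a quoted phrase."
  [s]
  (letfn [(flush-word [acc word]
            ;; append the collected characters as a string, if there are any
            (if (seq word) (conj acc (apply str word)) acc))
          (outside [acc word cs]
            (let [c (first cs)]
              (cond
                (nil? c)     (flush-word acc word)
                (= \space c) #(outside (flush-word acc word) [] (rest cs))
                ;; a quote only opens a phrase at the start of a word
                (and (= \' c) (empty? word)) #(inside acc [] (rest cs))
                :else        #(outside acc (conj word c) (rest cs)))))
          (inside [acc word cs]
            (let [c (first cs), nxt (second cs)]
              (cond
                ;; unterminated quote: everything collected is one item
                (nil? c) (conj acc (apply str word))
                ;; a quote closes the phrase only before a space or the end
                (and (= \' c) (or (nil? nxt) (= \space nxt)))
                #(outside (conj acc (apply str word)) [] (rest cs))
                :else #(inside acc (conj word c) (rest cs)))))]
    (trampoline outside [] [] s)))
```

Each state function returns either the finished vector or a zero-argument function that performs the next step, which is the shape `trampoline` expects.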

This is very much like what I have sketched! Also, very cool that it avoids the initial splitting! — mike3996, Dec 02 '10 at 16:50

score 3 · Answer 3 · answered Dec 02 '10 at 19:51

Here's a Clojure version. This probably blows the stack for very large inputs. A regex or real parser-generator would be much more concise.

(declare parse*)
(defn slurp-word [words xs terminator]
  (loop [res "" xs xs]
    (condp = (first xs)
      nil  ;; end of string after this word
      (conj words res)

      terminator ;; end of word
      (parse* (conj words res) (rest xs))

      ;; else
      (recur (str res (first xs)) (rest xs)))))

(defn parse* [words xs]
  (condp = (first xs)
    nil ;; end of string
    words

    \space  ;; skip leading spaces
    (parse* words (rest xs))

    \' ;; start quoted part
    (slurp-word words (rest xs) \')

    ;; else slurp until space
    (slurp-word words xs \space)))

(defn parse [s]
  (parse* [] s))

Your inputs:

user> (doseq [x ["Hello there!"
                 "'A quoted phrase'"
                 "'a' 'b' c d"
                 "'a b' 'c d'"
                 "Mid'dle 'quotes do not concern me'"
                 "'lots    of   spacing' there"]]
        (prn (parse x)))

["Hello" "there!"]
["A quoted phrase"]
["a" "b" "c" "d"]
["a b" "c d"]
["Mid'dle" "quotes do not concern me"]
["lots    of   spacing" "there"]
nil

It is fairly easy to make a version of this which does not blow the stack by using `trampoline`. I'm unable to edit this so I've copied yours and changed it slightly and added an example which blew the stack on my machine prior to the change. — Jake McCrary, Dec 02 '10 at 23:17

score 3 · Answer 4 · answered Dec 02 '10 at 23:19

I was able to modify Brian's version to use `trampoline` so that it does not run out of stack space. Basically, make `slurp-word` and `parse*` return functions instead of calling each other directly, then change `parse` to use `trampoline`:

(declare parse*) ;; slurp-word refers to parse* before it is defined

(defn slurp-word [words xs terminator]
  (loop [res "" xs xs]
    (condp = (first xs)
      nil  ;; end of string after this word
      (conj words res)

      terminator ;; end of word: return a thunk for trampoline to call
      #(parse* (conj words res) (rest xs))

      ;; else
      (recur (str res (first xs)) (rest xs)))))

(defn parse* [words xs]
  (condp = (first xs)
    nil ;; end of string
    words

    \space  ;; skip leading spaces (also as a thunk, so long runs of spaces are safe)
    #(parse* words (rest xs))

    \' ;; start quoted part
    #(slurp-word words (rest xs) \')

    ;; else slurp until space
    #(slurp-word words xs \space)))

(defn parse [s]
  (trampoline #(parse* [] s)))


(defn test-parse []
  (doseq [x ["Hello there!"
             "'A quoted phrase'"
             "'a' 'b' c d"
             "'a b' 'c d'"
             "Mid'dle 'quotes do not concern me'"
             "'lots    of   spacing' there"
             (apply str (repeat 30000 "'lots    of   spacing' there"))]]
    (prn (parse x))))
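
For readers new to `trampoline`, here is the mechanism in isolation. It is a toy sketch with made-up names (`my-even?`, `my-odd?`), not part of the parser: each function returns either a final value or a zero-argument function, and `trampoline` keeps calling the returned functions in a loop until it gets a non-function value.

```clojure
(declare my-odd?)

(defn my-even? [n]
  ;; return the answer, or a thunk that continues in the other function
  (if (zero? n) true #(my-odd? (dec n))))

(defn my-odd? [n]
  (if (zero? n) false #(my-even? (dec n))))

;; A million mutual "calls" without growing the stack:
(trampoline my-even? 1000000)
;; => true
```

The parser above follows the same pattern, with the vector of words as the final non-function value.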

kotarak · score 2 · Answer 5 · answered Dec 02 '10 at 14:24

There is, for example, fnparse, which lets you write a parser in a functional way.

Goran Jovic · Answer 6 · 2010-12-02T14:16:02.573

Use regex:

(defn my-split [string]
  (let [criterion " +(?=([^']*'[^']*')*[^']*$)"]
    (for [s (into [] (.split string criterion))] (.replace s "'" ""))))

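The first character in the regex is the character by which you want to split your string; here it's one or more spaces. The lookahead that follows only accepts a split point when an even number of quotes remains before the end of the string, i.e. when the spaces are outside any quoted phrase.

And if you want to change the quoting character, just change every ' to something else, like \".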

EDIT: I just saw that you explicitly mentioned you didn't want to use regex. Sorry!
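
To see what the pattern does on its own, here is a small trace. The names `outside-quotes` and `split-keep-quotes` are invented for this sketch, and `clojure.string/split` is used instead of the Java `.split` call; the pattern itself is the one from the answer.

```clojure
(require '[clojure.string :as str])

;; One or more spaces, followed by an even number of single quotes
;; before the end of the string.
(def outside-quotes #" +(?=([^']*'[^']*')*[^']*$)")

(defn split-keep-quotes
  "Hypothetical helper: split like my-split, but leave the quote characters in."
  [s]
  (str/split s outside-quotes))

(split-keep-quotes "'a b' 'c d'")
;; => ["'a b'" "'c d'"]

(split-keep-quotes "Mid'dle 'quotes do not concern me'")
;; => ["Mid'dle" "'quotes do not concern me'"]

;; The blanket replace then removes every quote, including one inside a word:
(.replace "Mid'dle" "'" "")
;; => "Middle"
```

So with the quick fix, `my-split` would return `"Middle"` for the mid-word-quote example from the question; stripping only the first and last character of each item would avoid that.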

edited Dec 02 '10 at 14:16

answered Dec 02 '10 at 13:27

That's okay. It's anyway neater than my current regex-solution :) – mike3996 Dec 02 '10 at 13:46
You don't pass the test with this. You get ["'a b'" "'c d'"] and it should be ["a b", "c d"]. – nickik Dec 02 '10 at 13:47
Indeed. I've just changed it to include a quick fix for that. – Goran Jovic Dec 02 '10 at 14:16

score 1 · Answer 7 · answered Dec 03 '10 at 08:50

Oh my, the answers given seem to beat mine, now that I've finally got the tests to pass. Anyway, I'm posting it here to ask for comments on making the code more idiomatic.

I sketched some Haskell-like pseudocode:

pl p w:ws = | if w:ws empty
               => p
            | if w begins with a quote
               => pli p w:ws
            | otherwise
               => pl (p ++ w) ws

pli p w:ws = | if w:ws empty
                => p
             | if w begins with a quote
                => pli (p ++ w) ws
             | if w ends with a quote
                => pl (init p ++ (tail p ++ w)) ws
             | otherwise
                => pli (init p ++ (tail p ++ w)) ws

Okay, the functions are badly named. To explain:

Function pl processes the words not quoted
Function pli (i as in inner) processes the quoted phrases
The parameter (list) p is the already processed (done) information
The parameter (list) w:ws is information to be processed

I have translated the pseudocode this way:

(def quote-chars '(\" \')) ;'

; rewrite .startsWith and .endsWith to support multiple choices
(defn- starts-with?
  "See if given string begins with selected characters."
  [word choices]
  (some #(.startsWith word (str %)) choices))

(defn- ends-with?
  "See if given string ends with selected characters."
  [word choices]
  (some #(.endsWith word (str %)) choices))

(declare pli)
(defn- pl [p w:ws]
    (let [w (first w:ws)
          ws (rest w:ws)]
     (cond
        (nil? w)
            p
        (starts-with? w quote-chars)
            #(pli p w:ws)
        true
            #(pl (concat p [w]) ws))))

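(defn- pli [p w:ws]
    (let [w (first w:ws)
          ws (rest w:ws)]
     (cond
        (nil? w)
            p
        ; a single word that both opens and closes a phrase, e.g. 'a':
        ; hand it back to pl, otherwise the following words get glued onto it
        (and (> (count w) 1)
             (starts-with? w quote-chars)
             (ends-with? w quote-chars))
            #(pl (concat p [w]) ws)
        (starts-with? w quote-chars)
            #(pli (concat p [w]) ws)
        (ends-with? w quote-chars)
            #(pl (concat
                  (drop-last p)
                  [(str (last p) " " w)])
                ws)
        true
            #(pli (concat
                  (drop-last p)
                  [(str (last p) " " w)])
                ws))))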

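(defn split-line
    "Split a line by spaces, leave quoted groups intact."
    [input]
    ; strip-input is not shown in this answer; it is the helper that
    ; strips the surrounding quotes from each item.
    (let [splt (.split input " +")]
        (map strip-input
            (trampoline pl [] splt))))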

The details are not very Clojuresque. I also depend on regexps for splitting and for stripping the quotes, so I probably deserve some downvotes for that.
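
The `strip-input` helper used by `split-line` is not included in the answer. One possible stand-in is sketched below; it is an assumption, not the author's code, and unlike the original (which reportedly used a regexp) it just drops one leading and one trailing quote character.

```clojure
;; Same quote characters as in the answer, repeated so the sketch stands alone.
(def quote-chars [\" \'])

(defn strip-input
  "Hypothetical helper: drop one leading and one trailing quote character, if present."
  [word]
  (let [quote-char? (set quote-chars)
        word        (if (quote-char? (first word)) (subs word 1) word)]
    (if (quote-char? (last word))
      (subs word 0 (dec (count word)))
      word)))

(strip-input "'a b'")   ; => "a b"
(strip-input "Mid'dle") ; => "Mid'dle"
```

Quotes in the middle of a word are untouched, which matches the `Mid'dle` example from the question.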
