
Is there a shlex alternative for Java? I'd like to split quote-delimited strings the way the shell would process them. For example, if I pass in:

one two "three four"
and perform a split, I'd like to receive the tokens:
one
two
three four
bukzor
Geo
  • Notably -- "like the shell would process them" is a fairly hard task; `shlex` does it well, but many naive algorithms won't. For instance, in shell, `"three four"` and `"three"' 'four` are exactly equivalent, as is `three\ four`. – Charles Duffy Feb 05 '13 at 17:09
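
To make the point in the comment above concrete, here is a minimal Java sketch of one common naive approach: a regex that matches either a double-quoted run or a run of non-whitespace. The class name `NaiveSplitDemo`, the method `naiveSplit`, and the regex itself are assumptions made up for illustration; none of them come from shlex or from any answer here. It gets the question's example right but does not treat the equivalent shell spellings as a single token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveSplitDemo {
    // Either a double-quoted run (group 1) or a run of non-whitespace (group 2).
    private static final Pattern TOKEN = Pattern.compile("\"([^\"]*)\"|(\\S+)");

    // Hypothetical naive splitter: no single quotes, no escapes, no concatenation.
    static List<String> naiveSplit(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group(1) != null ? m.group(1) : m.group(2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The question's example works: [one, two, three four]
        System.out.println(naiveSplit("one two \"three four\""));
        // A shell sees one token "three four"; this prints [three, ', 'four]
        System.out.println(naiveSplit("\"three\"' 'four"));
        // A shell sees one token "three four"; this prints [three\, four]
        System.out.println(naiveSplit("three\\ four"));
    }
}
```

A shell-like splitter has to track quote state and escapes character by character, which a single regex pass like this does not do.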

3 Answers


I had a similar problem today, and none of the standard options (StringTokenizer, StrTokenizer, Scanner) looked like a good fit. However, it's not hard to implement the basics.
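
As a quick illustration of why `java.util.StringTokenizer` is a poor fit, this minimal sketch runs it over the question's example with its default whitespace delimiters. The class name `StringTokenizerDemo` and the `tokenize` helper are made up for illustration. StringTokenizer has no notion of quoting, so the quoted phrase is split in two and the quote characters stay in the tokens.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class StringTokenizerDemo {
    // Collects the tokens produced with the default delimiters (whitespace).
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(input);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints [one, two, "three, four"]: four tokens, quotes left in place.
        System.out.println(tokenize("one two \"three four\""));
    }
}
```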

This example handles all the edge cases raised so far in comments on the other answers. Be warned, I haven't checked it for full POSIX compliance yet. A Gist including unit tests is available on GitHub, released into the public domain via the Unlicense.

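import java.util.ArrayList;
import java.util.List;

public List<String> shellSplit(CharSequence string) {
    List<String> tokens = new ArrayList<String>();
    // True when the previous character was a backslash outside single quotes.
    boolean escaping = false;
    // The quote character that opened the current quoted run.
    char quoteChar = ' ';
    boolean quoting = false;
    // Index of the most recent closing quote; lets '' or "" yield an empty token.
    int lastCloseQuoteIndex = Integer.MIN_VALUE;
    StringBuilder current = new StringBuilder();
    for (int i = 0; i<string.length(); i++) {
        char c = string.charAt(i);
        if (escaping) {
            // Escaped character: take it literally.
            current.append(c);
            escaping = false;
        } else if (c == '\\' && !(quoting && quoteChar == '\'')) {
            // Backslash escapes the next character, except inside single quotes.
            escaping = true;
        } else if (quoting && c == quoteChar) {
            // Closing quote: stay in the same token so adjacent pieces concatenate.
            quoting = false;
            lastCloseQuoteIndex = i;
        } else if (!quoting && (c == '\'' || c == '"')) {
            // Opening quote.
            quoting = true;
            quoteChar = c;
        } else if (!quoting && Character.isWhitespace(c)) {
            // Unquoted whitespace ends the current token, if there is one.
            if (current.length() > 0 || lastCloseQuoteIndex == (i - 1)) {
                tokens.add(current.toString());
                current = new StringBuilder();
            }
        } else {
            current.append(c);
        }
    }
    // Flush the final token, including an empty one that ended with a closing quote.
    if (current.length() > 0 || lastCloseQuoteIndex == (string.length() - 1)) {
        tokens.add(current.toString());
    }

    return tokens;
}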
Ray Myers
  • Would you consider attaching a license to this (or explicitly donating it to the public domain)? – Charles Duffy Mar 05 '14 at 22:27
  • Ah, there it is, last line of this page: user contributions licensed under cc by-sa 3.0 with attribution required – bukzor Mar 10 '14 at 17:58
  • @RayMyers: We still need to know whether this is your own work, otherwise the license is unknown. Also, the CC-BY-SA license isn't completely compatible with Hadoop's Apache license ([I would need to use it unmodified](http://www.apache.org/legal/resolved.html#cc-sa)). If you'd dedicate this code under [the Unlicense](http://unlicense.org/) these problems go away, otherwise I'll have to write similar from scratch. ...I wish SO would change their default license. – bukzor Mar 10 '14 at 19:05
  • bukzor and others: Thanks for pointing this out. Yes, it is my work. I've updated it to be explicitly public domain. – Ray Myers Mar 12 '14 at 20:02
  • @RayMyers: While this is good enough for me (thanks!), you should know that the expert advice I've often seen is that the 'public domain' is a legal concept on a very shaky foundation (e.g. it doesn't even exist outside the US), and any work without a license (including those "released to the public domain") is best considered to have [NoLicense](http://choosealicense.com/licenses/no-license/). The license closest to what you are trying to do is [the Unlicense](http://unlicense.org/). – bukzor Mar 12 '14 at 21:32
  • While it surprises me, this appears to be the best code Java has to offer for this problem. Enjoy your bounty :) – bukzor Mar 12 '14 at 21:32
  • Beware: this code improperly handles quoted empty strings. For example, the input `"''"` will get parsed to an empty list rather than a list containing `""`. – j3h Aug 29 '19 at 17:59
  • @j3h: Good catch. Updated and added unit tests in the Gist. – Ray Myers Sep 09 '19 at 20:38
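
The edge cases raised in the comments on this page can be collected into a small check that works against any splitter. This is a minimal sketch, not the unit tests from the Gist: the class `SplitterCases` and its methods are hypothetical names, and the expected tokens are the ones the commenters describe as the shell's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class SplitterCases {
    // Input -> expected tokens, taken from the cases mentioned in this thread.
    static Map<String, List<String>> cases() {
        Map<String, List<String>> c = new LinkedHashMap<>();
        c.put("one two \"three four\"", Arrays.asList("one", "two", "three four"));
        c.put("\"three four\"", Collections.singletonList("three four"));
        c.put("\"three\"' 'four", Collections.singletonList("three four"));
        c.put("three\\ four", Collections.singletonList("three four"));
        c.put("\"foo\"'bar'baz", Collections.singletonList("foobarbaz"));
        c.put("''", Collections.singletonList(""));
        return c;
    }

    // Returns the inputs for which the splitter's output differs from the expectation.
    static List<String> failingInputs(Function<String, List<String>> splitter) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : cases().entrySet()) {
            if (!e.getValue().equals(splitter.apply(e.getKey()))) {
                failures.add(e.getKey());
            }
        }
        return failures;
    }
}
```

Assuming the `shellSplit` method from this answer is in scope, `SplitterCases.failingInputs(s -> shellSplit(s))` should come back empty. A plain whitespace split such as `s -> Arrays.asList(s.trim().split("\\s+"))` fails all six cases.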

Look at Apache Commons Lang:

org.apache.commons.lang.text.StrTokenizer should be able to do what you want:

new StrTokenizer("one two \"three four\"", ' ', '"').getTokenArray();
ChssPly76
  • Unfortunately, unlike `shlex`, commons.lang is not POSIX compatible. `(-> (StrTokenizer. "\"foo\"'bar'baz") (.getTokenList))` returns a single entry containing `"foo"'bar'baz`, as opposed to the (correct) `foobarbaz`. – Charles Duffy Feb 05 '13 at 17:02
  • @CharlesDuffy do you know the true answer? – bukzor Mar 05 '14 at 21:41
  • @bukzor, that presumes that there *is* one. To my knowledge, such a tool has not been written at this time, short of using Python's shlex from Java via Jython (possible, but rather a large dependency chain to pull in). – Charles Duffy Mar 05 '14 at 22:25
  • ...though the answer from @RayMyers looks like a possible candidate. – Charles Duffy Mar 05 '14 at 22:26

I had success with the following Scala code, which uses fastparse. I can't vouch for its completeness:

val kvParser = {
  import fastparse._
  import NoWhitespace._
  def nonQuoteChar[_:P] = P(CharPred(_ != '"'))
  def quotedQuote[_:P] = P("\\\"")
  def quotedElement[_:P] = P(nonQuoteChar | quotedQuote)
  def quotedContent[_:P] = P(quotedElement.rep)
  def quotedString[_:P] = P("\"" ~/ quotedContent.! ~ "\"")
  def alpha[_:P] = P(CharIn("a-zA-Z"))
  def digit[_:P] = P(CharIn("0-9"))
  def hyphen[_:P] = P("-")
  def underscore[_:P] = P("_")
  def bareStringChar[_:P] = P(alpha | digit | hyphen | underscore)
  def bareString[_:P] = P(bareStringChar.rep.!)
  def string[_:P] = P(quotedString | bareString)
  def kvPair[_:P] = P(string ~ "=" ~ string)
  def commaAndSpace[_:P] = P(CharIn(" \t\n\r").rep ~ "," ~ CharIn(" \t\n\r").rep)
  def kvPairList[_:P] = P(kvPair.rep(sep = commaAndSpace))
  def fullLang[_:P] = P(kvPairList ~ End)

  def res(str: String) = {
    parse(str, fullLang(_))
  }

  res _
}
Owen