shlex alternative for Java

Question

Is there a shlex alternative for Java? I'd like to be able to split quote-delimited strings the way the shell would process them. For example, if I pass in:

one two "three four"

and perform a split, I'd like to receive the tokens:

one
two
three four

Notably -- "like the shell would process them" is a fairly hard task; `shlex` does it well, but many naive algorithms won't. For instance, in shell, `"three four"` and `"three"' 'four` are exactly equivalent, as is `three\ four`. — Charles Duffy, Feb 05 '13 at 17:09

Ray Myers · Answer 1 · 2019-09-09T20:37:32.733

10

I had a similar problem today, and none of the standard options, such as StringTokenizer, StrTokenizer, or Scanner, looked like a good fit. However, it's not hard to implement the basics.

This example handles all the edge cases raised so far in comments on the other answers. Be warned: I haven't checked it for full POSIX compliance yet. A Gist including unit tests is available on GitHub, released into the public domain via the Unlicense.

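// Requires: import java.util.ArrayList; import java.util.List;
public List<String> shellSplit(CharSequence string) {
    List<String> tokens = new ArrayList<String>();
    boolean escaping = false;  // true when the previous character was an active backslash
    char quoteChar = ' ';      // the quote character that opened the current quoted section
    boolean quoting = false;   // true while inside '...' or "..."
    int lastCloseQuoteIndex = Integer.MIN_VALUE;  // lets '' and "" yield an empty token
    StringBuilder current = new StringBuilder();
    for (int i = 0; i<string.length(); i++) {
        char c = string.charAt(i);
        if (escaping) {
            // The character after a backslash is taken literally.
            current.append(c);
            escaping = false;
        } else if (c == '\\' && !(quoting && quoteChar == '\'')) {
            // A backslash starts an escape everywhere except inside single quotes.
            escaping = true;
        } else if (quoting && c == quoteChar) {
            // Closing quote: remember where it was so an empty quoted string still counts.
            quoting = false;
            lastCloseQuoteIndex = i;
        } else if (!quoting && (c == '\'' || c == '"')) {
            // Opening quote: the token continues, so adjacent quoted parts are joined.
            quoting = true;
            quoteChar = c;
        } else if (!quoting && Character.isWhitespace(c)) {
            // Unquoted whitespace ends the current token, if there is one.
            if (current.length() > 0 || lastCloseQuoteIndex == (i - 1)) {
                tokens.add(current.toString());
                current = new StringBuilder();
            }
        } else {
            current.append(c);
        }
    }
    // Flush the final token (including a trailing empty quoted string).
    if (current.length() > 0 || lastCloseQuoteIndex == (string.length() - 1)) {
        tokens.add(current.toString());
    }

    return tokens;
}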
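
For anyone who wants to try it quickly, here is a minimal, self-contained sketch that wraps the method above in a class and runs it against the inputs mentioned in this thread. The class name `ShellSplitDemo`, the `static` modifier, and the `main` method are assumptions added for illustration; they are not part of the Gist. The method body is unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical wrapper class for illustration; only shellSplit comes from the answer.
class ShellSplitDemo {

    static List<String> shellSplit(CharSequence string) {
        List<String> tokens = new ArrayList<String>();
        boolean escaping = false;
        char quoteChar = ' ';
        boolean quoting = false;
        int lastCloseQuoteIndex = Integer.MIN_VALUE;
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < string.length(); i++) {
            char c = string.charAt(i);
            if (escaping) {
                current.append(c);
                escaping = false;
            } else if (c == '\\' && !(quoting && quoteChar == '\'')) {
                escaping = true;
            } else if (quoting && c == quoteChar) {
                quoting = false;
                lastCloseQuoteIndex = i;
            } else if (!quoting && (c == '\'' || c == '"')) {
                quoting = true;
                quoteChar = c;
            } else if (!quoting && Character.isWhitespace(c)) {
                if (current.length() > 0 || lastCloseQuoteIndex == (i - 1)) {
                    tokens.add(current.toString());
                    current = new StringBuilder();
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0 || lastCloseQuoteIndex == (string.length() - 1)) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String[] inputs = {
            "one two \"three four\"",  // the example from the question
            "\"three\"' 'four",        // adjacent quoted parts join into one token
            "three\\ four",            // backslash-escaped space
            "\"foo\"'bar'baz",         // the case StrTokenizer gets wrong
            "''"                       // quoted empty string
        };
        for (String input : inputs) {
            List<String> tokens = shellSplit(input);
            // Print the size too, since a list holding one empty string prints as [].
            System.out.println(input + " -> " + tokens.size() + " token(s): " + tokens);
        }
    }
}
```

The second and third inputs should each come back as the single token `three four`, matching the equivalence pointed out in the comment on the question, and `''` should come back as one empty token rather than an empty list.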

edited Sep 09 '19 at 20:37

answered Dec 22 '13 at 00:44

Ray Myers


Would you consider attaching a license to this (or explicitly donating it to the public domain)? – Charles Duffy Mar 05 '14 at 22:27
Ah, there it is, last line of this page: user contributions licensed under cc by-sa 3.0 with attribution required – bukzor Mar 10 '14 at 17:58
@RayMyers: We still need to know whether this is your own work, otherwise the license is unknown. Also, the CC-BY-SA license isn't completely compatible with Hadoop's Apache license ([I would need to use it unmodified](http://www.apache.org/legal/resolved.html#cc-sa)). If you'd dedicate this code under [the Unlicense](http://unlicense.org/) these problems go away, otherwise I'll have to write similar from scratch. ...I wish SO would change their default license. – bukzor Mar 10 '14 at 19:05
bukzor and others: Thanks for pointing this out. Yes, it is my work. I've updated it to be explicitly public domain. – Ray Myers Mar 12 '14 at 20:02
@RayMyers: While this is good enough for me (thanks!), you should know that the expert advice I've often seen is that the 'public domain' is a legal concept on very shaky foundation (eg it doesn't even exist outside the US), and any work without a license (including those "released to the public domain") are best considered to have [NoLicense](http://choosealicense.com/licenses/no-license/). The license closest to what you are trying to do is [the Unlicense](http://unlicense.org/). – bukzor Mar 12 '14 at 21:32
While it surprises me, this appears to be the best code Java has to offer for this problem. Enjoy your bounty :) – bukzor Mar 12 '14 at 21:32
Beware: this code improperly handles quoted empty strings. e.g. the input `"''"` will get parsed to an empty list rather than a list containing `""`. – j3h Aug 29 '19 at 17:59
@j3h: Good catch. Updated and added unit tests in the Gist. – Ray Myers Sep 09 '19 at 20:38

score 6 · Answer 2 · answered Jul 04 '09 at 23:11

Look at Apache Commons Lang:

org.apache.commons.lang.text.StrTokenizer should be able to do what you want:

new StrTokenizer("one two \"three four\"", ' ', '"').getTokenArray();

ChssPly76


Unfortunately, unlike `shlex`, commons.lang is not POSIX compatible. `(-> (StrTokenizer. "\"foo\"'bar'baz") (.getTokenList))` returns a single entry containing `"foo"'bar'baz`, as opposed to the (correct) `foobarbaz`. – Charles Duffy Feb 05 '13 at 17:02
@CharlesDuffy do you know the true answer? – bukzor Mar 05 '14 at 21:41
@bukzor, that presumes that there *is* one. To my knowledge, such a tool has not been written at this time, short of using Python's shlex from Java via Jython (possible, but rather a large dependency chain to pull in). – Charles Duffy Mar 05 '14 at 22:25
...though the answer from @RayMyers looks like a possible candidate. – Charles Duffy Mar 05 '14 at 22:26

score 0 · Answer 3 · answered Jul 19 '20 at 22:41

I had success with the following Scala code, which uses fastparse. I can't vouch for it being complete:

val kvParser = {
  import fastparse._
  import NoWhitespace._
  def nonQuoteChar[_:P] = P(CharPred(_ != '"'))
  def quotedQuote[_:P] = P("\\\"")
  def quotedElement[_:P] = P(nonQuoteChar | quotedQuote)
  def quotedContent[_:P] = P(quotedElement.rep)
  def quotedString[_:P] = P("\"" ~/ quotedContent.! ~ "\"")
  def alpha[_:P] = P(CharIn("a-zA-Z"))
  def digit[_:P] = P(CharIn("0-9"))
  def hyphen[_:P] = P("-")
  def underscore[_:P] = P("_")
  def bareStringChar[_:P] = P(alpha | digit | hyphen | underscore)
  def bareString[_:P] = P(bareStringChar.rep.!)
  def string[_:P] = P(quotedString | bareString)
  def kvPair[_:P] = P(string ~ "=" ~ string)
  def commaAndSpace[_:P] = P(CharIn(" \t\n\r").rep ~ "," ~ CharIn(" \t\n\r").rep)
  def kvPairList[_:P] = P(kvPair.rep(sep = commaAndSpace))
  def fullLang[_:P] = P(kvPairList ~ End)

  def res(str: String) = {
    parse(str, fullLang(_))
  }

  res _
}
