104

Is there a standard (preferably Apache Commons or similarly non-viral) library for doing "glob" type matches in Java? When I had to do something similar in Perl once, I just changed every "." to "\.", every "*" to ".*", and every "?" to ".", and that sort of thing, but I'm wondering if somebody has done the work for me.

Similar question: Create regex from glob expression

Paul Tomblin
  • Could you give a precise example of what you want to do? – Thorbjørn Ravn Andersen Aug 08 '09 at 08:58
  • What I want to do (or rather what my client wants to do) is match things like "*-2009/" or "*rss*" in urls. Mostly it's pretty trivial to convert to regex, but I wondered if there was an easier way. – Paul Tomblin Aug 08 '09 at 10:50
  • I recommend Ant-style file globbing, as it seems to have become the canonical globbing in the Java world. See my answer for more details: http://stackoverflow.com/questions/1247772/is-there-an-equivalent-of-java-util-regex-for-glob-type-patterns/4038104#4038104 . – Adam Gent Oct 27 '10 at 22:08
  • Related: http://stackoverflow.com/questions/794381/how-to-find-files-that-match-a-wildcard-string-in-java – Brad Mace Sep 03 '12 at 21:07
  • 1
    @BradMace, related but most of the answers there assume you're traversing a directory tree. Still, if anybody is still looking for how to do glob style matching of arbitrary strings, they should probably look in that answer as well. – Paul Tomblin Sep 03 '12 at 21:16
  • [GlobCompiler](http://jakarta.apache.org/oro/api/org/apache/oro/text/GlobCompiler.html)/[GlobEngine](http://jakarta.apache.org/oro/api/org/apache/oro/text/GlobEngine.html), from [Jakarta ORO](http://jakarta.apache.org/oro/), looks promising. It's available under the Apache License. – Steve Trout Aug 08 '09 at 02:25

13 Answers

73

Globbing is also implemented in Java 7.

See FileSystem.getPathMatcher(String) and the "Finding Files" tutorial.
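
As a minimal sketch of that API (the class name `PathMatcherDemo`, the helper `globMatches`, and the sample file names are made up for illustration), a glob can be turned into a `PathMatcher` with the `glob:` syntax prefix and applied to `Path` objects:

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class PathMatcherDemo {
    // Hypothetical helper: true if fileName matches the glob,
    // using the default file system's glob matcher.
    public static boolean globMatches(String glob, String fileName) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(fileName));
    }

    public static void main(String[] args) {
        System.out.println(globMatches("*.java", "Foo.java"));
        System.out.println(globMatches("*.java", "Foo.txt"));
        System.out.println(globMatches("*.{java,class}", "Foo.class"));
    }
}
```

Note that the matcher works on `Path` objects, so the text being tested must be a valid path on the current file system.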

finnw
  • 47,861
  • 24
  • 143
  • 221
  • 26
    Marvelous. But why on earth this implementation is limited to "Path" objects ?!? In my case, I want to match URI... – Yves Martin Jan 16 '13 at 10:05
  • 3
    Peering at the source of sun.nio, the glob matching appears to be implemented by [Globs.java](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7-b147/sun/nio/fs/Globs.java). Unfortunately, this is written specifically for filesystem paths, so it can't be used for all strings (it makes some assumptions about path separators and illegal characters). But it may be a helpful starting point. – Neil Traft Mar 19 '13 at 12:49
55

There's nothing built-in, but it's pretty simple to convert something glob-like to a regex:

public static String createRegexFromGlob(String glob)
{
    // StringBuilder avoids re-allocating the string on every append
    StringBuilder out = new StringBuilder("^");
    for(int i = 0; i < glob.length(); ++i)
    {
        final char c = glob.charAt(i);
        switch(c)
        {
        case '*': out.append(".*"); break;
        case '?': out.append('.'); break;
        case '.': out.append("\\."); break;
        case '\\': out.append("\\\\"); break;
        default: out.append(c);
        }
    }
    out.append('$');
    return out.toString();
}

This works for me, but I'm not sure if it covers the glob "standard", if there is one :)
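
To apply this idea to the URL patterns mentioned in the question's comments, the converted pattern can be compiled once and reused. The sketch below is self-contained, so it re-implements the conversion in a compact form. Unlike the snippet above, it backslash-escapes every other non-alphanumeric character; that is an addition made here so that inputs such as `(` do not produce an invalid regex. The class name `GlobUrlDemo`, the method `compileGlob`, and the sample URLs are made up.

```java
import java.util.regex.Pattern;

public class GlobUrlDemo {
    // Compact glob-to-regex conversion: '*' -> ".*", '?' -> ".".
    // Assumption (not in the original snippet): every other character that is
    // not a letter or digit is backslash-escaped so it is matched literally.
    public static Pattern compileGlob(String glob) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glob.length(); i++) {
            char c = glob.charAt(i);
            if (c == '*') {
                out.append(".*");
            } else if (c == '?') {
                out.append('.');
            } else if (Character.isLetterOrDigit(c) || Character.isSurrogate(c)) {
                // letters, digits and surrogate halves are copied unchanged
                out.append(c);
            } else {
                out.append('\\').append(c);
            }
        }
        return Pattern.compile(out.toString());
    }

    public static void main(String[] args) {
        Pattern rss = compileGlob("*rss*");
        Pattern year = compileGlob("*-2009/");
        // matches() tests the whole input, so no ^ and $ anchors are needed
        System.out.println(rss.matcher("http://example.com/rss/feed.xml").matches());
        System.out.println(year.matcher("http://example.com/archive-2009/").matches());
        System.out.println(year.matcher("http://example.com/archive-2010/").matches());
    }
}
```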

Update by Paul Tomblin: I found a Perl program that does glob conversion, and adapting it to Java I ended up with:

private String convertGlobToRegEx(String line)
{
    LOG.info("got line [" + line + "]");
    line = line.trim();
    int strLen = line.length();
    StringBuilder sb = new StringBuilder(strLen);
    // Remove beginning and ending * globs because they're useless
    if (line.startsWith("*"))
    {
        line = line.substring(1);
        strLen--;
    }
    if (line.endsWith("*"))
    {
        line = line.substring(0, strLen-1);
        strLen--;
    }
    boolean escaping = false;
    int inCurlies = 0;
    for (char currentChar : line.toCharArray())
    {
        switch (currentChar)
        {
        case '*':
            if (escaping)
                sb.append("\\*");
            else
                sb.append(".*");
            escaping = false;
            break;
        case '?':
            if (escaping)
                sb.append("\\?");
            else
                sb.append('.');
            escaping = false;
            break;
        case '.':
        case '(':
        case ')':
        case '+':
        case '|':
        case '^':
        case '$':
        case '@':
        case '%':
            sb.append('\\');
            sb.append(currentChar);
            escaping = false;
            break;
        case '\\':
            if (escaping)
            {
                sb.append("\\\\");
                escaping = false;
            }
            else
                escaping = true;
            break;
        case '{':
            if (escaping)
            {
                sb.append("\\{");
            }
            else
            {
                sb.append('(');
                inCurlies++;
            }
            escaping = false;
            break;
        case '}':
            if (inCurlies > 0 && !escaping)
            {
                sb.append(')');
                inCurlies--;
            }
            else if (escaping)
                sb.append("\\}");
            else
                sb.append("}");
            escaping = false;
            break;
        case ',':
            if (inCurlies > 0 && !escaping)
            {
                sb.append('|');
            }
            else if (escaping)
                sb.append("\\,");
            else
                sb.append(",");
            // reset the escape flag, as every other case does
            escaping = false;
            break;
        default:
            escaping = false;
            sb.append(currentChar);
        }
    }
    return sb.toString();
}

I'm editing into this answer rather than making my own because this answer put me on the right track.

Paul Tomblin
Dave Ray
  • 1
    Yeah, that's pretty much the solution I came up with the last time I had to do this (in Perl) but I was wondering if there was something more elegant. I think I'm going to do it your way. – Paul Tomblin Aug 08 '09 at 14:34
  • 1
    Actually, I found a better implementation in Perl that I can adapt into Java at http://kobesearch.cpan.org/htdocs/Text-Glob/Text/Glob.pm.html – Paul Tomblin Aug 08 '09 at 20:56
  • Couldn't you use a regex replace to turn a glob into a regex? – Tim Sylvester Aug 09 '09 at 01:10
  • 1
    The lines at the top that strip out the leading and trailing '*' need to be removed for Java, since String.matches matches against the whole string only – KitsuneYMG Aug 12 '09 at 13:49
  • 10
    FYI: The standard for 'globbing' is the POSIX Shell language - http://www.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_13_01 – Stephen C Nov 13 '09 at 10:11
  • Stephen C: Thanks for the tip. – Dave Ray Nov 13 '09 at 14:15
  • I think the first snippet of code has a problem if it is passed a glob with mismatched parentheses, e.g. `(*`. I believe `(` is non-special in a glob, and it will get converted to `(.*`, which is not a valid regex. – Simon Nickerson Apr 26 '10 at 10:01
  • How do I refer to the glob in my String? For example, if I'm checking if referrer equals ``"www.google.com" + *``; – Martin Erlic May 27 '16 at 09:52
38

Thanks to everyone here for their contributions. I wrote a more comprehensive conversion than any of the previous answers:

/**
 * Converts a standard POSIX Shell globbing pattern into a regular expression
 * pattern. The result can be used with the standard {@link java.util.regex} API to
 * recognize strings which match the glob pattern.
 * <p/>
 * See also, the POSIX Shell language:
 * http://pubs.opengroup.org/onlinepubs/009695399/utilities/xcu_chap02.html#tag_02_13_01
 * 
 * @param pattern A glob pattern.
 * @return A regex pattern to recognize the given glob pattern.
 */
public static final String convertGlobToRegex(String pattern) {
    StringBuilder sb = new StringBuilder(pattern.length());
    int inGroup = 0;
    int inClass = 0;
    int firstIndexInClass = -1;
    char[] arr = pattern.toCharArray();
    for (int i = 0; i < arr.length; i++) {
        char ch = arr[i];
        switch (ch) {
            case '\\':
                if (++i >= arr.length) {
                    sb.append('\\');
                } else {
                    char next = arr[i];
                    switch (next) {
                        case ',':
                            // escape not needed
                            break;
                        case 'Q':
                        case 'E':
                            // extra escape needed
                            sb.append('\\');
                            // intentional fall-through to add the second backslash
                        default:
                            sb.append('\\');
                    }
                    sb.append(next);
                }
                break;
            case '*':
                if (inClass == 0)
                    sb.append(".*");
                else
                    sb.append('*');
                break;
            case '?':
                if (inClass == 0)
                    sb.append('.');
                else
                    sb.append('?');
                break;
            case '[':
                inClass++;
                firstIndexInClass = i+1;
                sb.append('[');
                break;
            case ']':
                inClass--;
                sb.append(']');
                break;
            case '.':
            case '(':
            case ')':
            case '+':
            case '|':
            case '^':
            case '$':
            case '@':
            case '%':
                if (inClass == 0 || (firstIndexInClass == i && ch == '^'))
                    sb.append('\\');
                sb.append(ch);
                break;
            case '!':
                if (firstIndexInClass == i)
                    sb.append('^');
                else
                    sb.append('!');
                break;
            case '{':
                inGroup++;
                sb.append('(');
                break;
            case '}':
                inGroup--;
                sb.append(')');
                break;
            case ',':
                if (inGroup > 0)
                    sb.append('|');
                else
                    sb.append(',');
                break;
            default:
                sb.append(ch);
        }
    }
    return sb.toString();
}

And the unit tests to prove it works:

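// Assumes JUnit 4; the original snippet did not show its imports
import static org.junit.Assert.assertEquals;

import org.junit.Test;

/**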
 * @author Neil Traft
 */
public class StringUtils_ConvertGlobToRegex_Test {

    @Test
    public void star_becomes_dot_star() throws Exception {
        assertEquals("gl.*b", StringUtils.convertGlobToRegex("gl*b"));
    }

    @Test
    public void escaped_star_is_unchanged() throws Exception {
        assertEquals("gl\\*b", StringUtils.convertGlobToRegex("gl\\*b"));
    }

    @Test
    public void question_mark_becomes_dot() throws Exception {
        assertEquals("gl.b", StringUtils.convertGlobToRegex("gl?b"));
    }

    @Test
    public void escaped_question_mark_is_unchanged() throws Exception {
        assertEquals("gl\\?b", StringUtils.convertGlobToRegex("gl\\?b"));
    }

    @Test
    public void character_classes_dont_need_conversion() throws Exception {
        assertEquals("gl[-o]b", StringUtils.convertGlobToRegex("gl[-o]b"));
    }

    @Test
    public void escaped_classes_are_unchanged() throws Exception {
        assertEquals("gl\\[-o\\]b", StringUtils.convertGlobToRegex("gl\\[-o\\]b"));
    }

    @Test
    public void negation_in_character_classes() throws Exception {
        assertEquals("gl[^a-n!p-z]b", StringUtils.convertGlobToRegex("gl[!a-n!p-z]b"));
    }

    @Test
    public void nested_negation_in_character_classes() throws Exception {
        assertEquals("gl[[^a-n]!p-z]b", StringUtils.convertGlobToRegex("gl[[!a-n]!p-z]b"));
    }

    @Test
    public void escape_carat_if_it_is_the_first_char_in_a_character_class() throws Exception {
        assertEquals("gl[\\^o]b", StringUtils.convertGlobToRegex("gl[^o]b"));
    }

    @Test
    public void metachars_are_escaped() throws Exception {
        assertEquals("gl..*\\.\\(\\)\\+\\|\\^\\$\\@\\%b", StringUtils.convertGlobToRegex("gl?*.()+|^$@%b"));
    }

    @Test
    public void metachars_in_character_classes_dont_need_escaping() throws Exception {
        assertEquals("gl[?*.()+|^$@%]b", StringUtils.convertGlobToRegex("gl[?*.()+|^$@%]b"));
    }

    @Test
    public void escaped_backslash_is_unchanged() throws Exception {
        assertEquals("gl\\\\b", StringUtils.convertGlobToRegex("gl\\\\b"));
    }

    @Test
    public void slashQ_and_slashE_are_escaped() throws Exception {
        assertEquals("\\\\Qglob\\\\E", StringUtils.convertGlobToRegex("\\Qglob\\E"));
    }

    @Test
    public void braces_are_turned_into_groups() throws Exception {
        assertEquals("(glob|regex)", StringUtils.convertGlobToRegex("{glob,regex}"));
    }

    @Test
    public void escaped_braces_are_unchanged() throws Exception {
        assertEquals("\\{glob\\}", StringUtils.convertGlobToRegex("\\{glob\\}"));
    }

    @Test
    public void commas_dont_need_escaping() throws Exception {
        assertEquals("(glob,regex),", StringUtils.convertGlobToRegex("{glob\\,regex},"));
    }

}
Neil Traft
12

There are a couple of libraries that do glob-like pattern matching and are more modern than the ones listed:

There's Ant's DirectoryScanner and Spring's AntPathMatcher.

I recommend both over the other solutions since Ant Style Globbing has pretty much become the standard glob syntax in the Java world (Hudson, Spring, Ant and I think Maven).

Adam Gent
  • 3
    Here are the Maven coordinates for the artifact with AntPathMatcher: https://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.springframework%22%20AND%20a%3A%22spring-core%22 And some tests with sample usage: https://github.com/spring-projects/spring-framework/blob/master/spring-core/src/test/java/org/springframework/util/AntPathMatcherTests.java – seanf Apr 11 '16 at 08:01
  • And you can customise the "path" character... so it's useful for things other than paths... – Michael Wiles Sep 21 '16 at 08:12
8

I recently had to do it and used \Q and \E to escape the glob pattern:

private static Pattern getPatternFromGlob(String glob) {
  return Pattern.compile(
    "^" + Pattern.quote(glob)
            .replace("*", "\\E.*\\Q")
            .replace("?", "\\E.\\Q") 
    + "$");
}
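
A short usage sketch of this approach follows (the class name `QuoteGlobDemo` and the sample strings are illustrative only). As the comments on this answer point out, it assumes that `Pattern.quote` wraps its argument in `\Q...\E`, which the API documentation does not promise.

```java
import java.util.regex.Pattern;

public class QuoteGlobDemo {
    // Same technique as above: quote the whole glob, then close and
    // re-open the quoting around each wildcard.
    public static Pattern getPatternFromGlob(String glob) {
        return Pattern.compile(
            "^" + Pattern.quote(glob)
                    .replace("*", "\\E.*\\Q")
                    .replace("?", "\\E.\\Q")
            + "$");
    }

    public static void main(String[] args) {
        // Parentheses and dots in the glob stay literal; only ? and * are wildcards
        Pattern p = getPatternFromGlob("report-?.(final).*");
        System.out.println(p.matcher("report-1.(final).pdf").matches());
        System.out.println(p.matcher("report-12.(final).pdf").matches());
        System.out.println(p.matcher("report-1.final.pdf").matches());
    }
}
```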
dimo414
Vincent Robert
  • 4
    Won't this break if there's a \E somewhere in the string? – jmo Jan 12 '11 at 15:29
  • @jmo, yes, but you can circumvent that by pre-processing the `glob` variable with glob = Pattern.quote(glob), which I believe handles such edge cases. In that case, though, you don't need to prepend and append the first and last \\Q and \\E. – Kimball Robinson Apr 20 '16 at 21:30
  • 2
    @jmo I've fixed the example to use Pattern.quote(). – dimo414 Dec 29 '16 at 23:24
  • In a glob a negative character class uses `!` instead of `^` as the first character after the `[` doesn't it? – Jerry Jeremiah Mar 11 '21 at 04:48
  • See also [this other answer](https://stackoverflow.com/a/59994579/394431) which gives the same result without assuming that `Pattern.quote` is implemented using `\Q` and `\E`. – Robert Tupelo-Schneck Dec 22 '22 at 16:36
6

This is a simple Glob implementation which handles * and ? in the pattern

public class GlobMatch {
    private String text;
    private String pattern;

    public boolean match(String text, String pattern) {
        this.text = text;
        this.pattern = pattern;

        return matchCharacter(0, 0);
    }

    private boolean matchCharacter(int patternIndex, int textIndex) {
        if (patternIndex >= pattern.length()) {
            return false;
        }

        switch(pattern.charAt(patternIndex)) {
            case '?':
                // Match any character
                if (textIndex >= text.length()) {
                    return false;
                }
                break;

            case '*':
                // * at the end of the pattern will match anything
                if (patternIndex + 1 >= pattern.length()) {
                    return true;
                }

                // Probe forward (including the empty remainder) to see if the rest of
                // the pattern matches; running out of text is not a match by itself
                while (textIndex <= text.length()) {
                    if (matchCharacter(patternIndex + 1, textIndex)) {
                        return true;
                    }
                    textIndex++;
                }

                return false;

            default:
                if (textIndex >= text.length()) {
                    return false;
                }

                String textChar = text.substring(textIndex, textIndex + 1);
                String patternChar = pattern.substring(patternIndex, patternIndex + 1);

                // Note the match is case insensitive
                if (textChar.compareToIgnoreCase(patternChar) != 0) {
                    return false;
                }
        }

        // End of pattern and text?
        if (patternIndex + 1 >= pattern.length() && textIndex + 1 >= text.length()) {
            return true;
        }

        // Go on to match the next character in the pattern
        return matchCharacter(patternIndex + 1, textIndex + 1);
    }
}
Tony Edgecombe
4

Similar to Tony Edgecombe's answer, here is a short and simple globber that supports * and ? without using regex, if anybody needs one.

public static boolean matches(String text, String glob) {
    String rest = null;
    int pos = glob.indexOf('*');
    if (pos != -1) {
        rest = glob.substring(pos + 1);
        glob = glob.substring(0, pos);
    }

    if (glob.length() > text.length())
        return false;

    // handle the part up to the first *
    for (int i = 0; i < glob.length(); i++)
        if (glob.charAt(i) != '?' 
                && !glob.substring(i, i + 1).equalsIgnoreCase(text.substring(i, i + 1)))
            return false;

    // recurse for the part after the first *, if any
    if (rest == null) {
        return glob.length() == text.length();
    } else {
        for (int i = glob.length(); i <= text.length(); i++) {
            if (matches(text.substring(i), rest))
                return true;
        }
        return false;
    }
}
mihi
4

This may be a slightly hacky approach. I figured it out from the code of NIO2's Files.newDirectoryStream(Path dir, String glob). Note that a new Path object is created for every match. So far I have been able to test this only on a Windows file system; however, I believe it should work on Unix as well.

// a file system hack to get a glob matching
PathMatcher matcher = ("*".equals(glob)) ? null
    : FileSystems.getDefault().getPathMatcher("glob:" + glob);

if ("*".equals(glob) || matcher.matches(Paths.get(someName))) {
    // do your stuff here
}

UPDATE: Works on both Mac and Linux.

Andrii Karaivanskyi
3

The previous solution by Vincent Robert/dimo414 relies on Pattern.quote() being implemented in terms of \Q...\E, which is not documented in the API and therefore may not be the case for other/future Java implementations. The following solution removes that implementation dependency by escaping all occurrences of \E instead of using quote(). It also activates DOTALL mode ((?s)) in case the string to be matched contains newlines.

    public static Pattern globToRegex(String glob)
    {
        return Pattern.compile(
            "(?s)^\\Q" +
            glob.replace("\\E", "\\E\\\\E\\Q")
                .replace("*", "\\E.*\\Q")
                .replace("?", "\\E.\\Q") +
            "\\E$"
        );
    }
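
The following sketch exercises the two points made above: DOTALL mode and the handling of a literal `\E` in the glob. It repeats the conversion so that it is self-contained; the class name `GlobToRegexDemo` and the sample strings are made up.

```java
import java.util.regex.Pattern;

public class GlobToRegexDemo {
    // Copy of the conversion above, repeated so the example compiles on its own.
    public static Pattern globToRegex(String glob) {
        return Pattern.compile(
            "(?s)^\\Q" +
            glob.replace("\\E", "\\E\\\\E\\Q")
                .replace("*", "\\E.*\\Q")
                .replace("?", "\\E.\\Q") +
            "\\E$"
        );
    }

    public static void main(String[] args) {
        // DOTALL lets '*' span line breaks
        System.out.println(globToRegex("BEGIN*END").matcher("BEGIN\nbody\nEND").matches());
        // a backslash followed by E in the glob is matched literally
        System.out.println(globToRegex("x\\Ey").matcher("x\\Ey").matches());
        // '?' stands for exactly one character, so "ac" is too short
        System.out.println(globToRegex("a?c").matcher("ac").matches());
    }
}
```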
nmatt
2

I don't know about a "standard" implementation, but I know of a SourceForge project released under the BSD license that implemented glob matching for files. It's implemented in one file; maybe you can adapt it for your requirements.

Greg Mattes
  • Updated link: https://sourceforge.net/p/uncle/code/HEAD/tree/uncle/fileglob/trunk/src/com/uncle/fileglob/FileGlob.java – seanf Apr 11 '16 at 07:24
1

There is sun.nio.fs.Globs but it is not part of the public API. You can use it indirectly via:

FileSystems.getDefault().getPathMatcher("glob:<myPattern>") 

But it returns a PathMatcher, which is inconvenient to work with, since it accepts only a Path as its parameter (not a File).

One possible option is to convert the PathMatcher to regex pattern (just call its 'toString()' method).

Another option is to use dedicated Glob library like glob-library-java.

Dimitar II
  • `One possible option is to convert the PathMatcher to regex pattern (just call its 'toString()' method)` => not working for me. Using OpenJDK 11.0.15, `java.nio.file.FileSystems.getDefault().getPathMatcher("glob:x*").toString()` evaluates to `"sun.nio.fs.UnixFileSystem$3@13deb50e"` – Simon Kissane Apr 13 '23 at 00:17
0

Long ago I was doing massive glob-driven text filtering, so I wrote a small piece of code (15 lines, no dependencies beyond the JDK). It handles only '*' (which was sufficient for me), but it can easily be extended for '?'. It is several times faster than a pre-compiled regexp and does not require any pre-compilation (essentially it is a string-vs-string comparison every time the pattern is matched).

Code:

  public static boolean miniglob(String[] pattern, String line) {
    if (pattern.length == 0) return line.isEmpty();
    else if (pattern.length == 1) return line.equals(pattern[0]);
    else {
      if (!line.startsWith(pattern[0])) return false;
      int idx = pattern[0].length();
      for (int i = 1; i < pattern.length - 1; ++i) {
        String patternTok = pattern[i];
        int nextIdx = line.indexOf(patternTok, idx);
        if (nextIdx < 0) return false;
        else idx = nextIdx + patternTok.length();
      }
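      // the last token must fit in the text left after idx, so it cannot overlap an earlier match
      String last = pattern[pattern.length - 1];
      return line.length() - idx >= last.length() && line.endsWith(last);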
    }
  }

Usage:

  public static void main(String[] args) {
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    try {
      // read from stdin space separated text and pattern
      for (String input = in.readLine(); input != null; input = in.readLine()) {
        String[] tokens = input.split(" ");
        String line = tokens[0];
        String[] pattern = tokens[1].split("\\*+", -1 /* want empty trailing token if any */);

        // check matcher performance
        long tm0 = System.currentTimeMillis();
        for (int i = 0; i < 1000000; ++i) {
          miniglob(pattern, line);
        }
        long tm1 = System.currentTimeMillis();
        System.out.println("miniglob took " + (tm1-tm0) + " ms");

        // check regexp performance
        Pattern reptn = Pattern.compile(tokens[1].replace("*", ".*"));
        Matcher mtchr = reptn.matcher(line);
        tm0 = System.currentTimeMillis();
        for (int i = 0; i < 1000000; ++i) {
          mtchr.matches();
        }
        tm1 = System.currentTimeMillis();
        System.out.println("regexp took " + (tm1-tm0) + " ms");

        // check if miniglob worked correctly
        if (miniglob(pattern, line)) {
          System.out.println("+ >" + line);
        }
        else {
          System.out.println("- >" + line);
        }
      }
    } catch (IOException e) {
      // TODO Auto-generated catch block
      e.printStackTrace();
    }
  }

Copy/paste from here

bobah
-2

By the way, it seems as if you did it the hard way in Perl.

This does the trick in Perl:

my @files = glob("*.html");
# Or, if you prefer:
my @files = <*.html>;
  • 1
    That only works if the glob is for matching files. In the perl case, the globs actually came from a list of ip addresses that was written using globs for reasons I won't go into, and in my current case the globs were to match urls. – Paul Tomblin Sep 01 '09 at 07:12