
How can I escape individual regex metacharacters in Java?

For an Android app, I am working with files that contain many characters that regexes consider to have a special meaning. These include \?.()[*\^+' and -. I will be reading in two files:

  1. A dictionary list of words, each on a separate line.
  2. A list of characters that can be used to filter the words in the dictionary list.

A sample of each follows.

Dictionary:

 /it*
 t1*]
 ?\<t
 item

(Yes, these are words. The first three are the contracted Braille ASCII representations of the words "stitch", "teacher", and "thought". Now you know.)

"Letters" to use:

?]*/\<1eitm

I want to include these letters in a regular expression similar to this:

String letters = "?]*/\\<1eitm"; // a backslash must be doubled in a Java string literal
Pattern pattern = Pattern.compile("^["+letters+"]{4}$", Pattern.MULTILINE);

My aim is to select all the words from the dictionary list that include only the given characters and are the given length. I cannot control the order in which the requested characters will appear in the file.

If I use only non-metacharacters, like <1eitm, this works fine. Somehow, I need to escape the metacharacters and ensure that characters such as ] and - appear in the right place inside the square brackets.

I could do this manually...but am hoping that there is a built-in command to do this for me. All I have found so far is the Pattern.quote() command, which does not give me the results I want.

Below is a list of all the characters that I may need to use inside the square brackets:

\_-,;:!?.'"()[]@*/\&#%^+<=>~$0123456789abcdefghijklmnopqrstuvwxyz

And here is the barebones code that I am using for my Android test:

package com.example.quote;

import android.app.Activity;
import android.content.res.AssetManager;
import android.os.Bundle;
import android.util.Log;

import java.io.IOException;
import java.io.InputStream;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MainActivity extends Activity {

    @Override
    protected void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);

        AssetManager am = this.getAssets();
        try {
            String dictionary = readFile(am, "dictionary.txt");
            String regex = readFile(am, "regex.txt");

            regex = "^["+regex+"]{4}$"; // THIS IS WHERE I NEED TO MAKE A CHANGE

            Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
            Matcher matcher = pattern.matcher(dictionary);

            while (matcher.find()) {
                Log.d("TEST", matcher.group(0));
            }

        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private String readFile(AssetManager am, String fileName) throws IOException {
        InputStream is = am.open(fileName);

        int size = is.available();
        byte[] buffer = new byte[size];
        is.read(buffer);
        is.close();

        String string = new String(buffer, "UTF-8");

        return string;
    }
}
J0e3gan
James Newton

2 Answers


Use Pattern.quote() to quote all the special characters so that they match as literal characters. The function is usually implemented by wrapping the supplied String in the \Q...\E quoting construct.

In the Oracle/OpenJDK (reference) implementation, which wraps the String in \Q...\E, the quoting construct is recognized inside a character class as of Java 6, so the returned value can be used inside a character class.

Android uses the ICU implementation, which, according to its documentation, also allows \Q and \E inside a character class. Therefore, regardless of how Pattern.quote() works in ICU (adding \ escapes or using the \Q...\E quoting construct), it should behave like the reference implementation (Java 6) in this regard.

regex = "^[" + Pattern.quote(regex) + "]{4}$";
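
As a rough illustration of the line above, here is a minimal desktop-Java sketch (not the Android code from the question). The class name QuotedClassFilter, the helper filter(), and the extra sample words "mint" and "items" are made up for this example; only Pattern.quote(), Pattern.compile(), and Matcher come from the standard library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original question or answer.
class QuotedClassFilter {

    // Returns every line of `dictionary` that is exactly `length` characters
    // long and consists only of characters found in `letters`.
    static List<String> filter(String letters, String dictionary, int length) {
        // Pattern.quote() neutralizes the metacharacters; the quoted text is
        // then placed inside the square brackets of a character class.
        String regex = "^[" + Pattern.quote(letters) + "]{" + length + "}$";
        Pattern pattern = Pattern.compile(regex, Pattern.MULTILINE);
        Matcher matcher = pattern.matcher(dictionary);

        List<String> words = new ArrayList<>();
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // In real use both strings would be read from files, so no
        // Java-level escaping of the backslash would be needed.
        String letters = "?]*/\\<1eitm";
        String dictionary = "/it*\nt1*]\n?\\<t\nitem\nmint\nitems";

        for (String word : filter(letters, dictionary, 4)) {
            System.out.println(word);
        }
    }
}
```

The order of the characters in letters should not matter here, since the quoted ] and any - are treated as literals rather than as the end of the class or a range.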
nhahtdh
  • After some experimentation, I find that `regex = "^["+Pattern.quote(regex)+"]{4}$";` does the job for me. Thank you. – James Newton Jan 06 '15 at 10:54

Escaping special characters for Java regular expressions is annoying, but not difficult. The reason is that the backslash character \ is an escape character in Java string literals, so the literal "\\" denotes a single backslash. But a single backslash is also the escape character in regular expressions, so in a Java regex pattern string, special characters have to be escaped with a double backslash. Hence, to match a question mark character ?, your regex literal would have to include \\?. And to match a single backslash, it would have to include \\\\.
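
The two levels of escaping can be checked with a small sketch like the one below. The class name BackslashLevels and its constants are invented for illustration; the only library call used is Pattern.matches().

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing Java-literal escaping vs. regex escaping.
class BackslashLevels {

    // Java source "\\?" is the two-character regex \? (a literal question mark).
    static final String QUESTION_MARK = "\\?";

    // Java source "\\\\" is the two-character regex \\ (a literal backslash).
    static final String BACKSLASH = "\\\\";

    public static void main(String[] args) {
        // Number of characters the regex engine actually receives.
        System.out.println(QUESTION_MARK.length());
        System.out.println(BACKSLASH.length());

        // Each pattern matches exactly one literal character.
        System.out.println(Pattern.matches(QUESTION_MARK, "?"));
        System.out.println(Pattern.matches(BACKSLASH, "\\"));
    }
}
```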

Let's take your String as an example:

String letters = "?]*/\\<1eitm";

The first five characters here should be escaped - that is, prefixed with a double backslash escape sequence \\:

String letters = "\\?\\]\\*\\/\\\\<1eitm";

And the backslash, itself, as pointed out above, has to be prefixed with the escape sequence and then doubled itself.

Hope this helps.

David Faber
  • I understand, from what you are not saying, that there is no built-in command that will perform all this escaping automatically. Is that right? In other words, I will have to manually check the input string and add `\\` or `\\\` before any metacharacter that I find. – James Newton Jan 06 '15 at 03:02
  • When I test [here](http://www.regexplanet.com/advanced/java/index.html), I see that the Java String expression that works is this: `"(?m)^[]?*/\\\\<1eitm]{4}$"`. In other words, only the `\` character needs to be triple-escaped, and the `]` character needs to be placed at the beginning. – James Newton Jan 06 '15 at 03:08
  • Yes, my apologies, that is because they are inside the square brackets `[]` bookending the character class. I missed that. – David Faber Jan 06 '15 at 03:33
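
For anyone who prefers to escape by hand, as the comments discuss, the following is one possible sketch for the character-class case. The names CharClassEscaper, escapeForCharClass(), and wordPattern() are hypothetical, and the set of characters escaped (\ [ ] ^ - &) is an assumption about what needs protecting inside square brackets in Java; a backslash in front of any other non-alphabetic character would be harmless but is not added here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, written for this thread; not a library API.
class CharClassEscaper {

    // Assumed set of characters that keep a special meaning inside [...]:
    // backslash, both brackets, caret, hyphen, and ampersand (for &&).
    private static final String SPECIAL = "\\[]^-&";

    // Prefixes each special character with a backslash so that it is
    // treated literally wherever it appears inside the character class.
    static String escapeForCharClass(String letters) {
        StringBuilder sb = new StringBuilder(letters.length() * 2);
        for (int i = 0; i < letters.length(); i++) {
            char c = letters.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds the multiline pattern from the question for a given word length.
    static Pattern wordPattern(String letters, int length) {
        String regex = "^[" + escapeForCharClass(letters) + "]{" + length + "}$";
        return Pattern.compile(regex, Pattern.MULTILINE);
    }

    public static void main(String[] args) {
        String letters = "?]*/\\<1eitm";
        System.out.println(escapeForCharClass(letters));
        System.out.println(wordPattern(letters, 4).matcher("t1*]").find());
    }
}
```

Because every special character is escaped individually, the ] does not have to be moved to the front of the class and a - can appear anywhere in the input.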