
I am trying to perform multiple string replacements using Java's Pattern and Matcher, where the regex pattern may include metacharacters (e.g. \b, (), etc.). For example, for the input string fit i am, I would like to apply the replacements:

\bi\b --> EYE
i     --> I

I then followed the coding pattern from two questions ("Java Replacing multiple different substring in a string at once" and "Replacing multiple substrings in Java when replacement text overlaps search text"). Both build an or'ed search pattern (e.g. foo|bar) and a Map of (pattern, replacement), then look up and apply the replacement inside the matcher.find() loop.

The problem I am having is that the matcher.group() function does not contain information on matching metacharacters, so I cannot distinguish between i and \bi\b. Please see the code below. What can I do to fix the problem?

import java.util.regex.Matcher;    
import java.util.regex.Pattern;
import java.util.*;

public class ReplacementExample
{
    public static void main(String argv[])
    {
        Map<String, String> replacements = new HashMap<String, String>();
        replacements.put("\\bi\\b", "EYE");
        replacements.put("i", "I");

        String input = "fit i am";

        String result = doit(input, replacements);

        System.out.printf("%s\n", result);
    }


    public static String doit(String input, Map<String, String> replacements)
    {
        String patternString = join(replacements.keySet(), "|");
        Pattern pattern = Pattern.compile(patternString);
        Matcher matcher = pattern.matcher(input);
        StringBuffer resultStringBuffer = new StringBuffer();

        while (matcher.find())
        {
            System.out.printf("match found: %s at start: %d, end: %d\n",
                matcher.group(), matcher.start(), matcher.end());

            String matchedPattern = matcher.group();
            String replaceWith = replacements.get(matchedPattern);

            // Do the replacement here.
            matcher.appendReplacement(resultStringBuffer, replaceWith);
        }

        matcher.appendTail(resultStringBuffer);

        return resultStringBuffer.toString();
    }

    private static String join(Set<String> set, String delimiter)
    {
        StringBuilder sb = new StringBuilder();
        int numElements = set.size();
        int i = 0;

        for (String s : set)
        {
            sb.append(Pattern.quote(s));
            if (i++ < numElements-1) { sb.append(delimiter); }
        }

        return sb.toString();
    }
}

This prints out:

match found: i at start: 1, end: 2
match found: i at start: 4, end: 5
fIt I am

Ideally, it should be fIt EYE am.

stackoverflowuser2010
  • is performance enough of a problem that you can't just loop through the replacements? – thagorn May 15 '12 at 18:53
  • The replacements may overlap with each other. The second linked StackOverflow question I provided addresses this ("Replacing multiple substrings in Java when replacement text overlaps search text"), so I used its solution. – stackoverflowuser2010 May 15 '12 at 18:55
  • In that case you may have to loop through and use dummy characters. (Replace \\wi\\w with $ then replace \\bi\\b with EYE then replace $ with I) – thagorn May 15 '12 at 19:02

2 Answers


You mistyped one of your regexes:

    replacements.put("\\bi\\", "EYE"); //Should be \\bi\\b
    replacements.put("i", "I");

You may also want to make your regexes unique. A HashMap gives no guarantee about the iteration order of map.keySet(), so the joined pattern may end up trying i before \\bi\\b.
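To illustrate why the order matters, here is a minimal sketch. It assumes each alternative is wrapped in its own capture group so that we can see which one matched; the class and method names (AlternationOrderDemo, matchedGroups) are made up for this example. Java tries the alternatives left to right at each position and takes the first one that succeeds, so a bare i listed first always wins over \\bi\\b.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
class AlternationOrderDemo {

    // For every match, records the 1-based number of the capture group that matched.
    static List<Integer> matchedGroups(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        List<Integer> groups = new ArrayList<>();
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    groups.add(g);
                    break;
                }
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Specific pattern first: the standalone "i" is matched by group 1.
        System.out.println(matchedGroups("(\\bi\\b)|(i)", "fit i am")); // [2, 1]
        // General pattern first: group 2 never gets a chance to match.
        System.out.println(matchedGroups("(i)|(\\bi\\b)", "fit i am")); // [1, 1]
    }
}
```

If you want to keep a Map, a LinkedHashMap iterates in insertion order, which would at least make the order of the alternatives predictable.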

thagorn

You could use capture groups, without straying too far from your existing design. So instead of using the matched pattern as the key, you look up based on the order within a List.

You would need to change the join method to put parentheses around each of the patterns, something like this:

private static String join(Set<String> set, String delimiter) {
    // Wrap each pattern in its own capture group: (p1)|(p2)|...
    // The patterns are deliberately not passed through Pattern.quote.
    StringBuilder sb = new StringBuilder();
    sb.append("(");
    int numElements = set.size();
    int i = 0;
    for (String s : set) {
        sb.append(s);
        if (i++ < numElements - 1) {
            sb.append(")");
            sb.append(delimiter);
            sb.append("(");
        }
    }
    sb.append(")");
    return sb.toString();
}

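As a side note, the use of Pattern.quote in the original code listing is what makes the match fail for patterns that contain metacharacters: quoting turns the whole pattern into a literal, so \bi\b can only match the literal text \bi\b and never acts as a word boundary.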
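A small sketch of that effect, in case it helps. The class and helper names (QuoteDemo, found) are invented for the example; Pattern.quote itself is the standard java.util.regex method.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original code.
class QuoteDemo {

    // True if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String quoted = Pattern.quote("\\bi\\b");

        // Quoted: looks for the literal characters \bi\b, which the input lacks.
        System.out.println(found(quoted, "fit i am"));       // false
        // Quoted: succeeds only when the input really contains \bi\b.
        System.out.println(found(quoted, "a \\bi\\b c"));    // true
        // Unquoted: \b is a word boundary, so the standalone "i" matches.
        System.out.println(found("\\bi\\b", "fit i am"));    // true
    }
}
```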

Having done this, you would now need to determine which of the capture groups was responsible for the match. For simplicity I'm going to assume that none of the match patterns will themselves contain capture groups, in which case something like this would work, within the matcher while loop:

        int index = -1;
        for (int j = 1; j <= replacements.size(); j++) {
            if (matcher.group(j) != null) {
                index = j;
                break;
            }
        }
        if (index >= 0) {
            System.out.printf("Match on index %d = %s %d %d\n",
                index, matcher.group(index), matcher.start(index), matcher.end(index));
        }

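Next, we would like to use the resulting index value to index straight back into the replacements. The original code uses a HashMap, which is not suitable for this because it has no defined iteration order; you're going to have to refactor that to use a pair of Lists in some form, one containing the list of match patterns and the other the corresponding list of replacement strings. Keep in mind that capture groups are numbered from 1, so group j corresponds to list element j - 1, and that alternatives are tried left to right, so the more specific pattern (\bi\b) has to come before the more general one (i). I won't do that here, but I hope that provides enough detail to create a working solution.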
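For readers who want a starting point, the refactoring described above might look roughly like the following. This is only a sketch under the same assumption as before (none of the patterns contains its own capture group); the class and method names (OrderedReplacer, replaceAll) are made up, and the call to Matcher.quoteReplacement is an addition so that a $ or \ in a replacement string is inserted literally.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class, not part of the original code.
class OrderedReplacer {

    // patterns.get(k) is replaced by replacements.get(k).
    // List order decides which alternative wins when several could match.
    static String replaceAll(String input, List<String> patterns, List<String> replacements) {
        // Build (p1)|(p2)|... with one capture group per pattern.
        StringBuilder regex = new StringBuilder();
        for (int k = 0; k < patterns.size(); k++) {
            if (k > 0) {
                regex.append('|');
            }
            regex.append('(').append(patterns.get(k)).append(')');
        }

        Matcher matcher = Pattern.compile(regex.toString()).matcher(input);
        StringBuffer result = new StringBuffer();
        while (matcher.find()) {
            // Group numbers start at 1; list indices start at 0.
            for (int j = 1; j <= patterns.size(); j++) {
                if (matcher.group(j) != null) {
                    String replaceWith = replacements.get(j - 1);
                    matcher.appendReplacement(result, Matcher.quoteReplacement(replaceWith));
                    break;
                }
            }
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        List<String> patterns = Arrays.asList("\\bi\\b", "i");
        List<String> replacements = Arrays.asList("EYE", "I");
        System.out.println(replaceAll("fit i am", patterns, replacements)); // fIt EYE am
    }
}
```

Because the whole input is scanned once, text produced by one replacement is never matched again by another pattern, which is the property the overlapping-replacement approach was chosen for.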

jrhwhipp