7

The performance of one of our modules relies heavily on how we replace substrings in a string.

We build a "replacement map", which can contain more than 3500 string pairs, and then apply it to big strings (several MBs) with StringUtils.replaceEach(text, searchList, replacementList).

The keys and values are all unique and in most cases have the same character length (but it's not something we can rely on).

Is there a more sophisticated approach to my task than StringUtils.replaceEach()? Something which may be overkill for simple replacements solved by replaceEach(), but which is much faster in my "heavy" case.

trashr0x
okutane
  • What is "StringUtils"? Can you show the implementation, or the dependency if you use a library? – talex Nov 02 '16 at 18:11
  • @talex `StringUtils` is a utility class from the (quite handy) Apache Commons Lang library – Bohemian Nov 02 '16 at 18:12
  • I don't know any alternative to StringUtils, but even if you find a replacement, it also has to perform similar logic, so you'll finally need to benchmark both and pick the best of the two. – Vasu Nov 02 '16 at 18:18
  • `StringUtils` implementation of `replaceEach` is not intended to be used with a massive amount of pairs. It is possible to implement a more complex but faster algorithm, but I have no idea where to find an implementation. – talex Nov 02 '16 at 18:21
  • Does it make sense to do stuff like that in R at all? I mean even with `StringUtils`. When it comes to string manipulation I tend to switch to Perl – mRcSchwering Nov 02 '16 at 18:42
  • @mRcSchwering R? Perl? Did you maybe end up in the wrong tag mistakenly? – nbrooks Nov 02 '16 at 18:48
  • @nbrooks I'm sure it was tagged "java" – okutane Nov 02 '16 at 18:51
  • @okutane are the strings to be substituted "words" as per the regex definition? – Bohemian Nov 02 '16 at 19:29
  • @Bohemian not necessarily. – okutane Nov 02 '16 at 19:44
  • What is changing more frequently - the search terms, or the documents? That is, will you process 100,000 documents for the same 1,000 search terms, or will you process 10 documents with 10,000 different sets of 1,000 search terms? If it's the first, it's already the well-studied multiple search term string matching problem. In the unlikely case it's the second, it's a bit tougher because then the cost of creating the search DFA/tables/whatever starts to dominate. – BeeOnRope Nov 02 '16 at 20:25
  • @nbrooks actually yes, don't know how that happened, usually browse through R tags... sry – mRcSchwering Nov 02 '16 at 20:30

6 Answers

5

You can use a regexp engine to efficiently match your keys against the input string and replace them.

First, concatenate all your keys with the alternation operator, like this:

String keys = "keyA|keyB|keyC";

Next, compile a pattern:

Pattern pattern = Pattern.compile("(" + keys + ")");

Create a matcher against your input text:

Matcher matcher = pattern.matcher(text);

Now, apply your regexp in a loop to find all the keys in your text, and use appendReplacement (which is an "inline" string replacement method) to replace each of them with the corresponding value:

StringBuffer sb = new StringBuffer();
while (matcher.find()) {
    // quoteReplacement() guards against '$' and '\' in the replacement values
    matcher.appendReplacement(sb, Matcher.quoteReplacement(dictionary.get(matcher.group(0))));
}
matcher.appendTail(sb);

And here you go.
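Putting the pieces together, here is a minimal self-contained sketch of the approach (the dictionary contents and the sample text are made-up placeholders):

import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexReplaceDemo {
    public static void main(String[] args) {
        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("keyA", "valueA");
        dictionary.put("keyB", "valueB");

        // "(keyA|keyB)" built from the map keys
        Pattern pattern = Pattern.compile("(" + String.join("|", dictionary.keySet()) + ")");

        String text = "some text with keyA and keyB inside";
        Matcher matcher = pattern.matcher(text);
        StringBuffer sb = new StringBuffer();
        while (matcher.find()) {
            matcher.appendReplacement(sb, Matcher.quoteReplacement(dictionary.get(matcher.group(0))));
        }
        matcher.appendTail(sb);

        System.out.println(sb); // some text with valueA and valueB inside
    }
}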

Note that this might look a bit naive at first, but the regexp engine is heavily optimized for the task at hand, and since the Java regexp implementation also allows for "inline" replacement, it all works very well.

I did a small benchmark, applying a list of color names (~200 different color names, as defined in /usr/share/X11/rgb.txt) to "Crime and Punishment" by Fyodor Dostoyevsky, which I downloaded from Project Gutenberg (~1MB in size). Using the technique described, it worked around

x12 times faster than StringUtils.replaceEach - 900ms vs 10700ms

for the latter (not counting the Pattern compilation time).

P.S. If your keys can potentially contain characters that are unsafe for a regexp, like .^$(), you should run them through Pattern.quote() before adding them to your pattern.

Sidenote:

This method will replace keys in the order they appear in the pattern list, e.g. "a=>1|b=>2|aa=>3" applied to "welcome to bazaar" will result in "welcome to b1z11r", not "welcome to b1z3r". If you want the longest match, you should sort your keys in descending lexicographic order before adding them to the pattern (i.e. "b|aa|a"), so that a longer key always precedes its prefixes. This also applies to your original StringUtils.replaceEach() method.
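For illustration, a small sketch of that ordering fix, assuming the keys come from the same dictionary map as in the sketch above:

import java.util.Comparator;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Descending lexicographic order puts "aa" before its prefix "a",
// so the alternation always tries the longer key first.
String keys = dictionary.keySet().stream()
        .sorted(Comparator.reverseOrder())
        .map(Pattern::quote) // also guards against regex meta-characters
        .collect(Collectors.joining("|"));
Pattern pattern = Pattern.compile("(" + keys + ")");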

Update:

The method above should work nicely for the problem as formulated in the original question, i.e. when the size of the replacement map is (relatively) small compared to the input data size.

If, instead, you have a very long dictionary applied to a short text, the linear search done by StringUtils.replaceEach() can be faster.

I've made an additional benchmark illustrating that, by applying a dictionary of 10000 randomly chosen words (4+ characters long):

cat /usr/share/dict/words | grep -E "^.{4,}$" | shuf | head -10000

against 1024, 2048, 4096, 8192, 16384, 32768, 65536, 131072, 262144 and 524288 character long excerpts from the very same "Crime and Punishment" text.

The results are given below:

text    Ta(ms)  Tb(ms)  Ta/Tb(speed up)
---------------------------------------
1024    99      240     0.4125
2048    43      294     0.1462585
4096    113     721     0.1567267
8192    128     1329    0.0963130
16384   320     2230    0.1434977
32768   2052    3708    0.5533981
65536   6811    6650    1.0242106
131072  32422   12663   2.5603728
262144  150655  23011   6.5470862
524288  614634  29874   20.574211
  • Ta - StringUtils.replaceEach() time
  • Tb - matcher.appendReplacement() time

Note that the pattern string length is 135537 bytes (all 10000 keys concatenated).

zeppelin
0

First of all - if you are talking about optimisation, post your profiling results. They are the only reliable source of information about what should be optimized (see the Third Rule of Optimization).

If you've determined that the string operations do take the most time, then there are two things to keep in mind.

First of all, Java Strings are immutable. Each time you call a replace method you create a new string, which, in your case, most likely means a lot of memory allocation. Java has gotten better at this over the years; still, if you can skip it, do so. I've checked: StringUtils.replaceEach does use a buffer and should be relatively memory efficient. You could also implement a custom replacing solution, especially combined with a custom search algorithm from the second note below. It might consist of writing into your own char buffer, or using StringBuilder/StringBuffer for the replacements (you'd have to keep track of the lengths of the replacements, because calling .toString() on the StringBuffer before each search would be as inefficient as replacing the strings manually).
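As a sketch of what such a buffer-based replace could look like, assuming the matches were already found and stored as sorted, non-overlapping ranges (the Match helper class here is an invention for illustration):

import java.util.List;

class Match {
    final int start, end;       // [start, end) range in the source text
    final String replacement;
    Match(int start, int end, String replacement) {
        this.start = start;
        this.end = end;
        this.replacement = replacement;
    }
}

class BufferReplacer {
    // Copies each character at most once and allocates no intermediate Strings.
    static String replaceAll(String text, List<Match> matches) {
        StringBuilder out = new StringBuilder(text.length());
        int pos = 0;
        for (Match m : matches) {
            out.append(text, pos, m.start); // unchanged run before the match
            out.append(m.replacement);      // splice in the replacement
            pos = m.end;
        }
        out.append(text, pos, text.length()); // trailing run
        return out.toString();
    }
}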

Secondly, there's the search algorithm itself. I do not know which one Apache's StringUtils uses, but Java's default string search is a naive scan and not optimal. You could use a separate library for searching.

Dariusz
  • Maybe not a great answer, but an answer. No idea who would downvote that. – GhostCat Nov 02 '16 at 18:33
  • I don't get the point of explaining the immutability of strings since `StringUtils` is already being used. – Matthias Nov 02 '16 at 18:33
  • @Matthias I did not check StringUtils' implementation. I did now. It is memory-efficient, though it does copy letters one by one... – Dariusz Nov 02 '16 at 18:39
0

StringUtils is using an O(n * m) algorithm (for every word to be replaced, make the replacement in the input). When m (the number of words to be substituted) is small, this is effectively O(n) (the size of the input).

However, with a "large" number of substitutions to be checked, you will likely be better off processing each word of input, which will complete in O(n) time.

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Collectors;

Map<String, String> subs = new HashMap<>(); // populated with search -> replacement pairs
String replaced = Arrays.stream(input.split("\\b")) // split into words and separators: O(n)
    .map(w -> subs.getOrDefault(w, w)) // O(1) lookup per token
    .collect(Collectors.joining()); // reassemble: O(n)

Splitting on word boundaries not only preserves whitespace (by not consuming input) but makes the code rather simple.
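To make the split behavior concrete, here is what the zero-width "\b" split produces (the sample string is just an illustration):

String[] parts = "The quick fox!".split("\\b");
// -> ["The", " ", "quick", " ", "fox", "!"]
// Words and separators alternate, so joining the mapped tokens back
// together reproduces the original spacing and punctuation.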

Bohemian
0

An optimal method for dealing with this situation: pre-compile the source strings into code. Scan each of your source strings for the replacement keys, and break the string into a series of code pieces, each appending either a literal chunk or a looked-up key result to a buffer. For example, the following source string:

The quick $brown $fox jumped over the $lazy dog.

becomes

public StringBuilder quickBrown(Map<String, String> dict) {
  StringBuilder sb = new StringBuilder();
  sb.append("The quick ");
  sb.append(dict.getOrDefault("$brown", "brown"));
  sb.append(" ");
  sb.append(dict.getOrDefault("$fox", "fox"));
  sb.append(" jumped over the ");
  sb.append(dict.getOrDefault("$lazy", "lazy"));
  sb.append(" dog.");
  return sb;
}

Then you invoke the method corresponding to the particular string with the dictionary of mappings you want substituted.

Note that by "scan" and "translate", I mean using a program to generate the Java code and then dynamically loading the compiled class files as you need them.
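A rough sketch of that compile-and-load step, using the standard javax.tools API (the directory and class name here are hypothetical):

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

public static Class<?> compileAndLoad() throws Exception {
    // Requires a JDK at runtime; getSystemJavaCompiler() returns null on a bare JRE
    JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
    compiler.run(null, null, null, "generated/QuickBrown.java");

    // Load the freshly compiled class file
    URLClassLoader loader = URLClassLoader.newInstance(
            new URL[] { new File("generated").toURI().toURL() });
    return Class.forName("QuickBrown", true, loader);
}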

Bob Dalgleish
  • This seems reasonable if the goal was to repeatedly replace into the same string with varying dictionaries, but my assumption is that the replacement strings also vary. So you will spend all your time generating and compiling the code for the replacement (note the OP said the replacement is performed on strings of at least a few MB). – BeeOnRope Nov 02 '16 at 20:21
  • Of course, if the keys are constantly changing, the more appropriate strategy is to implement an Aho-Corasick pattern match and replace algorithm. – Bob Dalgleish Nov 02 '16 at 20:58
  • I asked the OP which he cares about, but my assumption is there are more documents than replacement sets. – BeeOnRope Nov 02 '16 at 21:09
0

The slow part of this algorithm is finding all the matches. The replacement is straightforward if done in a smart way (i.e., in a temporary char buffer, only shifting each character at most once).

So really your question reduces to the "multi-string search" problem, which is already well-studied. You can find a good summary of the approaches in this question - but the one-line summary is "grep does a good job".

Zeppelin already showed a reasonable loop for this - the appendReplacement behavior makes sure you won't be shifting things around unnecessarily (which would make each replacement cost O(n) and the whole pass quadratic).
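For completeness, here is roughly what such a dedicated multi-string search looks like in practice. This sketch assumes the third-party org.ahocorasick library (Robert Bor's "ahocorasick" artifact); it only finds the matches, which you would then feed into a single-pass replace:

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

// Build the automaton once, in time proportional to the total key length...
Trie trie = Trie.builder()
        .ignoreOverlaps() // keep leftmost-longest matches only
        .addKeyword("keyA")
        .addKeyword("keyB")
        .build();

// ...then find every occurrence in a single pass over the text.
String text = "some text with keyA and keyB inside";
for (Emit emit : trie.parseText(text)) {
    // emit.getStart(), emit.getEnd() and emit.getKeyword() feed
    // directly into a buffer-based replacement pass.
}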

BeeOnRope
0

While the appendReplacement solution proposed by @zeppelin was surprisingly fast on heavy pieces of data, it turned out to be a nightmare with bigger maps.

The best solution so far turned out to be a composition of what we had (StringUtils.replaceEach) and what was proposed:

protected BackReplacer createBackReplacer(Map<ReplacementKey, String> replacementMap) {
        if (replacementMap.isEmpty()) {
            return new BackReplacer() {
                @Override
                public String backReplace(String str) {
                    return str;
                }
            };
        }

        // For very large maps, the linear replaceEach wins (see the numbers below)
        if (replacementMap.size() > MAX_SIZE_FOR_REGEX) {
            final String[] searchStrings = new String[replacementMap.size()];
            final String[] replacementStrings = new String[replacementMap.size()];

            int counter = 0;
            for (Map.Entry<ReplacementKey, String> replacementEntry : replacementMap.entrySet()) {
                searchStrings[counter] = replacementEntry.getValue();
                replacementStrings[counter] = replacementEntry.getKey().getValue();
                counter++;
            }

            return new BackReplacer() {
                @Override
                public String backReplace(String str) {
                    return StringUtils.replaceEach(str, searchStrings, replacementStrings);
                }
            };
        }

        final Map<String, String> replacements = new HashMap<>();
        StringBuilder patternBuilder = new StringBuilder();

        // Our search strings are alphanumeric; keys that may contain regex
        // meta-characters should be passed through Pattern.quote() here.
        patternBuilder.append('(');
        for (Map.Entry<ReplacementKey, String> entry : replacementMap.entrySet()) {
            replacements.put(entry.getValue(), entry.getKey().getValue());
            patternBuilder.append(entry.getValue()).append('|');
        }

        patternBuilder.setLength(patternBuilder.length() - 1);
        patternBuilder.append(')');

        final Pattern pattern = Pattern.compile(patternBuilder.toString());

        return new BackReplacer() {
            @Override
            public String backReplace(String str) {
                if (str.isEmpty()) {
                    return str;
                }

                StringBuffer sb = new StringBuffer(str.length());

                Matcher matcher = pattern.matcher(str);
                while (matcher.find()) {
                    // Our values contain no '$' or '\'; otherwise wrap the
                    // replacement in Matcher.quoteReplacement()
                    matcher.appendReplacement(sb, replacements.get(matcher.group(0)));
                }
                matcher.appendTail(sb);

                return sb.toString();
            }
        };
}
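For context, a hypothetical usage of the factory above (replacementMap and texts stand in for our real data; BackReplacer and ReplacementKey come from our codebase):

// Build the replacer once per replacement map, then reuse it across texts
BackReplacer replacer = createBackReplacer(replacementMap);
for (String text : texts) {
    String restored = replacer.backReplace(text);
    // ... write `restored` downstream ...
}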

StringUtils algorithm (MAX_SIZE_FOR_REGEX=0):

type=TIMER, name=*.run, count=8127, min=4.239809, max=4235197.925261, mean=645.736554, stddev=47197.97968925558, duration_unit=milliseconds

appendReplace algorithm (MAX_SIZE_FOR_REGEX=1000000):

type=TIMER, name=*.run, count=8155, min=4.374516, max=7806145.439165999, mean=1145.757953, stddev=86668.38562815856, duration_unit=milliseconds

Mixed solution (MAX_SIZE_FOR_REGEX=5000):

type=TIMER, name=*.run, count=8155, min=3.5862789999999998, max=376242.25076799997, mean=389.68986564688714, stddev=11733.9997814448, duration_unit=milliseconds

Our data:

type=HISTOGRAM, name=initialValueLength, count=569549, min=0, max=6352327, mean=6268.940661478599, stddev=198123.040651236, median=12.0, p75=16.0, p95=32.0, p98=854.0, p99=1014.5600000000013, p999=6168541.008000023
type=HISTOGRAM, name=replacementMap.size, count=8155, min=0, max=65008, mean=73.46108949416342, stddev=2027.471388983965, median=4.0, p75=7.0, p95=27.549999999999955, p98=55.41999999999996, p99=210.10000000000036, p999=63138.68900000023

This change halved the time spent in StringUtils.replaceEach compared to the former solution and gave us a 25% performance boost in our module, which is mostly IO-bound.

okutane
  • It looks like your first test was applied to fewer data samples than the last two: "count=8127" vs "count=8155"/"count=8155". Was that intentional? – zeppelin Nov 14 '16 at 20:01
  • Our test suite is growing and unfortunately I don't have much time to rerun everything with the only difference being in the algorithms, but appendReplace and the mixed solution have been compared in clean environments without any other changes. – okutane Nov 14 '16 at 20:04
  • Also, it looks like your data does not really follow the pattern described in your original question: "3500+ string pairs applied to big strings (several MBs)" (i.e. the replacement dictionary size is relatively small compared to the text it is applied to). Instead, the mean size of your data string is only 6268 bytes (unless I'm misreading the stats) and your replacement map is as big as 65008 elements. – zeppelin Nov 14 '16 at 20:47
  • When you have a replacement dictionary/pattern much bigger than the text it is applied to, the linear search through the data string will be faster than the RegExp match, and that is pretty much expected. Which makes this quite a different question than what you originally asked about. – zeppelin Nov 14 '16 at 20:47
  • Also, I believe that your supposition of the matching speed being a function of map size ("...nightmare with bigger map...") is not entirely accurate: it is not _F(M)_ (where M is the map size), but rather _F(T/M)_ (where T is the text size and M is the map size, i.e. the map size/pattern length relative to the input text size), so you should probably choose the matching algorithm based on this, instead of just the replacement map size, to get optimal performance. – zeppelin Nov 14 '16 at 20:56
  • @zeppelin, as you can see, the number of "texts" is bigger than the number of replacement maps; this is because some of the replacements are performed on several texts (the additional texts are always tiny). The biggest file of the set is the one that may cause problems, and its size is usually proportional to the replacement string count. – okutane Nov 14 '16 at 21:04
  • The 65008 substrings are only searched for in a string of 6352327 characters, and so far it's the only case that was blown up by the regex solution (7806145ms instead of the original 376242ms). – okutane Nov 14 '16 at 21:07
  • > "some of the replacements are performed on several texts" - Then it should be much more effective to feed them to the same Matcher in one run, e.g. by first merging them and then splitting back after the replacement is done. – zeppelin Nov 14 '16 at 21:47
  • > "65008 substrings are only searched in string with 6352327 characters" - Isn't 7806145ms your full test time (8155 samples)? Can you reproduce this in an isolated test case (i.e. one 65000-entry map against one 6M chunk of data)? This does not seem to match the (pretty linear) regex performance I observe (see my update), so it would be interesting to analyze what makes it that slow in your case. (BTW, have you tried Pattern.quote() on your search strings, to avoid hitting meta-characters, which would naturally bog the regex down?) – zeppelin Nov 14 '16 at 21:47
  • Well, it is isolated. I've taken a few snapshots of the Java threads and the running call was always in Branch.match. – okutane Nov 14 '16 at 22:07
  • 7806145ms is the time consumed by processing the heaviest piece of data. Take a look at http://metrics.dropwizard.io/ - this is how I measure our performance. The replacement strings are alphanumeric, so there are no meta-characters. I think it may be the file I'm trying to process. BTW: SO is warning me about extended discussion in comments. :D – okutane Nov 14 '16 at 22:11
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/128100/discussion-between-okutane-and-zeppelin). – okutane Nov 14 '16 at 22:12