This is probably a very simple question, and likely a duplicate (although I did try to check beforehand), but which is less expensive when used in a loop: String.replaceAll() or Matcher.replaceAll()?
While I was told

Pattern regexPattern = Pattern.compile("[^a-zA-Z0-9]");
Matcher matcher;
String thisWord;
while (Scanner.hasNext()) {
   matcher = regexPattern.matcher(Scanner.next());
   thisWord = matcher.replaceAll("");
   ...
} 

is better, because you only have to compile the regex once, I would think that the benefits of

String thisWord;
while (Scanner.hasNext()) {
   thisWord = Scanner.next().replaceAll("[^a-zA-Z0-9]","");
   ...
}

far outweigh those of the matcher approach, because you don't have to initialize the matcher every time. (I understand the matcher exists already, so you are not recreating it.)

Can someone please explain where my reasoning goes wrong? Am I misunderstanding what Pattern.matcher() does?

Aharon K
  • Although this one in particular doesn't depend on machine specifics, you can/should benchmark it before asking. – user202729 Sep 22 '20 at 04:43
  • Comment: Pattern.compile [does not cache the result](https://stackoverflow.com/questions/13420321/does-pattern-compile-cache). – user202729 Sep 22 '20 at 04:44

2 Answers

1

In OpenJDK, String.replaceAll is defined as follows:

    public String replaceAll(String regex, String replacement) {
        return Pattern.compile(regex).matcher(this).replaceAll(replacement);
    }

So at least with that implementation, it won't give better performance than compiling the pattern only once and using Matcher.replaceAll.

It's possible that there are other JDK implementations where String.replaceAll is implemented differently, but I'd be very surprised if there were any where it performed better than Matcher.replaceAll.
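As a quick sanity check of that equivalence, here is a minimal sketch that runs both approaches on the same words. The class name `ReplaceAllComparison`, the helper method names, and the sample inputs are made up for illustration; only `Pattern.compile`, `Pattern.matcher`, `Matcher.replaceAll`, and `String.replaceAll` are real API calls.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names and sample inputs are illustrative only.
class ReplaceAllComparison {
    // Compiled once and reused for every word.
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]");

    // Reuses the precompiled pattern; only a new Matcher is created per call.
    static String stripWithPattern(String word) {
        return NON_ALNUM.matcher(word).replaceAll("");
    }

    // Goes through String.replaceAll, which (in the OpenJDK source quoted
    // above) compiles the regex again on every call.
    static String stripWithString(String word) {
        return word.replaceAll("[^a-zA-Z0-9]", "");
    }

    public static void main(String[] args) {
        String[] samples = {"hello,", "it's", "42!", "plain"};
        for (String sample : samples) {
            // Both columns should be identical for every sample.
            System.out.println(stripWithPattern(sample) + " " + stripWithString(sample));
        }
    }
}
```

Both helpers return the same string for any input; the only difference is how often the regex gets compiled. Whether that difference matters in practice depends on the workload, so it is worth benchmarking with real data.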


[…] due to not having to initialize the matcher every time. (I understand the matcher exists already, so you are not recreating it.)

I think you have a misunderstanding here. You really do create a new Matcher instance on each loop iteration; but that is very cheap, and not something to be concerned about performance-wise.
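To make that concrete, the following small sketch checks that two calls to Pattern.matcher on the same Pattern hand back two different Matcher objects that share one compiled Pattern. The class name `MatcherPerCall`, its method names, and the input strings are hypothetical and exist only for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the original question or answer.
class MatcherPerCall {
    private static final Pattern NON_ALNUM = Pattern.compile("[^a-zA-Z0-9]");

    // True if each call to Pattern.matcher returns a separate Matcher object.
    static boolean distinctMatchers() {
        Matcher first = NON_ALNUM.matcher("foo!");
        Matcher second = NON_ALNUM.matcher("bar?");
        return first != second;
    }

    // True if both matchers still refer to the one compiled Pattern.
    static boolean sharedPattern() {
        Matcher first = NON_ALNUM.matcher("foo!");
        Matcher second = NON_ALNUM.matcher("bar?");
        return first.pattern() == NON_ALNUM && second.pattern() == NON_ALNUM;
    }

    public static void main(String[] args) {
        System.out.println("distinct matchers: " + distinctMatchers());
        System.out.println("shared pattern: " + sharedPattern());
    }
}
```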


Incidentally, you don't actually need a separate 'matcher' variable if you don't want one; you'll get exactly the same behavior and performance if you write:

   thisWord = regexPattern.matcher(Scanner.next()).replaceAll("");
ruakh
  • Doesn't `Matcher matcher;` create the variable, and `Pattern.matcher()` initializes it, or is the variable naming just to sort of reserve the name, and not much else? – Aharon K Sep 22 '20 at 04:44
  • @AharonKatz You may want to read a book or something for that. It seems that you misunderstood the concept. In Java, a variable (with object type) is actually a reference. – user202729 Sep 22 '20 at 04:48
  • @AharonKatz: Terminology-wise, we say that `Matcher matcher;` *declares* the variable, and that `matcher = ...;` *assigns* a value to the variable. (The latter "initializes" the variable only the first time it's invoked.) Both are nearly free. Creating the instance of Matcher -- which happens inside the call to Pattern.matcher -- is a bit more expensive, though still quite cheap. – ruakh Sep 22 '20 at 04:52
  • @AharonKatz: My pleasure! Shana tova, BTW. :-) – ruakh Sep 22 '20 at 04:53
  • @ruakh I was curious about the name. You too. – Aharon K Sep 22 '20 at 04:54
  • (Unrelated to this question: there's a discussion starting [here](https://codeforces.com/blog/entry/83100?#comment-703247) that might interest you, about the XOR-every-subarray-to-zero question that was deleted.) – גלעד ברקן Sep 28 '20 at 11:49
0

A more efficient approach is to reset and reuse a single matcher. That way a new matcher, which copies much of the same information from the Pattern structure, is not created on every iteration of the loop.

Pattern regexPattern = Pattern.compile("[^a-zA-Z0-9]");
Matcher matcher = regexPattern.matcher("");
String thisWord;
while (Scanner.hasNext()) {
   matcher = matcher.reset(Scanner.next());
   thisWord = matcher.replaceAll("");
   // ...
} 

There is a one-off cost to create the matcher outside the loop with regexPattern.matcher(""), but the calls to matcher.reset(xxx) will be quicker because they reuse that matcher rather than creating a new Matcher instance each time. This reduces the amount of garbage collection required.
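For anyone who wants to try this without a Scanner, here is a self-contained sketch of the same idea that works on a list of words. The class name `ReusedMatcher`, the method `stripAll`, and the sample words are assumptions made for the example; `Matcher.reset(CharSequence)` is the real API call and returns the same matcher it was called on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; names and sample words are illustrative only.
class ReusedMatcher {
    static List<String> stripAll(List<String> words) {
        Pattern regexPattern = Pattern.compile("[^a-zA-Z0-9]");
        // One Matcher for the whole loop, created against an empty input.
        Matcher matcher = regexPattern.matcher("");
        List<String> cleaned = new ArrayList<>();
        for (String word : words) {
            // reset(...) points the existing matcher at the next word
            // instead of allocating a new Matcher object.
            cleaned.add(matcher.reset(word).replaceAll(""));
        }
        return cleaned;
    }

    public static void main(String[] args) {
        System.out.println(stripAll(Arrays.asList("hello,", "it's", "42!")));
    }
}
```

Note that a Matcher is not safe for use by multiple concurrent threads, so a shared, reused matcher like this only suits single-threaded loops. How much the reuse saves depends on the input, so measure before relying on it.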

DuncG