Background

On every large, commercial Java project I've worked on, I come across numerous usages of Pattern.compile(...) even in code segments which are re-used many times, e.g.

public String rewriteUrlWhichIsDoneABajillionTimes(final String requestedUrl) {
    Matcher m = Pattern.compile("^/([^/]+)\\.html$").matcher(requestedUrl);
    if (!m.matches()) {
        return null;
    }

    // Do processing here
    ...
}

On every project where I found code like this, I told at least one colleague that Pattern.compile(...) is very slow and that its result is not cached, but that java.util.regex.Pattern is thread-safe, so a compiled instance can be safely re-used. Each time, they told me that they had not known this.
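A minimal sketch of the re-use described above, assuming the same URL pattern as in the example. The class name `UrlRewriter`, the constant `HTML_PAGE`, and returning the captured page name are hypothetical stand-ins for the elided processing:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlRewriter {
    // Compiled once at class load; Pattern instances are immutable and thread-safe.
    private static final Pattern HTML_PAGE = Pattern.compile("^/([^/]+)\\.html$");

    static String rewriteUrl(final String requestedUrl) {
        // Matcher is NOT thread-safe, so a fresh one is created on every call.
        final Matcher m = HTML_PAGE.matcher(requestedUrl);
        if (!m.matches()) {
            return null;
        }
        // Hypothetical stand-in for the real processing: return the page name.
        return m.group(1);
    }
}
```

Only the compilation moves out of the hot path; the per-call `matcher(...)` call stays where it was.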

Potential solutions

Correct future usage of the API

One "solution" could be to (try to) force people to read the Java standard library documentation and to use the standard library "correctly", but prescriptive methods often do not work so well.

Correct past usage of the API

Alternatively (or complementarily), it would be possible to "clean up" any bad usages of Pattern.compile(...) wherever they are found, but this is likely to be a never-ending task, since (according to my experience) people will continue to use Pattern.compile(...) incorrectly over and over again...

Correct the API

So why not simply change the Pattern.compile(...) method so that it pools objects and returns the same instance for equivalent input? This would instantly apply a fix to possibly billions of lines of code around the world (as long as the code is run on a JRE which includes the change). The only possible downside I can imagine is that the software would have a larger memory footprint, but given how much memory most computers have these days, I doubt that this would cause problems anywhere other than in edge cases. On the other hand, a huge number of programs would likely run much faster. So why didn't/doesn't Oracle implement an object pool for Pattern, similar to what they did for strings or for primitives?
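To make the proposal concrete, here is a rough sketch of what such pooling could look like as a user-level wrapper. It is not how the JDK behaves; the class `PatternPool` and its key format are assumptions made for illustration, and "equivalent input" is taken to mean the same regex text and the same flags:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

// Hypothetical wrapper; not part of the Java standard library.
final class PatternPool {
    private static final ConcurrentMap<String, Pattern> POOL = new ConcurrentHashMap<>();

    static Pattern compile(final String regex) {
        return compile(regex, 0);
    }

    static Pattern compile(final String regex, final int flags) {
        // The flags are a decimal int, so the first ':' unambiguously ends them.
        final String key = flags + ":" + regex;
        // Compiles at most once per key; an invalid regex throws and is not cached.
        return POOL.computeIfAbsent(key, k -> Pattern.compile(regex, flags));
    }
}
```

This version never evicts anything, so it shows exactly the memory-footprint downside mentioned above: every distinct regex stays reachable for the lifetime of the class.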

errantlinguist
  • You might want to ask this on the core-libs-dev mailing list (or search their archive to see if this discussion has surfaced before). – nhahtdh Aug 12 '15 at 07:41
  • I'd like to answer with a question: why should they? It wouldn't "fix" billions of lines of code, because they aren't broken. If `Pattern` is noted to be a bottleneck, it can be corrected just like any inefficient part of code would be. – Kayaman Aug 12 '15 at 07:44
  • They expect the people using the API to cache such objects if need be (which frankly makes sense). The same strings and integers being reused is common (and hence caching makes sense there). The same patterns / matchers being reused is less common. What if I created 100 million *different patterns* and 100 million *different matchers*? This was done assuming that usage of Patterns would be less common than that of strings, integers etc. – TheLostMind Aug 12 '15 at 07:58
  • @Kayaman: At least in a happy perfect world, an API defines a contract of usage between the developer of code and the user of said code; just because calling `Pattern.compile(...)` may not be "broken" doesn't necessarily mean that its usage is not wrong. – errantlinguist Aug 12 '15 at 08:39
  • @TheLostMind: The Java string/integer/etc. pool has a maximum size, so I don't see how arguing about runaway memory usage is a good argument against implementing an object pool; I still can see no unavoidable negatives from implementing such. – errantlinguist Aug 12 '15 at 08:41
  • @errantlinguist - Integers have a cache of -128 to 127 (by default) and all string literals and interned strings get cached in the string constants pool. If I had a million Pattern / Matcher instances, I would also have to implement a mechanism to specify which ones to cache – TheLostMind Aug 12 '15 at 08:47
  • @errantlinguist Of course people can misuse `Pattern` without the code being broken, but it's not Oracle's job to fix them. If I inherit a project and I identify that one of its bottlenecks is using lots of `Patterns`, I fix the code. I don't expect that Oracle will provide a "fix" just because someone didn't know how to program. – Kayaman Aug 12 '15 at 08:52
  • Btw, [this article cited in the question](http://chrononsystems.com/blog/hidden-evils-of-javas-stringsplit-and-stringr) no longer applies to single-character splits where the character is not a metacharacter, from Java 8 and above. Splitting on a space now goes through the fast path, which doesn't create a Pattern. – nhahtdh Aug 12 '15 at 08:56
  • @TheLostMind: Why not e.g. store the first n instances of `Pattern` (e.g. 256 by default, as is the default maximum pool size for `Integer`) and then release the resources using the same strategy used for instances managed by a `SoftReference`? I still don't see how this is such a complicated task. If developers want a more sophisticated mechanism, they could still create their own (as people have to now). P.S. Please stop mentioning `Matcher`: it is not thread-safe and thus not comparable here, and I don't want to confuse readers. – errantlinguist Aug 12 '15 at 08:56
  • @Kayaman: It is nice to believe that Oracle shouldn't accommodate people who "don't know how to program", but, in reality, there is a massive amount of code which is using the Java standard libraries wrongly and there always will be. You can either be prescriptive and try to force people to change, or you can change the API. Which one sounds more effective in the long run? – errantlinguist Aug 12 '15 at 08:59
  • @errantlinguist If the class ain't broken, don't fix it. Especially with the shoddy justifications that you're providing. You have no case here, this is only a "it would be nice if" from your part. – Kayaman Aug 12 '15 at 09:02
  • @Kayaman: So why did Sun/Oracle make a pool for strings and primitives? They would also work fine without that functionality. – errantlinguist Aug 12 '15 at 09:24
  • @errantlinguist Because they saw obvious performance benefits for something that affects pretty much every program. It's a completely different thing. You're attempting to propose solutions to a problem that exists only in your head. – Kayaman Aug 12 '15 at 09:33
  • @Kayaman: The `String` and `Integer` classes still weren't broken before implementing pooling, so, according to you, they shouldn't have fixed them. Still, from your arguments, it feels like you've never seen an application using `Pattern` in the way described in the OP? Is my luck with projects really just that bad? – errantlinguist Aug 12 '15 at 09:37
  • The article you pointed to, http://chrononsystems.com/blog/hidden-evils-of-javas-stringsplit-and-stringr, doesn't implement proper benchmarking (for example, it doesn't implement any kind of JVM "warmup"). Further, even if we consider the results of this "benchmark" valid, a 200 ms difference over 1M runs is a difference of 0.0002 ms per call, which is negligible. The reason a regex is a relatively expensive operation is the portion that *validates* the regex, not creating the pattern object! – Nir Alfasi Sep 22 '15 at 16:01
  • One more thing: there will always be developers who use existing tools in the wrong way. Asking the architects of a language to guard against that by adding layers of caching "just in case" is a waste of memory and, as such, doesn't make sense. – Nir Alfasi Sep 22 '15 at 16:04
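The bounded pool discussed in the comments above (keep at most n patterns, e.g. 256) could be sketched roughly as follows. This is only an illustration: it swaps in least-recently-used eviction via `LinkedHashMap` instead of the `SoftReference` strategy that was suggested, it ignores compile flags, and the class name `BoundedPatternCache` is hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical bounded cache; not part of the Java standard library.
final class BoundedPatternCache {
    private final Map<String, Pattern> cache;

    BoundedPatternCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(final Map.Entry<String, Pattern> eldest) {
                // Drop the least-recently-used entry once the limit is exceeded.
                return size() > maxEntries;
            }
        };
    }

    // synchronized because LinkedHashMap is not thread-safe; in access order
    // even a lookup modifies the map's internal ordering.
    synchronized Pattern compile(final String regex) {
        return cache.computeIfAbsent(regex, Pattern::compile);
    }

    synchronized int cachedCount() {
        return cache.size();
    }
}
```

The memory bound comes at the cost of a lock on every lookup, which is one of the trade-offs a library-level pool would have to settle for all callers at once.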
