I used Google Caliper to benchmark two methods that validate an MDN (phone number) in a string. One is a hand-written ("user-defined") method; the other uses a regular expression. I was surprised to find that, on average, the regular-expression method takes about five times longer than the user-defined method.

Here is my benchmarking code.

package com.code4refernce.caliper;

import java.util.Random;
import java.util.regex.Pattern;

import com.google.caliper.Param;
import com.google.caliper.SimpleBenchmark;

public class SimpleCaliperTest extends SimpleBenchmark {
    String extensiveregex = "^\\d?(?:(?:[\\+]?(?:[\\d]{1,3}(?:[ ]+|[\\-.])))?[(]?(?:[\\d]{3})[\\-/)]?(?:[ ]+)?)?(?:[a-zA-Z2-9][a-zA-Z0-9 \\-.]{6,})(?:(?:[ ]+|[xX]|(i:ext[\\.]?)){1,2}(?:[\\d]{1,5}))?$";
    Pattern EXTENSIVE_REGEX_PATTERN = Pattern.compile(extensiveregex);

    String mdn[][];
    Random random;
    @Param
    int index;

    @Override
    protected void setUp() {
        random = new Random(0);
        mdn = new String[11][1<<16];
        for (int i=0; i<mdn.length; ++i) {
            mdn[0][i] = String.format("%03ddsfasdf00000", random.nextInt(1000));
            mdn[1][i] = String.format("%04d", random.nextInt(10000));
            mdn[2][i] = String.format("%10d", random.nextInt((int) 1e10));
            mdn[3][i] = String.format("-%10d", random.nextInt((int) 1e10));
            mdn[4][i] = String.format("%10d-", random.nextInt((int) 1e10));
            mdn[5][i] = String.format("%03d-%03d-%03d", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000));
            mdn[6][i] = String.format("-%03d-%03d-%03d-", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000));
            mdn[7][i] = String.format("%03d-%03d-%03d-", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000));
            mdn[8][i] = String.format("%03d-%03d-%03d ext %04d", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000), random.nextInt(10000));
            mdn[9][i] = String.format("%03d-%03d-%03d ext %04d-", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000), random.nextInt(10000));
            mdn[10][i] = "123456789012345677890";
        }
    }

    /**
     * Benchmarks the user-defined method that checks the MDN.
     */
    public boolean timeExtensiveSimpleMDNCheck(int reps){
        boolean results = false;
        for(int i = 0; i<reps; i ++){
            for(int index2=0; index2<mdn.length; index2++) {
                // Use the simple method to check the phone number in the string.
                results ^= extensiveMDNCheckRegularMethod(mdn[index][index2]);
            }
        }
        return results;
    }

    /**
     * Benchmarks the regex-based method.
     */
    public boolean timeExtensiveMDNRegexCheck(int reps){
        boolean results = false;
        for(int i = 0; i<reps; i ++){
            for(int index2=0; index2<mdn.length; index2++) {
                // Use the regular expression to check the phone number in the string.
                results ^= mdnExtensiveCheckRegEx(mdn[index][index2]);
            }
        }
        return results;
    }

    public boolean extensiveMDNCheckRegularMethod(String mdn){

        // Strip every character that is neither a digit nor 'x'.
        String stripedmdn = stripString(mdn);

        if(stripedmdn.length() >= 10 && stripedmdn.length() <= 11 && (!stripedmdn.contains("x") || !stripedmdn.contains("X"))){
            //For following condition
            //1-123-456-7868 or 123-456-7868
            return true;
        }else if ( stripedmdn.length() >= 15 && stripedmdn.length() <= 16  ) {
            //1-123-456-7868 ext 2345 or  123-456-7868 ext 2345
            //
            if ( stripedmdn.contains("x") ) {
                int index = stripedmdn.indexOf("x");
                if(index >= 9 && index <= 10){
                    return true;
                }
            }else if( stripedmdn.contains("X") ) {
                int index = stripedmdn.indexOf("X");
                if(index >= 9 && index <= 10){
                    return true;
                }
            }
        }
        return false;
    }

    /**
     * Strips all characters except 'x' and digits.
     * @param extendedMdn the raw MDN string, possibly with an extension
     * @return the stripped string
     */
    public String stripString(String extendedMdn){
        byte mdn[] = new byte[extendedMdn.length()];
        int index = 0;
        for(byte b : extendedMdn.getBytes()){
            if((b >= '0' && b <='9') || b == 'x'){
                mdn[index++] = b;
            }
        }
        return new String(mdn);
    }

    private boolean mdnExtensiveCheckRegEx(String mdn){
        return EXTENSIVE_REGEX_PATTERN.matcher(mdn).matches();
    }
}

And the main class which executes the benchmark:

package com.code4refernce.caliper;
import com.google.caliper.Runner;

public class CaliperRunner {
    public static void main(String[] args) {
        String myargs[] = new String[1];
        myargs[0] = new String("-Dindex=0,1,2,3,4,5,6,7,8,9,10");
        Runner.main(SimpleCaliperTest.class, myargs);
    }
}

The Caliper benchmark results are as follows:

Benchmark index            us  linear runtime
ExtensiveSimpleMDNCheck     0  5.44 =====
ExtensiveSimpleMDNCheck     1  4.34 ====
ExtensiveSimpleMDNCheck     2  5.02 =====
ExtensiveSimpleMDNCheck     3  5.08 =====
ExtensiveSimpleMDNCheck     4  4.92 ====
ExtensiveSimpleMDNCheck     5  4.83 ====
ExtensiveSimpleMDNCheck     6  4.87 ====
ExtensiveSimpleMDNCheck     7  4.72 ====
ExtensiveSimpleMDNCheck     8  5.14 =====
ExtensiveSimpleMDNCheck     9  5.25 =====
ExtensiveSimpleMDNCheck    10  5.57 =====
 ExtensiveMDNRegexCheck     0 17.71 =================
 ExtensiveMDNRegexCheck     1 21.73 =====================
 ExtensiveMDNRegexCheck     2 13.47 =============
 ExtensiveMDNRegexCheck     3  3.37 ===
 ExtensiveMDNRegexCheck     4 12.44 ============
 ExtensiveMDNRegexCheck     5 26.06 ==========================
 ExtensiveMDNRegexCheck     6  3.36 ===
 ExtensiveMDNRegexCheck     7 29.84 ==============================
 ExtensiveMDNRegexCheck     8 23.80 =======================
 ExtensiveMDNRegexCheck     9 24.01 ========================
 ExtensiveMDNRegexCheck    10 20.53 ====================

Am I missing something here? Why does the regular expression take longer to execute?

Peter Mortensen
Rakesh

1 Answer

A regex engine is only as good as the regexes you feed it, and your regex is very inefficient. I tried it in RegexBuddy with this input:

1-123-456-7868 x2345! 

...where the trailing ! makes sure it fails to match but does a lot of work in the process. Your regex took 142 steps to fail. Then I tweaked it by changing most of the non-capturing groups to atomic groups and making some of the quantifiers possessive, and it only required 35 steps to fail.

Just FYI, if you're going to have performance problems with a regex, it's overwhelmingly likely to be the failed match attempts where you'll see them, not the successful matches. When I remove the ! from the string above, your regex and mine both match in only 34 steps.
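To illustrate what atomic groups and possessive quantifiers actually do, here is a minimal sketch using toy patterns (the class name `AtomicGroupDemo` and the patterns are made up for illustration and are not part of the question's code). Once an atomic group or a possessive quantifier has consumed text, the engine will not go back into it to try shorter alternatives. That is what cuts down the work on a failing input, but it can also change what matches, so a tweaked regex should always be re-checked against known inputs.

```java
import java.util.regex.Pattern;

class AtomicGroupDemo {
    // Plain non-capturing group: a+ may give characters back when the rest fails.
    static final Pattern BACKTRACKING = Pattern.compile("(?:a+)ab");
    // Atomic group: whatever a+ consumed is locked in once the group is exited.
    static final Pattern ATOMIC = Pattern.compile("(?>a+)ab");
    // Possessive quantifier: the same effect for a single quantified token.
    static final Pattern POSSESSIVE = Pattern.compile("a++ab");

    static boolean matches(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        String input = "aaab";
        // a+ first takes "aaa", then backs off to "aa" so that "ab" can match.
        System.out.println(matches(BACKTRACKING, input)); // prints true
        // a+ takes "aaa" and never gives any of it back, so "ab" cannot match.
        System.out.println(matches(ATOMIC, input));       // prints false
        System.out.println(matches(POSSESSIVE, input));   // prints false
    }
}
```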


On a side note, your stripString() method is wrong in many ways. You should be using a StringBuilder to create the new string, and you should be comparing char values with other chars, not with bytes. Do yourself a favor and forget that the getBytes() method and the String(byte[]) constructor exist. If you must do String-to-byte[] or byte[]-to-String conversions, always use a method that lets you specify a Charset.
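A minimal sketch of what a `StringBuilder`-based version might look like, working on `char` values only (the class name `StripStringSketch` is hypothetical). Keeping an uppercase `'X'` as well is an assumption on my part: the original keeps only lowercase `'x'`, yet the calling code later looks for `"X"` too. Note also that the original sizes its byte array to the full input length, so the string it returns appears to carry trailing NUL characters, which would throw off the subsequent `length()` checks.

```java
class StripStringSketch {
    /** Keeps only ASCII digits and the letter x (either case); drops everything else. */
    static String stripString(String input) {
        StringBuilder kept = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            // Assumption: 'X' is kept alongside 'x'; the original kept only 'x'.
            if ((c >= '0' && c <= '9') || c == 'x' || c == 'X') {
                kept.append(c);
            }
        }
        // The builder holds exactly the kept characters, so no padding is returned.
        return kept.toString();
    }

    public static void main(String[] args) {
        System.out.println(stripString("1-123-456-7868 ext 2345")); // prints 11234567868x2345
        System.out.println(stripString("123-456-7868"));            // prints 1234567868
    }
}
```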


EDIT: Per the comment below, here's the tweaked regex as a Java string literal:

"^\\d?(?>(?>\\+?(?>\\d{1,3}(?:\\s+|[.-])))?\\(?\\d{3}[/)-]?\\s*)?+(?>[a-zA-Z2-9][a-zA-Z0-9\\s.-]{6,})(?>(?>\\s+|[xX]|(i:ext\\s?)){1,2}\\d{1,5})?+$"

...and in more readable form:

^
\d?
(?>
  (?>
    \+?
    (?>
      \d{1,3}
      (?:\s+|[.-])
    )
  )?
  \(?
  \d{3}
  [/)-]?
  \s*
)?+
(?>[a-zA-Z2-9][a-zA-Z0-9\s.-]{6,})
(?>
  (?>
    \s+
   |
    [xX]
   |
    (i:ext\s?)
  ){1,2}
  \d{1,5}
)?+
$

But I only wrote it to demonstrate the effect of atomic groups and possessive quantifiers; to that end, I left several other problems alone. My point was to demonstrate how much impact a badly written regex can have on the performance of your mdnExtensiveCheckRegEx() method.
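Because atomic groups and possessive quantifiers can change what a regex accepts, it is worth confirming that the two patterns still agree before benchmarking them against each other. The following is a small sketch of such a check (the class name `RegexComparison` and the helper `agree` are hypothetical; the two pattern strings are copied from the question and from the literal above). It only covers the two sample inputs discussed earlier and is not a full equivalence test.

```java
import java.util.regex.Pattern;

class RegexComparison {
    // The regex from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
        "^\\d?(?:(?:[\\+]?(?:[\\d]{1,3}(?:[ ]+|[\\-.])))?[(]?(?:[\\d]{3})[\\-/)]?(?:[ ]+)?)?"
        + "(?:[a-zA-Z2-9][a-zA-Z0-9 \\-.]{6,})"
        + "(?:(?:[ ]+|[xX]|(i:ext[\\.]?)){1,2}(?:[\\d]{1,5}))?$");

    // The tweaked regex with atomic groups and possessive quantifiers.
    static final Pattern TWEAKED = Pattern.compile(
        "^\\d?(?>(?>\\+?(?>\\d{1,3}(?:\\s+|[.-])))?\\(?\\d{3}[/)-]?\\s*)?+"
        + "(?>[a-zA-Z2-9][a-zA-Z0-9\\s.-]{6,})"
        + "(?>(?>\\s+|[xX]|(i:ext\\s?)){1,2}\\d{1,5})?+$");

    /** Returns true if both patterns give the same verdict for the input. */
    static boolean agree(String input) {
        return ORIGINAL.matcher(input).matches() == TWEAKED.matcher(input).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "1-123-456-7868 x2345", "1-123-456-7868 x2345!" };
        for (String sample : samples) {
            System.out.printf("%-24s original=%b tweaked=%b agree=%b%n",
                sample,
                ORIGINAL.matcher(sample).matches(),
                TWEAKED.matcher(sample).matches(),
                agree(sample));
        }
    }
}
```

Extending `samples` with the kinds of strings generated in `setUp()` would give a broader check before rerunning the Caliper benchmark.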

Alan Moore
  • Very good answer! For maximum efficiency I'd recommend `char[]` instead of `StringBuilder`, however either of them is much better than `byte[]`. – maaartinus Aug 31 '12 at 14:20
  • @Alan Thanks for your comment; I will take care of the suggested points. Can you please share the tweaked regex? Can you also show me how you performed the testing in RegexBuddy? – Rakesh Aug 31 '12 at 17:10
  • @Alan Thanks for providing the regular expression. Is there any tool that can verify the optimization of a regex? If there is one, please share it. Thanks. – Rakesh Sep 01 '12 at 05:18
  • Well, Caliper looks like it could be useful. ;) (Thanks for bringing that to my attention, by the way.) – Alan Moore Sep 01 '12 at 07:38
  • @AlanMoore I have tested both methods after incorporating your suggested modification, and the regex still takes longer than the user-defined method. I looked into the Java source code and found that several methods from different classes (Pattern, Matcher, Node, etc.) are involved in evaluating the regular expression, and in some places synchronized blocks are used, whereas the user-defined method has fewer method calls. To me, these factors explain why the regular expression check takes longer. Please share your thoughts if you have a different opinion. – Rakesh Sep 01 '12 at 16:54
  • @Rakesh: I disagree; method calls alone explain nothing, as they often have zero overhead. That said, a well-written handcrafted method should be faster, as it can directly test things for which a regexp engine uses indirection (a regexp engine is a sort of interpreter). *Check your method `stripString`, it looks very suspicious to me.* – maaartinus Sep 02 '12 at 15:43
  • @maaartinus Thanks for your opinion. I have changed the stripString() method as suggested by AlanMoore. Regarding method calls, I meant that these methods involve regex engine computation, which eventually adds to the method's execution time. – Rakesh Sep 02 '12 at 17:00