
Given two string variables $string and $needle in Perl, what's the most efficient way to check whether $string starts with $needle?

  • $string =~ /^\Q$needle\E/ is the closest match I could think of that does what is required but is the least efficient (by far) of the solutions I tried.
  • index($string, $needle) == 0 works and is relatively efficient for some values of $string and $needle but needlessly searches for the needle in other positions (if not found at the start).
  • substr($string, 0, length($needle)) eq $needle should be quite simple and efficient, but in most of my few tests is not more efficient than the previous one.
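As a minimal sketch, the three candidates above can be wrapped in small subs and checked against each other on a few inputs. The helper names (`prefix_regex`, `prefix_index`, `prefix_substr`) and the test cases are illustrative assumptions, not part of the original question; the point is only that all three agree, including when the needle contains a regex metacharacter.

```perl
use strict;
use warnings;

# Hypothetical helper names; each wraps one approach from the list above.
sub prefix_regex {
    my ($string, $needle) = @_;
    return $string =~ /^\Q$needle\E/ ? 1 : 0;
}

sub prefix_index {
    my ($string, $needle) = @_;
    return index($string, $needle) == 0 ? 1 : 0;
}

sub prefix_substr {
    my ($string, $needle) = @_;
    return substr($string, 0, length($needle)) eq $needle ? 1 : 0;
}

my @cases = (
    [ 'somewhat not so longish string', 'somew'   ],  # needle is a prefix
    [ 'somewhat not so longish string', 'longish' ],  # present, but not at the start
    [ 'a.c', 'a.'  ],                                 # metacharacter, literal match
    [ 'abc', 'a.'  ],                                 # "." must not act as a wildcard
    [ 'ab',  'abc' ],                                 # needle longer than string
);

for my $case (@cases) {
    my ($string, $needle) = @$case;
    printf "%-34s %-10s regex=%d index=%d substr=%d\n",
        "'$string'", "'$needle'",
        prefix_regex($string, $needle),
        prefix_index($string, $needle),
        prefix_substr($string, $needle);
}
```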

Is there a canonical way to do that in Perl that I'm not aware of, or any way to optimise any of the above solutions?

(In my particular use case, $string and $needle are going to be different in each run, so precompiling a regexp is not an option.)


Example of how to measure the performance of a given solution (here from a POSIX sh):

string='somewhat not so longish string' needle='somew'
time perl -e '
  ($n,$string,$needle) = @ARGV;
  for ($i=0;$i<$n;$i++) {

    index($string, $needle) == 0

  }' 10000000 "$string" "$needle"

With those values, index() performs better than substr()+eq on this system (perl 5.14.2), but with:

string="aaaaabaaaaabaaaaabaaaaabaaaaabaaaaab" needle="aaaaaa"

That's reversed.
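The same comparison could also be run from within Perl using the core `Benchmark` module. This is a sketch under assumptions: the labels and the one-second-per-candidate budget are arbitrary choices, and the numbers will vary by machine and perl version. Note that Perl may reuse a compiled pattern when the interpolated string is unchanged between iterations, so the regex figure here may look better than it would with a fresh $needle on every call.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# The two string/needle pairs used in the question.
my @inputs = (
    [ 'somewhat not so longish string',       'somew'  ],
    [ 'aaaaabaaaaabaaaaabaaaaabaaaaabaaaaab', 'aaaaaa' ],
);

for my $input (@inputs) {
    my ($string, $needle) = @$input;
    print "string='$string' needle='$needle'\n";

    # A negative count asks Benchmark to run each candidate for at least
    # that many CPU seconds, then prints a table of rates and relative speeds.
    cmpthese(-1, {
        regex  => sub { my $r = $string =~ /^\Q$needle\E/ },
        index  => sub { my $r = index($string, $needle) == 0 },
        substr => sub { my $r = substr($string, 0, length($needle)) eq $needle },
    });
    print "\n";
}
```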

Stephane Chazelas

2 Answers

rindex $string, $substring, 0

This searches for $substring in $string starting at a position <= 0, which is only possible if $substring is a prefix of $string; it returns 0 in that case and -1 otherwise. Example:

> rindex "abc", "a", 0
0
> rindex "abc", "b", 0
-1
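A small sketch of how this could be wrapped for reuse; the sub name `starts_with` is a hypothetical choice, not something defined by Perl itself.

```perl
use strict;
use warnings;

# Hypothetical helper name. rindex($string, $needle, 0) only considers a
# match that starts at offset 0, so the result is 0 for a prefix and -1
# otherwise; no other positions in $string are searched.
sub starts_with {
    my ($string, $needle) = @_;
    return rindex($string, $needle, 0) == 0;
}

for my $needle ('a', 'ab', 'b', 'abcd') {
    my $verdict = starts_with('abc', $needle) ? 'yes' : 'no';
    print "starts_with('abc', '$needle'): $verdict\n";
}
```

In a hot loop the function call itself adds overhead, so the `rindex(...) == 0` expression could be inlined instead.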
Gregory Kalabin
  • Perfect thanks. It's the feature I wasn't aware of and was looking for. I get similar timings for both test cases in the question, and it's faster than any of the other approaches on both. – Stephane Chazelas Apr 01 '19 at 12:30

How important is this, really? I ran a number of benchmarks: the index method averaged 0.68 microseconds per iteration, the regex method 1.14μs, and the substr method 0.16μs. Even in my worst-case scenarios (2250-char strings that were equal), index took 2.4μs, regex took 5.7μs, and substr took 0.5μs.

My advice is to write a library routine:

sub begins_with
{
    return substr($_[0], 0, length($_[1])) eq $_[1];
}

and focus your optimization efforts elsewhere.

UPDATE: Based on criticism of my "worst-case" scenario described above, I ran a new set of benchmarks with a 20,000-char randomly-generated string, comparing it against itself and against a string that differed only in the last byte.

For such long strings, the regex solution was by far the worst (a 20,000 character regex is hell): 105μs for the match success, 100μs for the match failure.

The index and substr solutions were still quite fast. index was 11.83μs / 11.86μs for success/failure, and substr was 4.09μs / 4.15μs. Moving the code to a separate function added about 0.222±0.05μs.

Benchmark code available at: http://codepaste.net/2k1y8e

I do not know the characteristics of @Stephane's data, but my advice stands.

Sue D. Nymme
  • For the sake of argument, you may assume that string matching is the critical point in my code and that all other possible optimisations have been done. In that regard, using a function will only decrease performance. But primarily, my hope when asking the question was that there was a better/canonical way to do that in perl. – Stephane Chazelas Jul 30 '15 at 14:51
  • For earlier `perl`s, you might want to [omit the return statement](http://www.effectiveperlprogramming.com/2014/06/perl-5-20-optimizes-return-at-the-end-of-a-subroutine/). – Sinan Ünür Jul 30 '15 at 15:27
  • @ikegami `substr` seems to be the fastest when the prefix does not match because it limits the extent of the search. – Sinan Ünür Jul 30 '15 at 15:28
  • NOT useless, @ikegami. Half of my benchmark cases were matches, half were match failures. – Sue D. Nymme Jul 30 '15 at 16:33
  • @SueD.Nymme: your posted answer is worded in a way that implies your worst-case test was only matching strings. Clearly the worst case for `index` is an extremely long haystack that doesn't contain the needle anywhere, so it has to check all the way to the end. I'd agree with your conclusion, though: just use `substr`, since we've shown that it's not slower in common cases. It should have a *much* better worst case, which is important for resisting DoS attacks (or accidental slowdowns). – Peter Cordes Jul 30 '15 at 16:52
  • Instead of simply dismissing my benchmark results, you could try to reproduce them. – Sue D. Nymme Jul 30 '15 at 19:37
  • @SueD.Nymme: you're only testing with `length($needle) == length($haystack)`, which is weird. You'd expect to see an even bigger difference between `substr` and `index` when `index` can spend time checking for matches at positions other than `0`. – Peter Cordes Jul 31 '15 at 01:52
  • @PeterCordes, and among the situations where the needle is not found in the string, some are worse than others, like the last example in the question, where at least 111 byte-to-byte comparisons ((6+5+4+3+2+1)*5+6) are needed for a string of length 36 and a needle of length 6. (It might even be the worst-case scenario for a string/needle of this length, which would make for another interesting question here.) – Stephane Chazelas Jul 31 '15 at 08:32
  • @StephaneChazelas: some smart regex engines will avoid pitfalls like that by checking if the end of the string matches, to avoid going far into a matching prefix over and over. `index` *could* do that, too. – Peter Cordes Jul 31 '15 at 08:46
  • @PeterCordes, I'm not sure I understand what you mean. In `string="aaaaabaaaaabaaaaabaaaaabaaaaabaaaaab" needle="aaaaaa"`, how would checking the trailing `aaaaab` against `aaaaaa` first reduce the number of comparisons needed? – Stephane Chazelas Jul 31 '15 at 09:02
  • @StephaneChazelas: oh you're right, in this case it won't help much if any, since there are more `a`s after every `b`. So it's only one mismatching character, rather than a matching prefix with mismatches later. – Peter Cordes Jul 31 '15 at 09:25
  • "How important is this, really?" Enough for the OP to ask the question and for you to write benchmarks. That opening question serves no purpose other than to ding the OP for asking, and is in conflict with the rest of the answer. The advice to write a library routine is independent of the question, and actually supports the question, since library routines should strive to be efficient. The most efficient implementation is `rindex($_[0], $_[1], 0) == 0`, and this unusual usage of `rindex` can be hidden away in the library routine along with a comment explaining it. – Jim Balter Aug 01 '19 at 19:59
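
The comparison count of 111 mentioned in the comments above can be reproduced with a short sketch. It models a naive left-to-right search that gives up on each start offset at the first mismatching character; this is an illustrative assumption and not a description of how Perl's `index` is actually implemented. The sub name `naive_comparisons` is hypothetical.

```perl
use strict;
use warnings;

# Counts single-character comparisons made by a naive search that tries
# every start offset in turn and stops comparing at the first mismatch.
# Returns as soon as a full match is found, as a real search would.
sub naive_comparisons {
    my ($string, $needle) = @_;
    my $count = 0;
    for my $start (0 .. length($string) - length($needle)) {
        my $matched = 1;
        for my $i (0 .. length($needle) - 1) {
            $count++;
            if (substr($string, $start + $i, 1) ne substr($needle, $i, 1)) {
                $matched = 0;
                last;
            }
        }
        return $count if $matched;
    }
    return $count;
}

# Second test case from the question.
my $string = 'aaaaabaaaaabaaaaabaaaaabaaaaabaaaaab';
my $needle = 'aaaaaa';
printf "string length=%d, needle length=%d, comparisons=%d\n",
    length($string), length($needle), naive_comparisons($string, $needle);
```

With a prefix-only check, the same inputs need at most `length($needle)` comparisons, which is the gap the `substr` and `rindex` approaches exploit.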