1

I've created this script to compute string similarity in Python. Is there any way I can make it run faster?

tries = input()
while tries > 0:
    mainstr = raw_input()
    tot = 0
    ml = len(mainstr)
    for i in xrange(ml):
        j = 0
        substr = mainstr[i:]
        ll = len(substr)
        for j in xrange(ll):
            if substr[j] != mainstr[j]:
                break
            j = j + 1
        tot = tot + j
    print tot
    tries = tries - 1

EDIT: After applying some optimizations, this is the code, but it's still not fast enough!

tries = int(raw_input())
while tries > 0:
    mainstr = raw_input()
    tot = 0
    ml = len(mainstr)
    for i in xrange(ml):
        for j in xrange(ml-i):
            if mainstr[i+j] != mainstr[j]:
                break
            j += 1
        tot += j
    print tot
    tries = tries - 1

EDIT 2: The third version of the code. It's still too slow!

def mf():
    tries = int(raw_input())
    for _ in xrange(tries):
        mainstr = raw_input()
        tot = 0
        ml = len(mainstr)
        for i in xrange(ml):
            for j in xrange(ml-i):
                if mainstr[i+j] != mainstr[j]:
                    break
                j += 1
            tot += j
        print tot
mf()
2hamed

4 Answers

2

You can skip the memory allocation inside the loop. substr = mainstr[i:] allocates a new string unnecessarily. You only use it in substr[j] != mainstr[j], which is equivalent to mainstr[i + j] != mainstr[j], so you don't need to build substr.

Memory allocations are expensive, so you'll want to avoid them in tight loops.
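As a sketch of what this suggestion looks like in practice (this code is my illustration, not part of the original answer), the inner comparison can index into the original string directly, so no substring is ever built:

```python
def string_similarity(mainstr):
    """Sum of the common-prefix lengths of a string with each of its suffixes."""
    ml = len(mainstr)
    tot = 0
    for i in range(ml):
        # Compare in place: mainstr[i + j] vs mainstr[j] -- no slice allocated.
        j = 0
        while i + j < ml and mainstr[i + j] == mainstr[j]:
            j += 1
        tot += j
    return tot
```

This is still O(n**2) in the worst case, but it removes one O(n) allocation per outer iteration.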

Fred Foo
  • Still exceeds time limit by `0.2s`. – 2hamed Jul 20 '12 at 10:55
  • 1
    @EdwinDrood: well, I can't open the link you posted because it won't accept my older Firefox. But generally, when computing string similarities, you'd use some kind of dynamic programming algorithm, e.g. the one listed on Wikipedia for [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance). – Fred Foo Jul 20 '12 at 11:20
2

You could improve it by a constant factor by using i = mainstr.find(mainstr[0], i+1) instead of checking every i. A special case for i == 0 could also help.

Put the code inside a function; that might also speed things up by a constant factor.

Use for ... else: j += 1 to avoid incrementing j at each step.

Try to find a better than O(n**2) algorithm that exploits the fact that you compare all suffixes of the string.
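One sub-quadratic candidate for this (my addition, not spelled out in the original answer) is the Z-algorithm, which computes in O(n) exactly these quantities: z[i] is the length of the longest common prefix of the string and its suffix starting at i. A minimal sketch:

```python
def string_similarity_z(s):
    """O(n) string similarity via the Z-algorithm.

    z[i] = length of the longest common prefix of s and s[i:];
    the answer is the sum of all z[i].
    """
    n = len(s)
    if n == 0:
        return 0
    z = [0] * n
    z[0] = n  # the string matches itself entirely
    l, r = 0, 0  # [l, r) is the rightmost match window found so far
    for i in range(1, n):
        if i < r:
            # Reuse information from the window instead of rescanning.
            z[i] = min(r - i, z[i - l])
        # Extend the match by direct comparison where the window cannot help.
        while i + z[i] < n and s[z[i]] == s[i + z[i]]:
            z[i] += 1
        if i + z[i] > r:
            l, r = i, i + z[i]
    return sum(z)
```

Each character is compared successfully at most once across all iterations, which is what makes the whole pass linear.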

The most straightforward C implementation is 100 times faster than CPython (PyPy is 10-30 times faster) and passes the challenge. A Python version of the same approach:

import os

def string_similarity(string, _cp=os.path.commonprefix):
    return sum(len(_cp([string, string[i:]])) for i in xrange(len(string)))

for _ in xrange(int(raw_input())):
    print string_similarity(raw_input())

The above optimizations give only a few percent improvement and are not enough to pass the challenge in CPython (the Python time limit is only 8 times larger).

There is almost no difference (in CPython) between:

def string_similarity(string):
    len_string = len(string)
    total = len_string # similarity with itself
    for i in xrange(1, len_string):
        for n, c in enumerate(string[i:]):
            if c != string[n]:
                break
        else:
            n += 1

        total += n
    return total

And:

def string_similarity(string):
    len_string = len(string)
    total = len_string # similarity with itself
    i = 0
    while True:
        i = string.find(string[0], i+1)
        if i == -1:
            break
        n = 0
        for n in xrange(1, len_string-i):
            if string[i+n] != string[n]:
                break
        else:
            n += 1

        total += n
    return total
jfs
  • @larsmans: try something simple like: `i = 0 \n while i < 1000000: i += 1` at a module level and inside a function. See for yourself. – jfs Jul 20 '12 at 11:28
  • I immediately believed you, I was just wondering how that worked :) Is code at top-level compiled differently? – Fred Foo Jul 20 '12 at 11:33
  • 1
    @larsmans: I guess access to global names is slower than to local. – jfs Jul 20 '12 at 11:36
1

For such simple numeric scripts, there are just two things you have to do:

  • Use PyPy (it does not have complex dependencies and will be massively faster)

  • Put most of the code in a function. That speeds up stuff for both CPython and PyPy quite drastically. Instead of:

    some_code

do:

def main():
    some_code

if __name__ == '__main__':
    main()

That's pretty much it.

Cheers, fijal

fijal
  • Yeah, using PyPy dramatically decreased the time, but as I said I'm trying to submit the code to a contest, and only CPython is used there. Putting the code inside a function did not help much. – 2hamed Jul 21 '12 at 18:20
  • Complain that they don't support PyPy, it sounds like a lousy thing to do :) – fijal Jul 21 '12 at 20:04
0

Here's mine. It passes the test case, but may not be the absolute fastest.

import sys

def simstring(string, other):
    val = 0
    for l, r in zip(string, other):
        if l != r:
            return val
        val += 1
    return val


dsize = sys.stdin.readline()

for i in range(int(dsize)):
    ss = 0
    string = sys.stdin.readline().strip()
    suffix = string
    while suffix:
        ss += simstring(string, suffix)
        suffix = suffix[1:]
    sys.stdout.write(str(ss)+"\n")
Keith
  • Yours seems to be slower than mine! I tested it with `10000` chars and yours took about 8s; mine took about 4.8s. – 2hamed Jul 20 '12 at 16:24
  • 1
    Ah, well like I said it could be improved. But it's hard to compare times for different machines. Maybe I'll see how yours compares on my machine. – Keith Jul 20 '12 at 16:35
  • Well, generally you are right, but I tested both codes on my machine. – 2hamed Jul 20 '12 at 16:37