
If I have an input string and an array:

s = "to_be_or_not_to_be" 
pos = [15, 2, 8]

For each pair of consecutive elements of pos, I want the length of the longest common prefix of the suffixes of s that start at those positions. The output I am trying to get is:

longest = [3,1]

The way I obtained this is by computing the longest common prefix of the following pairs:

  • s[15:] which is _be and s[2:] which is _be_or_not_to_be giving 3 ( _be )
  • s[2:] which is _be_or_not_to_be and s[8:] which is _not_to_be giving 1 ( _ )

However, if s is huge, I don't want to create multiple copies when I do something like s[x:]. After hours of searching, I found the function buffer, which keeps only one copy of the input string, but I wasn't sure of the most efficient way to use it in this context. Any suggestions on how to achieve this?

Legend

4 Answers

>>> import os
>>> os.path.commonprefix([s[i:] for i in pos])
'_'

Let Python manage memory for you. Don't optimize prematurely.

To get the exact output you could do (as @agf suggested):

from os.path import commonprefix

print [len(commonprefix([buffer(s, i) for i in adj_indexes]))
       for adj_indexes in zip(pos, pos[1:])]
# -> [3, 1]
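
For readers on Python 3, where buffer and the print statement no longer exist, a minimal sketch of the same pairwise idea with plain slices might look like the following. The helper name pairwise_lcp is made up for illustration, and this version does copy each suffix, which is the trade-off of letting Python manage memory.

```python
from os.path import commonprefix

def pairwise_lcp(s, pos):
    """Common-prefix length of s[i:] and s[j:] for each consecutive (i, j) in pos."""
    # Plain slices copy the suffixes; acceptable unless s is very large.
    return [len(commonprefix([s[i:], s[j:]])) for i, j in zip(pos, pos[1:])]

print(pairwise_lcp("to_be_or_not_to_be", [15, 2, 8]))  # [3, 1]
```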
jfs

Here is a method without buffer which doesn't copy, as it only looks at one character at a time:

from itertools import islice, izip

s = "to_be_or_not_to_be"
pos = [15, 2, 8]


length = len(s)    

for start1, start2 in izip(pos, islice(pos, 1, None)):
    pref = 0
    for pos1, pos2 in izip(xrange(start1, length), xrange(start2, length)):
        if s[pos1] == s[pos2]:
            pref += 1
        else:
            break
    print pref
# prints 3 1

I use islice, izip, and xrange in case you're talking about potentially very long strings.
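
A rough Python 3 port of the loop above, as a sketch only: in Python 3, zip and range are already lazy, so izip and xrange are not needed. The function name lcp_lengths is hypothetical, and pos[1:] copies the (small) position list, not the string.

```python
def lcp_lengths(s, pos):
    """For each consecutive pair of start offsets, count matching characters."""
    n = len(s)
    result = []
    for start1, start2 in zip(pos, pos[1:]):
        count = 0
        # range objects are lazy, so no suffix of s is ever materialized
        for i, j in zip(range(start1, n), range(start2, n)):
            if s[i] != s[j]:
                break
            count += 1
        result.append(count)
    return result

print(lcp_lengths("to_be_or_not_to_be", [15, 2, 8]))  # [3, 1]
```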

I also couldn't resist this "One Liner" which doesn't even require any indexing:

[next((i for i, (a, b) in 
    enumerate(izip(islice(s, start1, None), islice(s, start2, None))) 
        if a != b), 
    length - max((start1, start2))) 
 for start1, start2 in izip(pos, islice(pos, 1, None))]

One final method, using os.path.commonprefix:

from os.path import commonprefix

[len(commonprefix((buffer(s, n), buffer(s, m)))) for n, m in zip(pos, pos[1:])]
agf
  • +1 Thank you. Let me check the performance of this snippet and get back soon. It definitely looks cool! :) – Legend Nov 10 '11 at 03:33
  • your `commonprefix()` solution is too complicated, see [my comment](http://stackoverflow.com/questions/8073808/longest-common-prefix-using-buffer/8073962#8073962) – jfs Nov 10 '11 at 03:53
  • @J.F.Sebastian I saw your comment; it's incorrect. His desired output is `[3, 1]`, not `_`. He wants _only the first two positions considered_, then _only the second two_, your version _considers all three at once_. – agf Nov 10 '11 at 03:55

I think your worry about copies is unfounded. See below:

>>> s = "how long is a piece of string...?"
>>> t = s[12:]
>>> print t
a piece of string...?
>>> id(t[0])
23295440
>>> id(s[12])
23295440
>>> id(t[2:20]) == id(s[14:32])
True

Unless you're copying the slices and leaving references to the copies hanging around, I wouldn't think it could cause any problem.


edit: There are technical details with string interning and stuff that I'm not really clear on myself. But I'm sure that a string slice is not always a copy:

>>> x = 'google.com'
>>> y = x[:]
>>> x is y
True

I guess the answer I'm trying to give is: to begin with, just let Python manage its memory itself; you can look at memory buffers and views later if needed. And if this is already a real problem for you, update your question with details of what the actual problem is.
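
If it does become a problem, one way to check what a slice actually does is to compare it with a view. The sketch below assumes Python 3 and a bytes object (memoryview needs a bytes-like object, not str); the variable names are illustrative, and the size comparison relies on CPython's sys.getsizeof, which is an implementation detail.

```python
import sys

data = b"x" * 1_000_000           # a large immutable bytes object
sliced = data[12:]                # an ordinary slice
view = memoryview(data)[12:]      # a window onto data's buffer

print(sliced is data)             # False: the slice is a separate object
print(view.obj is data)           # True: the view still refers to data
# In CPython the slice carries its own copy of the bytes, the view does not
print(sys.getsizeof(sliced), sys.getsizeof(view))
```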

wim
  • Hmm... Sorry if I am coming across as ignorant. This post tells me a different story: http://stackoverflow.com/questions/3422685/what-is-python-buffer-type-for Please look at the comments section of the accepted answer. Am I missing something? – Legend Nov 10 '11 at 01:07
  • Guess I did not read your last line. +1 Thank you for the clarification. – Legend Nov 10 '11 at 01:26
  • @Legend Only short strings are interned though, so if your strings are truly long, slicing will in fact create copies. – agf Nov 10 '11 at 01:54
  • @agf: Oh.. that's what I was looking for. Thanks! In that case, I am guessing that `buffer` does in fact reference the original string without creating copies? I observed a massive speedup in the code I put in my solution after using a buffer. – Legend Nov 10 '11 at 02:05
  • @Legend Correct. I'm about to add another answer that won't do any copying without using `buffer`, but may very well be slower. – agf Nov 10 '11 at 02:10
  • Here's a relevant post from Guido on Python-Dev, and the rest of the thread is an interesting read. http://mail.python.org/pipermail/python-dev/2008-May/079700.html – wim Nov 10 '11 at 03:42

One way of doing this using buffer is given below. However, there could be much faster ways.

s = "to_be_or_not_to_be" 
pos = [15, 2, 8]

lcp = []
length = len(pos) - 1

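for index in range(0, length):
    pre = buffer(s, pos[index])
    # buffer(object, offset, size): the third argument is a size, not an end offset
    cur = buffer(s, pos[index+1], len(pre))

    count = 0

    # min/max compare lexicographically; if one buffer is a prefix of the
    # other, the prefix sorts first, so indexing longer[i] stays in range
    shorter, longer = min(pre, cur), max(pre, cur)

    for i, c in enumerate(shorter):
        if c != longer[i]:
            break
        else:
            count += 1

    lcp.append(count)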

print lcp
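
Since buffer was removed in Python 3, a comparable zero-copy sketch there could use memoryview instead. This is an assumption-laden illustration, not a drop-in replacement: the name lcp_with_views is hypothetical, the string is encoded once to bytes, and the offsets only line up with character positions if s is ASCII.

```python
def lcp_with_views(s, pos):
    """Pairwise common-prefix lengths using memoryview windows (byte offsets)."""
    data = s.encode("utf-8")       # one copy; memoryview needs bytes-like data
    view = memoryview(data)
    result = []
    for i, j in zip(pos, pos[1:]):
        a, b = view[i:], view[j:]  # slicing a memoryview does not copy data
        count = 0
        for x, y in zip(a, b):     # iterating yields integer byte values
            if x != y:
                break
            count += 1
        result.append(count)
    return result

print(lcp_with_views("to_be_or_not_to_be", [15, 2, 8]))  # [3, 1]
```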
Legend
  • If you insist on using `buffer` you could do `os.path.commonprefix([buffer(s, i) for i in pos])` – jfs Nov 10 '11 at 03:08