4

Python's len() and padding functions like string.ljust() are not tabstop-aware, i.e. they treat '\t' like any other single-width character, and don't round len() up to the nearest multiple of tabstop. Example:

len('Bear\tnecessities\t')

is 17 instead of 24 ( i.e. 4+(8-4)+11+(8-3) )

I also want a function pad_with_tabs(s, maxlen) such that

pad_with_tabs('Bear', 15) = 'Bear\t\t'

Looking for simple implementations of these - compactness and readability first, efficiency second. This is a basic but irritating question. @gnibbler - can you show a purely Pythonic solution, even if it's say 20x less efficient?

Sure you could convert back and forth using str.expandtabs(TABWIDTH), but that's clunky. Importing math to get TABWIDTH * int( math.ceil(len(s)*1.0/TABWIDTH) ) also seems like massive overkill.

I couldn't manage anything more elegant than the following:

TABWIDTH = 8

def pad_with_tabs(s,maxlen):
  s_len = len(s)
  while s_len < maxlen:
    s += '\t'
    s_len += TABWIDTH - (s_len % TABWIDTH)
  return s

Since Python strings are immutable (and unless we monkey-patch the function into the string module to add it as a method), we must also assign the function's result back to the variable:

s = pad_with_tabs(s, ...)

In particular I couldn't get clean approaches using list-comprehension or string.join(...):

''.join([s, '\t' * ntabs])

without special-casing the cases where len(s) is less than an integer multiple of TABWIDTH, or where len(s) >= maxlen already.

Can anyone show better len() and pad_with_tabs() functions?

smci
  • It's not clear what it is you want or how str.expandtabs doesn't fit the bill. Some example input and output would help clarify. –  Nov 17 '09 at 01:59
  • I said what I want in the first line: implementations of both len() and string.ljust() which are tabstop-aware. str.expandtabs() doesn't fit the bill because it blasts the tabs to spaces. We don't want that if we only want to measure len(). It seems wasteful to generate a throwaway copy by taking len(string.expandtabs(s)) – smci Nov 17 '09 at 02:18
  • 2
    "It seems wasteful to generate a throwaway copy by taking len(string.expandtabs(s))" Why? It seems simple and clean to me. Do you have any specific profile numbers that indicate that this is the bottleneck in your application? – S.Lott Nov 17 '09 at 02:44
  • If you *really* need the performance, just write the functions in C – John La Rooy Nov 17 '09 at 02:59
  • Folks - let's just neglect efficiency and go for compact readable Pythonic code? (even if it's say 20x less efficient) gnibbler has a good solution, any other entries? – smci Nov 17 '09 at 07:45
  • @smci: You provided compact readable code in your question, then said you didn't like it. First: you already provided the most compact and readable code in your question. Second: you asked about performance. If you want to ask a different question, please ask a new question. – S.Lott Nov 17 '09 at 11:28

4 Answers

8
TABWIDTH=8
def my_len(s):
    return len(s.expandtabs(TABWIDTH))

def pad_with_tabs(s,maxlen):
    # floor division (//) keeps the tab count an integer on Python 3 as well
    return s+"\t"*((maxlen-len(s)-1)//TABWIDTH+1)

Why did I use expandtabs()?
Well, it's fast:

$ python -m timeit '"Bear\tnecessities\t".expandtabs()'
1000000 loops, best of 3: 0.602 usec per loop
$ python -m timeit 'for c in "Bear\tnecessities\t":pass'
100000 loops, best of 3: 2.32 usec per loop
$ python -m timeit '[c for c in "Bear\tnecessities\t"]'
100000 loops, best of 3: 4.17 usec per loop
$ python -m timeit 'map(None,"Bear\tnecessities\t")'
100000 loops, best of 3: 2.25 usec per loop

Anything that iterates over your string is going to be slower, because just the iteration is ~4 times slower than expandtabs even when you do nothing in the loop.

$ python -m timeit '"Bear\tnecessities\t".split("\t")'
1000000 loops, best of 3: 0.868 usec per loop

Even just splitting on tabs takes longer, and you'd still need to iterate over the split and pad each item to the tabstop.
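For illustration only, here is a minimal sketch of what that split-based approach could look like. The name split_len is hypothetical (it is not part of the answer), and the sketch assumes the string contains no newlines or carriage returns, which expandtabs() treats as column resets.

```python
TABWIDTH = 8

def split_len(s, tabwidth=TABWIDTH):
    """Tab-aware length computed from the fields between tabs (hypothetical helper)."""
    *fields, last = s.split("\t")
    col = 0
    for field in fields:
        # Each of these fields is followed by a tab, which moves the column
        # to the first tab stop strictly after the end of the field.
        col = (col + len(field)) // tabwidth * tabwidth + tabwidth
    # The final field is not followed by a tab, so it just adds its length.
    return col + len(last)

print(split_len("Bear\tnecessities\t"))
```

On newline-free input this should agree with len(s.expandtabs(TABWIDTH)), but it still loops over the fields in Python, which is the overhead described above.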

John La Rooy
  • 2
    `pad_with_tabs` should probably call `my_len` instead of `len`, in case there are embedded tabs in the string to be tab-padded. – PaulMcG Nov 18 '09 at 07:41
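A minimal sketch of that suggestion, assuming Python 3. Note that it goes a step beyond the comment: instead of dividing the raw length difference, it counts tab stops (those needed to reach maxlen minus those already passed), which is an adaptation and not the formula from the answer.

```python
TABWIDTH = 8

def my_len(s):
    # Tab-aware length: measure the string with tabs expanded.
    return len(s.expandtabs(TABWIDTH))

def pad_with_tabs(s, maxlen):
    width = my_len(s)
    if width >= maxlen:
        return s  # already wide enough, nothing to add
    # -(-a // b) is integer ceiling division, so no math import is needed.
    stops_needed = -(-maxlen // TABWIDTH)
    stops_passed = width // TABWIDTH
    return s + "\t" * (stops_needed - stops_passed)

print(repr(pad_with_tabs("Bear", 15)))
```

Each appended tab moves the column to the next tab stop, so the padded string ends on the first tab stop at or beyond maxlen, even when s already contains tabs.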
1

I believe gnibbler's is the best for most practical cases. But anyway, here is a naive solution (it does not account for CR, LF, etc.) that computes the length of a string without creating an expanded copy:

def tab_aware_len(s, tabstop=8):
    pos = -1
    extra_length = 0
    while True:
        pos = s.find('\t', pos+1)
        if pos<0:
            return len(s) + extra_length
        extra_length += tabstop - (pos+extra_length) % tabstop - 1

It could be useful for huge strings or even memory-mapped files. And here is a slightly optimized padding function:

def pad_with_tabs(s, max_len, tabstop=8):
    length = tab_aware_len(s, tabstop)
    if length<max_len:
        s += '\t' * ((max_len-1)//tabstop + 1 - length//tabstop)
    return s
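
As a quick sanity check, the sketch below restates the two functions from this answer and compares tab_aware_len() with str.expandtabs() on newline-free input. The sample strings and the target width of 20 are made up for illustration.

```python
def tab_aware_len(s, tabstop=8):
    pos = -1
    extra_length = 0
    while True:
        pos = s.find('\t', pos + 1)
        if pos < 0:
            return len(s) + extra_length
        # A tab at expanded column (pos + extra_length) fills up to the next
        # tab stop; subtract 1 because len(s) already counts the tab itself.
        extra_length += tabstop - (pos + extra_length) % tabstop - 1

def pad_with_tabs(s, max_len, tabstop=8):
    length = tab_aware_len(s, tabstop)
    if length < max_len:
        s += '\t' * ((max_len - 1) // tabstop + 1 - length // tabstop)
    return s

samples = ['', 'Bear', 'Bear\tnecessities\t', '\t\tx', '12345678\t']
for s in samples:
    # Both ways of measuring should agree when there are no CR/LF characters.
    assert tab_aware_len(s) == len(s.expandtabs(8))
    padded = pad_with_tabs(s, 20)
    print(repr(padded), tab_aware_len(padded))
```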
Denis Otkidach
0

TABWIDTH * int( math.ceil(len(s)*1.0/TABWIDTH) ) is indeed a massive over-kill; you can get the same result much more simply. For positive i and n, use:

def round_up_positive_int(i, n):
    return ((i + n - 1) // n) * n

This procedure works in just about any language I've ever used, after appropriate translation.

Then you can do next_pos = round_up_positive_int(len(s), TABWIDTH)
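
A small sketch of how that rounding behaves around multiples of TABWIDTH. The helper next_tab_stop is hypothetical and added only for contrast: rounding up leaves an exact multiple unchanged, whereas an actual tab typed at a tab stop always advances a full TABWIDTH.

```python
TABWIDTH = 8

def round_up_positive_int(i, n):
    return ((i + n - 1) // n) * n

def next_tab_stop(col, n=TABWIDTH):
    # Hypothetical helper: the column reached after a tab typed at col.
    return (col // n + 1) * n

for length in (1, 7, 8, 9, 16, 17):
    print(length, round_up_positive_int(length, TABWIDTH), next_tab_stop(length))
```

The two columns differ only when the length is already a multiple of TABWIDTH, which is the case to watch when deciding whether one more tab is needed.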

For a slight increase in the elegance of your code, instead of

while(s_len < maxlen):

use this:

while s_len < maxlen:
John Machin
0

Unfortunately I was unable to use the accepted answer "as is", so here is a slightly modified version in case someone runs into the same problem and finds this post via search:

from decimal import Decimal, ROUND_HALF_UP
TABWIDTH = 4

def pad_with_tabs(src, max_len):
    return src + "\t" * int(
        Decimal((max_len - len(src.expandtabs(TABWIDTH))) / TABWIDTH + 1).quantize(0, ROUND_HALF_UP))


def pad_fields(input):
    result = []
    longest = max(len(x) for x in input)
    for row in input:
        result.append(pad_with_tabs(row, longest))
    return result

The output list contains properly padded rows with the tab count rounded, so the resulting data has the same indentation level in every row, including the .5 corner cases where the original answer adds no tab.

im_infamous