
This is the Python version of the same C++ question.

Given a number, num, what is the fastest way to strip off the trailing zeros from its binary representation?

For example, let num = 232. We have bin(num) equal to 0b11101000 and we would like to strip the trailing zeros, which would produce 0b11101. This can be done via string manipulation, but it'd probably be faster via bit manipulation. So far, I have thought of something using num & -num.

Assuming num != 0, num & -num produces the binary 0b1<trailing zeros>. For example,

num   0b11101000
-num  0b00011000
&         0b1000
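As a quick sanity check of this identity (the variable names below are mine, not from the question):

```python
# num & -num isolates the lowest set bit of num (for num > 0):
# -num is the two's complement, which flips every bit above the
# lowest set bit, so only that bit survives the AND.
num = 232            # 0b11101000
lowest = num & -num  # 0b1000 == 8
print(bin(num), bin(lowest))  # 0b11101000 0b1000
```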

If we have a dict having powers of two as keys and the powers as values, we could use that to know by how much to right bit shift num in order to strip just the trailing zeros:

#        0b1     0b10     0b100     0b1000
POW2s = {  1: 0,    2: 1,     4: 2,      8: 3, ... }

def stripTrailingZeros(num):
  pow2 = num & -num
  pow_ = POW2s[pow2]  # equivalent to math.log2(pow2), but hopefully faster
  return num >> pow_

The use of dictionary POW2s trades space for speed - the alternative is to use math.log2(pow2).


Is there a faster way?


Perhaps another useful tidbit is num ^ (num - 1) which produces 0b1!<trailing zeros> where !<trailing zeros> means take the trailing zeros and flip them into ones. For example,

num    0b11101000
num-1  0b11100111
^          0b1111
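A quick check of this identity as well (names are mine); note that the bit length of this mask, minus one, is exactly the number of trailing zeros, which gives yet another way to compute the shift amount:

```python
num = 232                    # 0b11101000
mask = num ^ (num - 1)       # 0b1111: lowest set bit plus the flipped trailing zeros
ntz = mask.bit_length() - 1  # number of trailing zeros: 3
print(bin(mask), ntz)        # 0b1111 3
print(bin(num >> ntz))       # 0b11101
```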

Yet another alternative is to use a while loop:

def stripTrailingZeros_iterative(num):
  while num & 0b1 == 0:  # equivalent to `num % 2 == 0`
    num >>= 1
  return num

Ultimately, I need to execute this function on a big list of numbers. Once I do that, I want the maximum. So if I have [64, 38, 22, 20] to begin with, I would have [1, 19, 11, 5] after performing the stripping. Then I would want the maximum of that, which is 19.
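Putting the pieces together, the end-to-end computation might look like this sketch, using the dict-based stripTrailingZeros from above (the table here is sized for these particular inputs):

```python
# Map each power of two to its exponent; keys up to 2**6 = 64 cover
# the example inputs below.
POW2s = {1 << p: p for p in range(7)}

def stripTrailingZeros(num):
    pow2 = num & -num        # isolate the lowest set bit
    return num >> POW2s[pow2]

nums = [64, 38, 22, 20]
stripped = [stripTrailingZeros(n) for n in nums]
print(stripped)       # [1, 19, 11, 5]
print(max(stripped))  # 19
```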

Trenton McKinney
joseville
    You don't really need to divide by 2 until you have an odd result, to find if a number is odd. If it has a 1 as the least significant bit, it's odd. So why not simply compute the maximum of all numbers whose least significant bit is odd, since that's your end goal, instead of all the manipulation first? – Grismar Mar 22 '22 at 01:46
  • @Grismar Thanks! I wasn't clear, sorry. I want to find the maximum of the numbers after I have performed the stripping to them. So if I have `[64, 38, 22, 20]` to begin with, I would have `[1, 19, 11, 5]` after performing the stripping. Then I would want the maximum of that, which is `19`. – joseville Mar 22 '22 at 01:54

4 Answers


There's really no answer to questions like this in the absence of specifying the expected distribution of inputs. For example, if all inputs are in range(256), you can't beat a single indexed lookup into a precomputed list of the 256 possible cases.

If inputs can be two bytes, but you don't want to burn the space for 2**16 precomputed results, it's hard to beat (assuming that_table[i] gives the count of trailing zeroes in i):

low = i & 0xff
result = that_table[low] if low else 8 + that_table[i >> 8]

And so on.
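One way such a table might be precomputed (the construction and the ntz16 wrapper below are mine, not from the answer):

```python
# that_table[i] = count of trailing zeros in the byte value i.
# Index 0 is left at 0 and never consulted: the lookup code checks
# low != 0 before indexing.
that_table = [0] * 256
for i in range(1, 256):
    that_table[i] = (i & -i).bit_length() - 1

def ntz16(i):
    """Trailing-zero count of a nonzero 16-bit integer via two byte lookups."""
    low = i & 0xff
    return that_table[low] if low else 8 + that_table[i >> 8]

print(ntz16(232))     # 3
print(ntz16(0x0100))  # 8
```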

You do not want to rely on log2(). The accuracy of that is entirely up to the C library on the platform CPython is compiled for.

What I actually use, in a context where ints can be up to hundreds of millions of bits:

    assert d

    if d & 1 == 0:
        ntz = (d & -d).bit_length() - 1
        d >>= ntz

A while loop would be a disaster in this context, taking time quadratic in the number of bits shifted off. Even one needless shift in that context would be a significant expense, which is why the code above first checks to see that at least one bit needs to be shifted off. But if ints "are much smaller", that check would probably cost more than it saves. "No answer in the absence of specifying the expected distribution of inputs."
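Wrapped as a self-contained function (the wrapper and its name are mine; the body is the answer's fragment):

```python
def strip_trailing_zeros(d):
    # Single shift by the trailing-zero count; the parity check avoids
    # a needless shift-by-zero when d is already odd.
    assert d
    if d & 1 == 0:
        ntz = (d & -d).bit_length() - 1
        d >>= ntz
    return d

print(strip_trailing_zeros(232))  # 29 (0b11101)
print(strip_trailing_zeros(7))    # 7, already odd
```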

Tim Peters
  • Thank you! The numbers are in the range `1` to `10^9`, so up to 30-bits long, but let's just say they're up to 32-bits for simplicity. – joseville Mar 22 '22 at 03:29
    Nope, that can make a big difference - really! On a 64-bit box, 30 bits fit in a single CPython internal "digit", but 32 bits do not - those require 2-"digit" internal bigints. Entirely different algorithms come into play. For 30 bits, @Nick's answer using int division looks promising, but much less so if multi-"digit" int division needs to be used. If you need to do serious work, install `gmpy2` and use its integer type's `.bit_scan1()` to search directly for the least-significant 1 bit (and then shift). – Tim Peters Mar 22 '22 at 03:38
  • thanks once again! multi-"digit" int division would only be used for the so called 2+ "digit" internal bigints, right? – joseville Mar 22 '22 at 03:41
    The bit length of the numerator doesn't matter. If the denominator fits in a single internal digit (at most 30 bits on a 64-bit box; at most 15 on a 32-bit box), then an especially simple division algorithm can be used. Else a fully general all-possible-cases algorithm is used. – Tim Peters Mar 22 '22 at 03:45
    So, really, in practice, I expect it's most likely that the denominators in @Nick's code will usually fit in a single Python bigint "digit" after all. – Tim Peters Mar 22 '22 at 03:48

On my computer, a simple integer divide is fastest:

import timeit
timeit.timeit(setup='num=232', stmt='num // (num & -num)')
0.1088077999993402
timeit.timeit(setup='d = { 1: 0, 2 : 1, 4: 2, 8 : 3, 16 : 4, 32 : 5 }; num=232', stmt='num >> d[num & -num]')
0.13014470000052825
timeit.timeit(setup='import math; num=232', stmt='num >> int(math.log2(num & -num))')
0.2980690999993385
Nick

You say you want to "Ultimately, [..] execute this function on a big list of numbers to get odd numbers and find the maximum of said odd numbers."

So why not simply:

from random import randint


numbers = [randint(0, 10000) for _ in range(5000)]


odd_numbers = [n for n in numbers if n & 1]
max_odd = max(odd_numbers)
print(max_odd)

To do what you say you ultimately want, there seems to be little point in performing the "shift right until the result is odd" operation, unless you want the maximum of the result of that operation performed on all elements, which is not what you stated.

I agree with @TimPeters' answer, but if you put Python through its paces and actually generate some data sets to try the various solutions proposed, they maintain their relative spread for any integer size when using Python ints. Your best option is integer division for numbers up to 32 bits; after that, see the chart below:

from pandas import DataFrame
from timeit import timeit
import math
from random import randint


def reduce0(ns):
    return [n // (n & -n)
            for n in ns]


def reduce1(ns, d):
    return [n >> d[n & -n]
            for n in ns]


def reduce2(ns):
    return [n >> int(math.log2(n & -n))
            for n in ns]


def reduce3(ns, t):
    return [n >> t.index(n & -n)
            for n in ns]


def reduce4(ns):
    return [n if n & 1 else n >> ((n & -n).bit_length() - 1)
            for n in ns]


def single5(n):
    while (n & 0xffffffff) == 0:
        n >>= 32
    if (n & 0xffff) == 0:
        n >>= 16
    if (n & 0xff) == 0:
        n >>= 8
    if (n & 0xf) == 0:
        n >>= 4
    if (n & 0x3) == 0:
        n >>= 2
    if (n & 0x1) == 0:
        n >>= 1
    return n


def reduce5(ns):
    return [single5(n)
            for n in ns]


numbers = [randint(1, 2 ** 16 - 1) for _ in range(5000)]
d = {2 ** n: n for n in range(16)}
t = tuple(2 ** n for n in range(16))
assert(reduce0(numbers) == reduce1(numbers, d) == reduce2(numbers) == reduce3(numbers, t) == reduce4(numbers) == reduce5(numbers))

df = DataFrame([{}, {}, {}, {}, {}, {}])
for p in range(1, 16):
    p = 2 ** p
    numbers = [randint(1, 2 ** p - 1) for _ in range(4096)]

    d = {2**n: n for n in range(p)}
    t = tuple(2 ** n for n in range(p))

    df[p] = [
        timeit(lambda: reduce0(numbers), number=100),
        timeit(lambda: reduce1(numbers, d), number=100),
        timeit(lambda: reduce2(numbers), number=100),
        timeit(lambda: reduce3(numbers, t), number=100),
        timeit(lambda: reduce4(numbers), number=100),
        timeit(lambda: reduce5(numbers), number=100)
    ]
    print(f'Complete for {p} bit numbers.')


print(df)
df.to_csv('test_results.csv')

Result (when plotted in Excel): [chart: local machine results (updated)]

Note that the plot that was previously here was wrong! The code and data were not, though. The code has been updated to include @MarkRansom's solution, since it turns out to be the optimal solution for very large numbers (over 4k-bit numbers).

Grismar
  • Thanks. I wasn't clear, sorry. *"Unless you want the maximum of the result of that operation performed on all elements,"* That is what I want. Edited the question to hopefully make it clear. – joseville Mar 22 '22 at 01:57
    I think user TimPeters provided the correct answer, i.e. "it depends", but I think @Nick actually gave you the fastest method. – Grismar Mar 22 '22 at 04:20
  • Thanks so much for this profiling/analysis! `reduce1` and `reduce3`, which use the dictionary and tuple, respectively, would be even slower if the dict/tuple creation time was taken into account. This, ofc, only reinforces your main point. – joseville Mar 22 '22 at 05:15
  • ``` for p in range(1, 16): p = 2 ** p numbers = [randint(1, 2 ** p - 1) for _ in range(4096)] ``` So in the last iteration of the `for` loop, we have `p = 15`, then `p = 2 ** 15 = 32768 `, then that means `randint(1, 2 ** p - 1) = randint(1, 2 ** (32768) - 1)` which means the numbers being generated are up to `32768`-bit long. Wow! So the x-axis of your Excel graph is max bit-length, not max value (as I had mistakenly thought at first glance). – joseville Mar 22 '22 at 05:20
  • thanks again. I took your profiling/timing code, added two more implementations of `reduce` (based on Mark Ransom's answer) and ran the tests on repl.it (https://replit.com/@joseville/Strip-Trailing-Zeros). I don't know if it's because I ran it in repl.it, but my results were different than yours. My results say that `reduce4` is the fastest, at least as bit-lengths gets larger than ~16: https://joseville.tumblr.com/post/679450667886297088/httpsstackoverflowcoma7156607613881506 – joseville Mar 22 '22 at 18:55
    @joseville I did the same thing on my own PC with Python 3.8, and came up with results similar to yours. My own code was dismally slow on short bit lengths. But by the time you hit 4096 bits it was the fastest, and at 32678 bits it was almost twice as fast as `reduce4`! – Mark Ransom Mar 23 '22 at 00:46
  • The one from @MarkRansom relies on random/average numbers, though, so I wouldn't phrase it as *"the optimal solution for very large numbers"*. It's terrible for very large numbers with lots of trailing zeros. Relying on random/average numbers, a simple `while not n & 1: n >>= 1` seems about twice as fast for small numbers and only a bit slower for very large numbers. – Kelly Bundy May 29 '23 at 03:50
  • @KellyBundy how would you test something like this except by random numbers? OP didn't mention any restrictions or expected distributions of numbers, except for noting deep in a comment that they'll all be 30 bits or less. True random numbers are extremely unlikely to have large numbers of trailing zeros. – Mark Ransom May 31 '23 at 00:40
  • @MarkRansom *"True random numbers are extremely unlikely to have large numbers of trailing zeros"* - Yes, that's part of my point. How I'd test? I'd do *two* benchmarks. One with "average" cases (random numbers) and one with worst cases (depends on the solution and I didn't think them all through, but likely one 1-bit followed by only zeros, or first half random and second half zeros). – Kelly Bundy May 31 '23 at 03:06
while (num & 0xffffffff) == 0:
    num >>= 32
if (num & 0xffff) == 0:
    num >>= 16
if (num & 0xff) == 0:
    num >>= 8
if (num & 0xf) == 0:
    num >>= 4
if (num & 0x3) == 0:
    num >>= 2
if (num & 0x1) == 0:
    num >>= 1

The idea here is to perform as few shifts as possible. The initial while loop handles numbers that are over 32 bits long, which I consider unlikely, but it has to be provided for completeness. After that, each statement shifts half as many bits; if you can't shift by 16, then the most you could shift is 15, which is 8+4+2+1. All possible cases are covered by those 5 if statements.
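As a sanity check, the cascade above can be wrapped in a function (the wrapper and its name are mine; the body is the answer's code verbatim):

```python
def strip_cascade(num):
    # At most one while-iteration per 32 trailing zeros, then a fixed
    # sequence of halving checks: 16, 8, 4, 2, 1.
    while (num & 0xffffffff) == 0:
        num >>= 32
    if (num & 0xffff) == 0:
        num >>= 16
    if (num & 0xff) == 0:
        num >>= 8
    if (num & 0xf) == 0:
        num >>= 4
    if (num & 0x3) == 0:
        num >>= 2
    if (num & 0x1) == 0:
        num >>= 1
    return num

print(strip_cascade(232))      # 29 (0b11101)
print(strip_cascade(1 << 40))  # 1
```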

Mark Ransom
    Although the number of *different* shifts is minimised (as a shift of 5 would be a shift of 4 followed by a shift of 1), the total number of shifts would be the same or higher. The other solutions all propose a single shift only, for the price of computing by how much the shift needs to be, or avoiding the relatively cheap shift and relatively costly computation by using integer division. Your solution accepts a few shifts, it avoids the computation required for a single shift - however, it trades that for a number of bitwise 'and's, and if and while blocks - did you check if that's faster? – Grismar Mar 23 '22 at 02:41
    I did, it is, for large numbers – Grismar Mar 23 '22 at 03:01
    @Grismar thanks for the update. If we restrict ourselves to real world conditions which will almost always be 32 bits or less, it looks like `reduce0` is the winner. I could probably tweak my solution to be much faster with that limitation, but it hardly seems worth it. – Mark Ransom Mar 23 '22 at 03:20