
I know there is something called interning in Python, so basically:

x, y = 1, 1
print(x is y) # True
x = 1234
y = 1234
print(x is y) # False

However, when I put this into a script and run it with the python command, it prints True twice. My guess is that there are some optimizations under the hood, but I cannot find any reference to them. Could someone explain what causes this behaviour and how to run the script without it?

I am on Ubuntu 20 and use CPython, version Python 3.9.9+ [GCC 9.3.0] on linux if that matters.

user3840170
kosciej16
  • A script is compiled in its entirety, and identical constants get noticed and merged together - this has nothing to do with interning. However, code at the interactive prompt gets compiled statement-by-statement, so this optimization cannot take place. – jasonharper Jan 24 '22 at 21:04

2 Answers


The first, and only important, thing you have to know: you can't rely on the "sameness" of Python literals, be they ints, strings, or whatever.

So keep in mind that all of this is irrelevant in practice, apart from one fact: numbers, strings, and even True and False must always be compared with ==, never with the is operator, in any code intended to work consistently.

That said, the reason a saved script always prints True, while the result in interactive mode depends on version, runtime, lunar phase, and CPU architecture, is simple:

With a script, the code is executed only after all of it has been compiled. In interactive mode, each statement is compiled and executed independently as you go.

So, when the compiler "sees" the same constant in the same block of code (the 1234 integer), it simply reuses the object it already created as a constant: it is a straightforward optimization.

In interactive mode, by contrast, each occurrence of the literal is "seen" while compiling a new, separate block of code with its own internal state, so the earlier object is not reused.
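A minimal sketch of this difference, using the built-in compile() and exec() to imitate both modes. The namespace dictionaries and the "<script>" / "<prompt-N>" file names are made up for illustration, and the printed results are CPython implementation details rather than guarantees:

```python
# Imitate a script: both assignments are compiled together as one unit.
script_ns = {}
exec(compile("x = 1234\ny = 1234\n", "<script>", "exec"), script_ns)
same_unit = script_ns["x"] is script_ns["y"]
print(same_unit)   # True on CPython: one code object, one shared constant

# Imitate the interactive prompt: each statement is compiled on its own.
prompt_ns = {}
exec(compile("x = 1234\n", "<prompt-1>", "exec"), prompt_ns)
exec(compile("y = 1234\n", "<prompt-2>", "exec"), prompt_ns)
separate = prompt_ns["x"] is prompt_ns["y"]
print(separate)    # usually False on CPython 3.9; not something to rely on
```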

Regardless of the outputs and of this reasoning: this is not to be trusted. It is not part of the language specification. Compare numbers with == only, and forget that they might or might not be the same object in memory. It is irrelevant either way.
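As a small illustration (the variable names are arbitrary): two ints with equal values always compare equal with ==, whether or not they happen to be the same object.

```python
a = 1234
b = int("1234")   # built at run time, so not taken from the compiler's constants

print(a == b)   # True: equality of values is guaranteed by the language
print(a is b)   # implementation detail; do not rely on either outcome
```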

jsbueno

It’s called constant pooling, and it’s a pretty standard technique when implementing interpreters.

>>> def f():
...     x = 1234
...     y = 1234
...     return x is y
...
>>> f()
True
>>> import dis
>>> dis.dis(f)
  2           0 LOAD_CONST               1 (1234)
              2 STORE_FAST               0 (x)

  3           4 LOAD_CONST               1 (1234)
              6 STORE_FAST               1 (y)

  4           8 LOAD_FAST                0 (x)
             10 LOAD_FAST                1 (y)
             12 IS_OP                    0
             14 RETURN_VALUE

Each closed (self-contained) piece of bytecode carries a constant pool with it. When the compiler parses a suite as a single unit, literals found in the code at compile time are added to the pool; when the same value is encountered again, the existing pool slot is reused. When the function bytecode is later executed, the values are loaded from the pool onto the value stack and manipulated there. Here, both instances of the literal 1234 end up as reads from the same pool slot 1 (slot 0 is occupied by None in this function). Because they read from the same slot, they read the same object, which is, of course, the same as itself.
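The pool itself can be inspected through the code object's co_consts attribute. A short sketch, assuming CPython; the exact contents of the tuple can differ between versions:

```python
def f():
    x = 1234
    y = 1234
    return x is y

pool = f.__code__.co_consts   # the constant pool of f's bytecode
print(pool)                   # contains 1234 only once
print(pool.count(1234))       # 1: both literals share a single slot
```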

Pooling can be applied not only to literals, but also to values obtained by constant folding:

>>> def g():
...     x = 4
...     y = 2 + 2
...     return x is y
...
>>> dis.dis(g)
  2           0 LOAD_CONST               1 (4)
              2 STORE_FAST               0 (x)

  3           4 LOAD_CONST               1 (4)
              6 STORE_FAST               1 (y)

  4           8 LOAD_FAST                0 (x)
             10 LOAD_FAST                1 (y)
             12 IS_OP                    0
             14 RETURN_VALUE
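The same check can be made on the pool directly. This sketch assumes CPython and uses larger numbers than the example above (an arbitrary choice) so that the small-integer cache plays no part in the result:

```python
def g():
    x = 2000
    y = 1000 + 1000   # folded to the constant 2000 at compile time
    return x is y

pool = g.__code__.co_consts
print(pool.count(2000))   # 1 on CPython: the folded value reuses the literal's slot
print(g())                # True on CPython, since both names load the same object
```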

At the REPL prompt, every prompt triggers a separate compilation, which does not share a constant pool with any other; doing otherwise would arguably amount to having a memory leak. As such, number literals that are not otherwise interned end up referring to different objects when they are provided at different prompts.

>>> x = 1234
>>> y = 1234
>>> id(x)
140478281905648
>>> id(y)
140478281906160
>>> x is y
False

Constant pooling is pretty fundamental to the design of CPython and cannot be disabled as such. After all, the bytecode has no way to refer to a hardcoded value other than by referring to the constant pool. There is also no option that disables reusing constant pool slots for already-encountered values. But if you’re crazy enough…

def deadpool(func):
    import dis
    import opcode
    import functools

    new_cpool = [None]
    new_bcode = bytearray(func.__code__.co_code)
    _Func = type(lambda: 0)

    def pool(value):
        idx = len(new_cpool)
        new_cpool.append(value)
        return idx

    def clone(val):
        if type(val) is int:  # exact ints only: bool is an int subclass, and int('True') raises
            return int(str(val))
        return val

    op_EXTENDED_ARG = opcode.opmap['EXTENDED_ARG']
    op_LOAD_CONST = opcode.opmap['LOAD_CONST']

    insn_ext = None
    for insn in dis.get_instructions(func):
        if insn.opcode == op_LOAD_CONST:
            idx = pool(clone(func.__code__.co_consts[insn.arg]))
            assert idx < 256 or (idx < 65536 and insn_ext is not None)
            new_bcode[insn.offset + 1] = idx & 0xff
            if insn_ext:
                new_bcode[insn_ext.offset + 1] = idx >> 8
            insn_ext = None
        elif insn.opcode == op_EXTENDED_ARG:
            assert insn_ext is None
            insn_ext = insn
        else:
            insn_ext = None

    return functools.wraps(func)(_Func(
        func.__code__.replace(
            co_code=bytes(new_bcode),
            co_consts=tuple(new_cpool)
        ),
        func.__globals__,
        func.__name__,
        func.__defaults__,
        func.__closure__
    ))

def f():
    x = 1234
    y = 1234
    return x is y

@deadpool
def g():
    x = 1234
    y = 1234
    return x is y

print(f())   # True
print(g())   # False

…you can re-write the bytecode so that each constant load refers to a different slot, and then attempt to put a distinct, though otherwise indistinguishable object in each slot. (The above is just a proof-of-concept; there are some corner cases on which it fails, which would be much more laborious to cover fully.)

The above can be made to run in PyPy with only slight modifications. The results, however, will be different, because PyPy does not expose the identity of integer objects and always compares them by value, even when using the is operator. And after all, why should it not? As the other answer rightly points out, identity of primitives is an implementation detail with which you should not be concerned when writing ordinary code, and even most extraordinary code.

user3840170