
I wrote a Python 3 script which tests an SPI link to an FPGA. It runs on a Raspberry Pi 3. The test works like this: after the FPGA is put into test mode (via a push switch), the first byte is sent; it can be any value. Further bytes are then sent indefinitely, each incrementing by the first value sent, truncated to 8 bits. Thus, if the first value is 37, the FPGA expects the following sequence:

37, 74, 111, 148, 185, 222, 4, 41 ...
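
For reference, here is a minimal sketch (a hypothetical helper, not part of the test script) that reproduces this sequence. Note that the values above wrap like addition modulo 255 (222 + 37 = 259 gives 4), i.e. an 8-bit add with end-around carry rather than a plain mask:

```python
def spi_sequence(seed, count):
    """Generate the byte sequence the FPGA expects, starting from `seed`.

    Each value increments by the seed; the wrap shown above
    (222 + 37 -> 4) corresponds to an 8-bit add with end-around carry.
    """
    val = seed
    seq = []
    for _ in range(count):
        seq.append(val)
        val += seed
        if val > 255:       # end-around carry instead of a plain & 0xFF
            val -= 255
    return seq

print(spi_sequence(37, 8))  # [37, 74, 111, 148, 185, 222, 4, 41]
```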

Some additional IO pins are used to signal between the devices: RUN (an RPi output) starts the test (necessary because the FPGA times out after about 15 ms if an expected byte does not arrive), and ERR (an FPGA output) signals an error. Errors can thus be counted at both ends.

In addition, the RPi script writes a one-line summary of bytes sent and number of errors every million bytes.

All of this works just fine. But after running for about 3 days, I get the following error on the RPi:

free(): invalid pointer: 0x00405340

I get this exact same error on two identical test setups, even at the same memory address. The last report says "4294M bytes sent, 0 errors".

I seem to have proved the SPI link, but I am concerned that this long-running program crashes for no apparent reason.

Here is the important part of my test code:

    def _report(self, msg):
        now = datetime.datetime.now()
        os.system("echo \"{} : {}\" > spitest_last.log".format(now, msg))

    def spi_test(self):
        global end_loop
        input("Put the FPGA board into SPI test mode (SW1) and press any key")
        self._set_run(True)
        self.END_LOOP = False
        print("SPI test is running, CTRL-C to end.")
        # first byte is sent without LOAD, this is the seed
        self._send_byte(self._val)
        self._next_val()
        end_loop = False
        err_flag = False
        err_cnt = 0
        byte_count = 1
        while not end_loop:
            mb = byte_count % 1000000 
            if mb == 0:
                msg = "{}M bytes sent, {} errors".format(int(byte_count/1000000), err_cnt)
                print("\r" + msg, end="")
                self._report(msg)
                err_flag = True
            else:
                err_flag = False
            #print("sending: {}".format(self._val))
            self._set_load(True)
            if self._errors and err_flag:
                self._send_byte(self._val + 1)
            else:
                self._send_byte(self._val)
            if self.is_error():
                err_cnt += 1
                msg = "{}M bytes sent, {} errors".format(int(byte_count/1000000), err_cnt)
                print("\r{}".format(msg), end="")
                self._report(msg)
            self._set_load(False)
            # increase the value by the seed and truncate to 8 bits
            self._next_val()
            byte_count += 1

        # test is done
        input("\nSPI test ended ({} bytes sent, {} errors). Press ENTER to end.".format(byte_count, err_cnt))
        self._set_run(False)

(Note for clarification: there is a command-line option to artificially create an error every million bytes, hence the `err_flag` variable.)

I've tried using python3 in console mode, and there seems to be no issue with the size of the byte_count variable (nor should there be, according to what I have read about Python's integer size limits).

Anyone have an idea as to what might cause this?

danmcb
  • Did Python give an error traceback, such as what line it crashed on? You can put your code in a `try except`, but that won't fix the issue, it will just prevent it from breaking. Or is it an error on the Pi only, not from Python? – Error - Syntactical Remorse Apr 08 '19 at 14:49
  • no, that's pretty much the whole error output (except the name of the exe, which is python3). I could use a try/except, but as no part of my code is doing this, that doesn't really help. The only thing I can think of here is to try to reproduce on an RPi without the FPGA (which should be possible as long as the error pin stays inactive) and then try removing stuff until it doesn't break. But at 3 days per try ... (OK, that would get a bit faster without the hardware being involved, but still) ... – danmcb Apr 08 '19 at 15:34
  • and of course I could try on an a standard linux distro on a laptop without the GPIO/SPI stuff. But surely there must be some ideas around as to what *could* do this before I start doing all that (not sure I really have time, TBH). – danmcb Apr 08 '19 at 15:35
  • to me it looks like some kind of obscure issue with python itself or maybe garbage collection, as there doesn't seem to be anything in my code that grows over time (except the byte count). But I thought I'd get other opinions before making that kind of conjecture. – danmcb Apr 08 '19 at 15:37
  • I agree it is weird. If it is a memory issue you could use a `string` instead of an `int`, since the memory would be treated differently with a string. It may be a memory issue. Although Python `int`s are unbounded, you are giving it such a big value over time (since your loop has no delay) that it may be having an issue. It could also be that when Python is trying to expand the `int` size you are writing to it, which gives the memory error. Can you add a minor sleep inside of your `while` loop? – Error - Syntactical Remorse Apr 08 '19 at 15:54
  • Also how big is the RAM on your raspberry pi? 4294M is a significant amount of RAM. I would convert `byte_count` to something like a clock: when you get 1024 bytes convert it to 1Kb:0b, when you get 1024 Kb convert to 1Mb:0Kb:0b, and so on. – Error - Syntactical Remorse Apr 08 '19 at 16:04
  • 4294M has nothing to do with memory size. It's just the number of times the loop has executed. It is necessary to count this to determine the error rate (if there were any errors). It's an RPi3, so it has 1GB RAM. – danmcb Apr 09 '19 at 07:39
  • if it were an issue with the size of byte_count, then I should have problems doing things like 4295000007 % 1000000 in a terminal. I don't. – danmcb Apr 09 '19 at 07:40
  • my suspicion is that it is a memory leak in one of the libraries. I will see if I can reproduce on a standalone board, and then remove libs one by one until it goes. When I have time ... – danmcb Apr 09 '19 at 07:42

2 Answers


This issue affects spidev versions older than 3.5 only. The comments below were made under the assumption that I was using the upgraded version of spidev.

#############################################################################

I can confirm this problem. It occurs on both the RPi3B and RPi4B, using Python 3.7.3 on both. The spidev versions I tried were 3.3, 3.4 and the latest, 3.5. I was able to reproduce this error several times by simply looping over this single line:

spidevice2.xfer2([0x00, 0x00, 0x00, 0x00])

It takes up to 11 hours depending on the RPi version. After 1073014000 calls (rounded to the nearest 1000), the script crashes because of an "invalid pointer". The total number of bytes sent is the same as in danmcb's case. It seems as if 2^32 bytes represent a limit.
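
A quick back-of-the-envelope check (assuming 4 bytes per xfer2 call) shows how closely the crash point lines up with a 32-bit byte counter:

```python
# Does the crash point line up with a 32-bit byte counter?
calls = 1_073_014_000        # xfer2() calls before the crash (rounded to 1000)
bytes_per_call = 4           # xfer2([0x00, 0x00, 0x00, 0x00]) sends 4 bytes
total = calls * bytes_per_call
print(total)                 # 4292056000
print(2**32)                 # 4294967296
print(abs(total - 2**32) / 2**32)  # roughly 0.0007, i.e. within 0.1%
```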

I tried different approaches, for example calling close() from time to time followed by open(). This did not help.

Then I tried to create the SpiDev object locally, so that it would be re-created for every batch of data:

def spiLoop():
    spidevice2 = spidev.SpiDev()
    spidevice2.open(0, 1)
    spidevice2.max_speed_hz = 15000000
    spidevice2.mode = 1 # Data is clocked in on falling edge
    
    for j in range(100000):
        spidevice2.xfer2([0x00, 0x00, 0x00, 0x00])
        
    spidevice2.close()

It still crashed after approx. 2^30 calls of xfer2([0x00, 0x00, 0x00, 0x00]), which corresponds to approx. 2^32 bytes.

EDIT1

To speed up the process, I sent the data in blocks of 4096 bytes using the code below, repeatedly creating the SpiDev object locally. It took 2 hours to reach the 2^32 byte count.

def spiLoop():
    spidevice2 = spidev.SpiDev()
    spidevice2.open(0, 1)
    spidevice2.max_speed_hz = 25000000
    spidevice2.mode = 1 # Data is clocked in on falling edge
    
    to_send = [0x00] * 2**12 # 4096 bytes
    for j in range(100):
        spidevice2.xfer2(to_send)
        
    spidevice2.close()
    del spidevice2

def runSPI():
    for i in range(2**31 - 1):
        spiLoop()            
        print((2**12 * 100 * (i + 1)) / 2**20, 'Mbytes')

(Screenshot: SPI crashed after sending 2^32 bytes.)

EDIT2

Reloading the spidev module on the fly does not help either. I tried this code on both RPi3 and RPi4 with the same result:

import importlib
def spiLoop():
    importlib.reload(spidev)
    spidevice2 = spidev.SpiDev()
    spidevice2.open(0, 1)
    spidevice2.max_speed_hz = 25000000
    spidevice2.mode = 1 # Data is clocked in on falling edge
    
    to_send = [0x00] * 2**12 # 4096 bytes
    for j in range(100):
        spidevice2.xfer2(to_send)
        
    spidevice2.close()
    del spidevice2

def runSPI():
    for i in range(2**31 - 1):
        spiLoop()            
        print((2**12 * 100 * (i + 1)) / 2**20, 'Mbytes')

(Screenshot: reloading the spidev package does not help.)

EDIT3

Executing the code as a snippet via exec() did not isolate the problem either. It crashed after the 4th 1-Gbyte chunk of data was sent.

program = '''
import spidev
spidevice = None

def configSPI():
    global spidevice
    
    # We only have SPI bus 0 available to us on the Pi
    bus = 0
    #Device is the chip select pin. Set to 0 or 1, depending on the connections
    device = 1

    spidevice = spidev.SpiDev()
    spidevice.open(bus, device)
    spidevice.max_speed_hz = 250000000
    
    spidevice.mode = 1 # Data is clocked in on falling edge

def spiLoop():
    to_send = [0xAA] * 2**12
    loops = 1024
    for j in range(loops):
        spidevice.xfer2(to_send)
    
    return len(to_send) * loops    

configSPI()
bytes_total = 0

while True:
    bytes_sent = spiLoop()
    bytes_total += bytes_sent            
    print(int(bytes_total / 2**20), "Mbytes", int(1000 * (bytes_total / 2**30)) / 10, "% finished")
    if bytes_total > 2**30:
        break

'''
for i in range(100):
    exec(program)
    print("program executed", i + 1, "times, bytes sent > ", (i + 1) * 2**30)


Dmitri M.

I believe the original asker's issue is a reference leak, specifically py-spidev issue 91. Said reference leak has been fixed in the 3.5 release of spidev.

Python uses a shared pool of objects to represent small integer values*, rather than re-creating them each time. So when code leaks references to small numbers, the result is not a memory leak but a reference count that keeps increasing. The Python spidev library had an issue where it leaked references to small integers in this way.
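
The sharing is easy to demonstrate from a Python prompt; every occurrence of a small value refers to the same cached object, so its reference count is process-wide:

```python
import sys

a = 0
b = 256 - 256
print(a is b)              # True: both names refer to the one cached 0 object
print(sys.getrefcount(0))  # large: shared by everything in the process

# A C extension that forgets a Py_DECREF on such an object does not leak
# memory; it just inflates this shared reference count until, on a 32-bit
# build, the counter wraps around.
```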

On a 32-bit system** the eventual result is that the reference count overflows. Then something decrements the overflowed reference count to zero, and the reference-counting system tries to free an object that was never individually heap-allocated, hence the free(): invalid pointer.

What I can't explain is the other answer's claim that the issue can still be reproduced with 3.5; it was supposed to have been fixed in that version.

* Specifically, numbers in the range -5 to 256 inclusive: everything that can be represented in an unsigned byte, plus a few negative values (presumably because they are commonly used as error returns) and 256 (presumably because it's often used as a multiplier).

** On a 64-bit system the reference count will not overflow within a human lifetime.

plugwash
  • The original asker's issue could have been solved by using spidev 3.5. I **wrongly** claimed that I observed this with ver. 3.5. The upgrade was done for python2.7 while python3 still had ver. 3.3/3.4 installed. – Dmitri M. Sep 28 '20 at 15:11