3

I am writing a data acquisitioning program that needs to

  • wait for serial with select()

  • read serial data (RS232 at 115200 baud),

  • timestamp it (clock_gettime()),

  • read an ADC on SPI,

  • interpret it,

  • send new data over another tty device

  • loop and repeat

The ADC is irrelevant for now.

At the end of the loop I use select() again with a 0 timeout to poll and see whether data is available already, if it is it means I have overrun , I.e. I expect the loop to end before more data and for the select() at the start of the loop to block and get it as soon as it arrives.

The data should arrive every 5ms, my first select() timeout is calculated as (5.5ms - loop time) - which should be about 4ms.

I get no timeouts but many overruns.

Examining the timestamps reveals that select() blocks for longer than the timeout (but still returns>0). It looks like select() returns late after getting data before timeout.

This might happen 20 times in 1000 repeats. What might be the cause? How do I fix it?

EDIT: Here is cut down version of the code (I do much more error checking than this!)

#include <bcm2835.h> /* for bcm2835_init(), bcm2835_close() */

int main(int argc, char **argv){

    int err = 0;

    /* Set real time priority SCHED_FIFO */
    struct sched_param sp;
    sp.sched_priority = 30;
    if ( pthread_setschedparam(pthread_self(), SCHED_FIFO, &sp) ){
        perror("pthread_setschedparam():");
        err = 1;
    }

    /* 5ms between samples on /dev/ttyUSB0 */
    int interval = 5;

    /* Setup tty devices with termios, both totally uncooked, 8 bit, odd parity, 1 stop bit, 115200baud */
    int fd_wc=setup_serial("/dev/ttyAMA0");
    int fd_sc=setup_serial("/dev/ttyUSB0");

    /* Setup GPIO for SPI, SPI mode, clock is ~1MHz which equates to more than 50ksps */
    bcm2835_init();
    setup_mcp3201spi();

    int collecting = 1;

    struct timespec starttime;
    struct timespec time;
    struct timespec ftime;
    ftime.tv_nsec = 0;

    fd_set readfds;
    int countfd;
    struct timeval interval_timeout;
    struct timeval notime;

    uint16_t p1;
    float w1;

    uint8_t *datap = malloc(8);
    int data_size;
    char output[25];

    clock_gettime(CLOCK_MONOTONIC, &starttime);

    while ( !err && collecting ){   
        /* Set timeout to (5*1.2)ms - (looptime)ms, or 0 if looptime was longer than (5*1.2)ms */
        interval_timeout.tv_sec = 0;
        interval_timeout.tv_usec = interval * 1200 - ftime.tv_nsec / 1000;
        interval_timeout.tv_usec = (interval_timeout.tv_usec < 0)? 0 : interval_timeout.tv_usec;
        FD_ZERO(&readfds);
        FD_SET(fd_wc, &readfds);    
        FD_SET(0, &readfds); /* so that we can quit, code not included */   
        if ( (countfd=select(fd_wc+1, &readfds, NULL, NULL, &interval_timeout))<0 ){
            perror("select()");
            err = 1;
        } else if (countfd == 0){
            printf("Timeout on select()\n");
            fflush(stdout);
            err = 1;
        } else if (FD_ISSET(fd_wc, &readfds)){
            /* timestamp for when data is just available */
            clock_gettime(CLOCK_MONOTONIC, &time)
            if (starttime.tv_nsec > time.tv_nsec){
                time.tv_nsec = 1000000000 + time.tv_nsec - starttime.tv_nsec;
                time.tv_sec = time.tv_sec - starttime.tv_sec - 1;
            } else {
                time.tv_nsec = time.tv_nsec - starttime.tv_nsec;
                time.tv_sec = time.tv_sec - starttime.tv_sec;
            }

            /* get ADC value, which is sampled fast so corresponds to timestamp */
            p1 = getADCvalue();

            /* receive_frame, receiving is slower so do it after getting ADC value. It is timestamped anyway */
            /* This function consists of a loop that gets data from serial 1 byte at a time until a 'frame' is collected. */
            /* it uses select() with a very short timeout (enough for 1 byte at baudrate) just to check comms are still going */
            /* It never times out and behaves well */
            /* The interval_timeout is passed because it is used as a timeout for responding an ACK to the device */
            /* That select also never times out */
            ireceive_frame(&datap, fd_wc, &data_size, interval_timeout.tv_sec, interval_timeout.tv_usec);

            /* do stuff with it */
            /* This takes most of the time in the loop, about 1.3ms at 115200 baud */
            snprintf(output, 24, "%d.%04d,%d,%.2f\n", time.tv_sec, time.tv_nsec/100000, pressure, w1);
            write(fd_sc, output, strnlen(output, 23)); 

            /* Check how long the loop took (minus the polling select() that follows */ 
            clock_gettime(CLOCK_MONOTONIC, &ftime);
            if ((time.tv_nsec+starttime.tv_nsec) > ftime.tv_nsec){
                ftime.tv_nsec = 1000000000 + ftime.tv_nsec - time.tv_nsec - starttime.tv_nsec;
                ftime.tv_sec = ftime.tv_sec - time.tv_sec - starttime.tv_sec - 1;
            } else {
                ftime.tv_nsec = ftime.tv_nsec - time.tv_nsec - starttime.tv_nsec;
                ftime.tv_sec = ftime.tv_sec - time.tv_sec - starttime.tv_sec; 
            }

            /* Poll with 0 timeout to check that data hasn't arrived before we're ready yet */
            FD_ZERO(&readfds);
            FD_SET(fd_wc, &readfds);
            notime.tv_sec = 0;  
            notime.tv_usec = 0; 
            if ( !err && ( (countfd=select(fd_wc+1, &readfds, NULL, NULL, &notime)) < 0 )){
                perror("select()");
                err = 1;
            } else if (countfd > 0){
                printf("OVERRUN!\n");
                snprintf(output, 25, ",,,%d.%04d\n\n", ftime.tv_sec, ftime.tv_nsec/100000);
                write(fd_sc, output, strnlen(output, 24)); 
            }

        }

    }


    return 0;

}

The timestamps I see on the serial stream that I output is fairly regular (a deviation is caught up by the next loop usually). A snippet of output:

6.1810,0,225.25
6.1867,0,225.25
6.1922,0,225.25
6,2063,0,225.25
,,,0.0010

Here, up to 6.1922s everything is OK. The next sample is 6.2063 - 14.1ms after the last, but it didn't time out nor did the previous loop from 6.1922-6.2063 catch the overrun with the polling select(). My conclusion is that the last loop was withing the sampling time and select took -10ms too long return without timing out.

The ,,,0.0010 indicates the loop time (ftime) of the loop after - I should really be checking what the loop time was when it went wrong. I'll try that tomorrow.

  • How are you setting your masks?, and are you passing a null pointer for time argument, or setting the values of the struct to zero. Each results in its own behavior. – ryyker Feb 20 '17 at 14:54
  • You may have to include a _[SSCCE](http://sscce.org/)_ to illustrate the problem further. There is evidently something other than the timeout issue causing your issue. This will also attract a broader audience. – ryyker Feb 20 '17 at 15:47
  • Added Raspberry-pi to tags. – ryyker Feb 20 '17 at 15:53

2 Answers2

4

The timeout passed to select is a rough lower bound — select is allowed to delay your process for slightly more than that. In particular, your process will be delayed if it is preempted by a different process (a context switch), or by interrupt handling in the kernel.

Here is what the Linux manual page has to say on the subject:

Note that the timeout interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.

And here's the POSIX standard:

Implementations may also place limitations on the granularity of timeout intervals. If the requested timeout interval requires a finer granularity than the implementation supports, the actual timeout interval shall be rounded up to the next supported value.

Avoiding that is difficult on a general purpose system. You will get reasonable results, especially on a multi-core system, by locking your process in memory (mlockall) and setting your process to a real-time priority (use sched_setscheduler with SCHED_FIFO, and remember to sleep often enough to give other processes a chance to run).

A much more difficult approach is to use a real-time microcontroller that is dedicated to running the real-time code. Some people claim to reliably sample at 20MHz on fairly cheap hardware using that technique.

jch
  • 5,382
  • 22
  • 41
  • I've read that, hence my comment asking about preemption on ryykers answer. Would select() not time out if this happens but only return late? And could there really be so much going on that it overruns by 5ms? I've set SCHED_FIFO already with little difference. I'll try mlockall tomorrow (even though I'm using very little memory). I'm not sleeping at all, but I'd have thought that just spending the time in select() does the same. I'm only trying 250samples/sec here – Stefan Hartman Feb 20 '17 at 19:37
  • 1
    Yes, I woudn't be surprised by a badly-written driver introducing latencies on the order of 10ms. It looks like the SPI driver on the RPi has some issues — see https://www.raspberrypi.org/forums/viewtopic.php?t=19489 – jch Feb 20 '17 at 19:46
  • I'm using the bcm2835 library (http://www.airspayce.com/mikem/bcm2835/) that doesn't have the same problems (I believe). I will try without reading from the ADC tomorrow, though I've seen reports of much more that 200 samples/second. I'll edit my question with some code shortly. – Stefan Hartman Feb 20 '17 at 19:51
  • I have now tried adding all the POSIX real time extensions that I thought might help: generous amount of (absolute) clock_nanosleep(), mlockall, and setting SCHRD_FIFO with max priority. With further testing by measuring time everywhere I have concluded that the problem is actually not in select() but happens in different places, though strangely not during my receive select(). I am going to create a new question with new information and link that here and this there. Thanks for info. – Stefan Hartman Feb 25 '17 at 15:05
  • Makes sense — if there's an interrupt handler that holds the CPU for extended periods of time, then the delays can happen anywhere. – jch Feb 25 '17 at 15:43
3

If values for struct timeval are set to zero, then select will not block, but if timeout argument is a NULL pointer, it will...

If the timeout argument is not a NULL pointer, it points to an object of type struct timeval that specifies a maximum interval to wait for the selection to complete. If the timeout argument points to an object of type struct timeval whose members are 0, select() does not block. If the timeout argument is a NULL pointer, select() blocks until an event causes one of the masks to be returned with a valid (non-zero) value or until a signal occurs that needs to be delivered. If the time limit expires before any event occurs that would cause one of the masks to be set to a non-zero value, select() completes successfully and returns 0.

Read more here

EDIT to address comments, and add new information:

A couple of noteworthy points.

First - in the comments, there is a suggestion to add sleep() to your worker loop. This is a good suggestion. The reasons stated here, although dealing with thread entry points, still apply, since you are instantiating a continuous loop.

Second - Linux select() is a system call with an interesting implemantation history, and as such has a range of varying behaviours from implementation to implementation, some which may contribute to the unexpected behaviours you are seeing. I am not sure which of the major blood lines of Linux Arch Linux comes from, but the man7.org page for select() includes the following two segments, which per your descriptions appear to describe conditions that could possibly contribute to the delays you are experiencing.

Bad checksum:

Under Linux, select() may report a socket file descriptor as "ready
for reading", while nevertheless a subsequent read blocks. This could
for example happen when data has arrived but upon examination has wrong
checksum and is discarded.

Race condition: (introduces and discusses pselect())

...Suppose the signal handler sets a global flag and returns. Then a test
of this global flag followed by a call of select() could hang indefinitely
if the signal arrived just after the test but just before the call...

Given the description of your observations, and depending on how your version of Linux is implemented, either one of these implementation features may be a possible contributor.

ryyker
  • 22,849
  • 3
  • 43
  • 87
  • I think you mistyped, but with timeout set to 0 it won't block. That's what I expect and what I get on my second select() at the end of the loop. The problem is with the first select() with a non-zero timeout. It returns much later than timeout (sometimes 5ms) without timing out. I assume the data arrived before timeout but can't confirm. – Stefan Hartman Feb 20 '17 at 14:58
  • @StefanHartman - Yes, thank you. I just caught the mis-type. It believe it is correct now. Check that you are setting your masks correctly. I assume this is all happening in a single threaded app? – ryyker Feb 20 '17 at 15:04
  • Yes, single threaded. Masks and timeouts are set and reset correctly with each loop repeat. Is it possible that the thread is preempted sometime during select() that makes it return late without timing out? This is on a raspberry pi. – Stefan Hartman Feb 20 '17 at 15:14
  • I check the time immediately after select() so am pretty sure it is the culprit yet no timeouts – Stefan Hartman Feb 20 '17 at 15:17
  • @StefanHartman - I am not familiar with the OS on the Raspberry family, but the code behavior should be fairly deterministic for `select`. Post some example code in your question. – ryyker Feb 20 '17 at 15:52
  • I'm using Arch Linux. Code posted – Stefan Hartman Feb 21 '17 at 00:31
  • @StefanHartman - Thank you for posting code. Regarding your discussions and comments..., I am not sure what Arch Linux is most closely related to in the family of Linuxs', but I wonder if the occasional (_happen 20 times in 1000 repeats_) occurrences could be caused by either an occasional rejected check sum error, or perhaps race conditions related to system signal handling. I have provided small excerpts of discussion on both, and some links above. – ryyker Feb 21 '17 at 15:08