Linux CAN bus transmission timeout

Question

Scenario

There is a Linux-powered device connected to a CAN bus. The device periodically transmits a CAN message. The data carried by this message is a measurement rather than a command, i.e. only the most recent one is actually valid, and if some messages are lost that is not an issue as long as the latest one was received successfully.

Then the device in question is disconnected from the CAN bus for a period that is much longer than the interval between subsequent message transmissions. The device logic still tries to transmit the messages, but since the bus is disconnected the CAN controller is unable to transmit any of them, so the messages accumulate in the TX queue.

Some time later the CAN bus connection is restored, and all the accumulated messages are sent onto the bus one by one.

Problem

- When the CAN bus connection is restored, an undefined number of outdated messages will be transmitted from the TX queue.
- While the CAN bus connection is still unavailable but the TX queue is already full, the most recent messages (i.e. the only valid ones) will be discarded instead of being queued.
- Once the CAN bus connection is restored, there will be a short-term traffic burst while the TX queue is being flushed. This can disturb the Time Triggered Bus Scheduling if one is used (it is in my case).

Question

My application uses the SocketCAN driver, so the question is primarily about SocketCAN, but other options are welcome too if there are any.

I see two possible solutions: define a message transmission timeout (if a message was not transmitted within some predefined amount of time, it is discarded automatically), or abort transmission of outdated messages manually (though I doubt that is possible at all with the socket API).

Since the first option seems the most realistic to me, the question is:

- How does one define a TX timeout for a CAN interface under Linux?
- Are there other options to solve the problems described above, aside from TX timeouts?

I'd ask this question on the [linux-can mailing list](http://vger.kernel.org/vger-lists.html#linux-can). — yegorich, Oct 28 '13 at 13:32

score 0 · Answer 1 · answered Oct 19 '22 at 05:10

My solution for this problem was shutting down and bringing the device up again:

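#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Assumed to be a file-level flag; the original snippet did not show its declaration. */
static bool queue_cleared = false;

/* Empty the TX queue by taking the CAN interface down and bringing it up again. */
void clear_device_queue(void)
{
    if (queue_cleared)
        return;

    const char *dev = getenv("MOTOR_CAN_DEVICE");
    if (dev == NULL)
        return; /* interface name not configured, nothing to restart */

    char cmd[1024];

    /* snprintf bounds the write in case the device name is unexpectedly long */
    snprintf(cmd, sizeof cmd, "sudo ip link set down %s", dev);
    system(cmd);

    usleep(500000); /* wait 0.5 s before bringing the link back up */

    snprintf(cmd, sizeof cmd, "sudo ip link set up %s", dev);
    system(cmd);

    queue_cleared = true;
}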

score -2 · Answer 2 · answered Apr 17 '20 at 15:09

I don't know the internals of SocketCAN, but I think the larger part of the problem should be solved on a more general, logical level.

First, there is one aspect to clarify: the question includes the tag safety-critical...

If the CAN communication is not needed to implement a safety function, you can pick any solution you find useful. Parts of the second alternative may be useful for you in this case too, but they are not mandatory.
If, however, the communication is used in a safety-relevant context, there must be a concept that takes into account the requirements imposed by IEC 61508 (safety of programmable electronic systems in general) and IEC 61784-x/62280 (safe communication protocols). Those standards usually lead to some protocol measures that come in handy with any embedded communication, but especially for the present problem:
- Add a sequence counter to the protocol frames. The receiver shall check that the counter values it sees don't make larger "jumps" than allowed (e.g., if you allow up to 2 frames to be missed along the way, the max. counter increment is +3). The CAN bus may duplicate a frame, so a counter increment of +0 must be tolerated, too.
- The receiver must monitor that every received frame is followed by another within a timeout period. If your CAN connection is lost and recovered in the meantime, whether this check trips depends on whether the interruption was longer than the timeout or stayed within it. Additionally, the receiver may monitor that a frame doesn't follow the preceding one too early, but if the frames include the right data, this usually isn't necessary.
- > [...] The nature of the data carried by this message is like measurement rather than command, i.e. only the most recent one is actually valid, and if some messages are lost that is not an issue as long as the latest one was received successfully.
  
  Through CAN, you shall never communicate "commands" in the sense that every one of them can trigger a change, like "toggle output state" or "increment set value by one unit", because you never know whether a duplicated frame will hit you or not.
  
  Besides, you shall never communicate "anything safety-relevant" through a single frame because any frame may be lost or corrupted by an error. Instead, "commands" shall be transferred (like measurements) as a stream of periodic frames with measurement or set value updates.
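
The counter and timeout checks from the list above could be sketched on the receiver side roughly as follows. Everything here is illustrative: the names `rx_monitor_t` and `rx_monitor_check`, the 8-bit counter width, and the limits `MAX_COUNTER_STEP` and `RX_TIMEOUT_MS` are assumptions made for the example, not values taken from any standard.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_COUNTER_STEP 3u   /* assumed: up to 2 lost frames are tolerated */
#define RX_TIMEOUT_MS    100u /* assumed maximum gap between two frames */

typedef enum { RX_OK, RX_DUPLICATE, RX_JUMP, RX_TIMEOUT } rx_result_t;

typedef struct {
    bool     started;      /* false until the first frame has been seen */
    uint8_t  last_counter; /* sequence counter of the previous frame */
    uint32_t last_rx_ms;   /* arrival time of the previous frame */
} rx_monitor_t;

/* Classify a newly received frame by its sequence counter and arrival time. */
rx_result_t rx_monitor_check(rx_monitor_t *m, uint8_t counter, uint32_t now_ms)
{
    rx_result_t result = RX_OK;

    if (m->started) {
        /* Unsigned arithmetic keeps both differences correct across wrap-around. */
        uint32_t gap  = now_ms - m->last_rx_ms;
        uint8_t  step = (uint8_t)(counter - m->last_counter);

        if (gap > RX_TIMEOUT_MS)
            result = RX_TIMEOUT;   /* previous frame was not followed in time */
        else if (step == 0u)
            result = RX_DUPLICATE; /* +0: tolerated, the data can be ignored */
        else if (step > MAX_COUNTER_STEP)
            result = RX_JUMP;      /* more frames lost than allowed */
    }

    /* Resynchronise on every frame so that monitoring continues after an error. */
    m->started = true;
    m->last_counter = counter;
    m->last_rx_ms = now_ms;
    return result;
}
```

Note that this sketch only evaluates the timeout when a frame arrives. A real receiver would also need a timer that fires when no frame arrives at all, and it must decide what to do on `RX_JUMP` and `RX_TIMEOUT` (for example, enter a safe state).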

Now, in order to get the required availability out of the protocol design, the TX queue shouldn't be long. If you feel that you actually need a long queue, the bus may be overloaded relative to the timing requirements it faces. From my point of view, the TX "queue" shouldn't be longer than one or two frames. Then, the problem of recovering the CAN connection is nearly fixed...
