
I've written some Linux device drivers but I am still at the level of newbie hack. I can get them working but that's all I can claim. So far, I've been able to work them into a model of writing data using write() and reading data using read(). I occasionally use ioctl for more fine-tuned control.

Now I want to build a coprocessing block in FPGA logic and write a device driver for the ARM processor in that same FPGA to offload work from the ARM to the FPGA. I'm having a hard time working out how best to design this interface.

If access to the coprocessor were exclusive, data could be written to the driver, the processing would happen in the FPGA fabric, and the data would be retrieved with a call to read. However, exclusive access to the coprocessing hardware would be a waste. Ideally any user space process could use the hardware if it's available. I believe it would work if policy required user space processes to open the device, write data, read results, then close the file. However, it seems like the overhead of opening and closing the file each time the coprocessor needs to be accessed offsets the benefit of offloading the work in the first place.

I understand that there is a world of issues to be dealt with inside the device driver code to safely handle multiple access to the hardware. But just from a high level, I would love to see a concept that would make this interface work and adhere to good practices for Linux device drivers.

Temporarily sweeping aside all complications, the ideal seems like a system where any process can open the device and have an access point where data is written to the device, perhaps in a blocking call, and data is read back after the coprocessor does its magic. The driver would handle the hardware accesses and the calling processes can keep the device file open for as long as it's needed. Absolutely any insights or guidance would be greatly appreciated!
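
In code, the usage I'm imagining looks roughly like this sketch (the /dev/coproc node name and the 1024-sample block size are just placeholders I made up, not an existing driver):

    /* Sketch: user-space client that opens the device once and reuses it.
     * /dev/coproc and the 1024-sample block size are illustrative only. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    #define NSAMPLES 1024

    int main(void)
    {
        int fd = open("/dev/coproc", O_RDWR);   /* open once, keep it open */
        if (fd < 0) { perror("open"); return 1; }

        int16_t in[NSAMPLES], out[NSAMPLES];

        for (int pass = 0; pass < 100; pass++) {
            /* ... fill in[] with samples ... */

            /* Blocking write: hand the samples to the driver/FPGA. */
            if (write(fd, in, sizeof in) != (ssize_t)sizeof in) {
                perror("write");
                break;
            }

            /* Blocking read: returns once the coprocessor has finished. */
            if (read(fd, out, sizeof out) != (ssize_t)sizeof out) {
                perror("read");
                break;
            }

            /* ... consume out[] ... */
        }

        close(fd);
        return 0;
    }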

This is all extra information in case anyone cares or it's somehow useful or interesting:

This particular FPGA is a Zynq device from Xilinx. It has a dual-core ARM Cortex-A9 on the same silicon as the FPGA fabric (which is based on their Kintex family). The system is running Arch Linux for ARM and has done so quite beautifully for a year now. I use the generic name "coprocessor hardware" because the idea is that this chunk of hardware will gain capability over time while the user-space interface to its device driver remains fairly constant. You will be able to, for example, write 1024 samples and have this block perform a low-pass filtering operation, an FFT, etc., and get the results faster than the processor could have done so on its own.

Thank you! This is my first question here so I apologize for breaches of protocol and inherent ignorance.

--Tim

  • "It seems like the overhead of opening and closing the file each time the coprocessor needs to be accessed offsets the benefit of offloading the work in the first place". That statement is almost certain to be incorrect unless your coprocessor is doing a noop (unlikely, right)? Open and close are hardly intensive operations. So why complicate things? Open, read, write, close. Sounds good to me. Unless you can think of specific reasons why that is problematic (I couldn't really get any from your post). – kaylum Mar 13 '15 at 21:13
  • Really? That would be great. I had tested this with another driver and had convinced myself that the file open / close was a significant overhead. I may very well have fooled myself. I'll take a closer and more rigorous look. Thanks for challenging my assumption! – user2142412 Mar 16 '15 at 01:39
  • Depends what you mean by "significant overhead". The open syscall numbers from lmbench (micro benchmark suite) obviously vary depending on the system configuration. But they're on the order of 6-30 microseconds. So it would be interesting to know how you came to the initial conclusion regarding open. – kaylum Mar 16 '15 at 02:58
  • I think the answer is that I came by the initial conclusion with a combination of sloppy testing and a bias towards the result I thought I would get. I should have known better! Thanks again. – user2142412 Mar 16 '15 at 12:34
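
For reference, the open/close cost discussed in the comments above can be measured directly with a small loop like this sketch (/dev/coproc is a placeholder for whatever node the driver creates):

    /* Sketch: time open()+close() of a device node to check the overhead
     * discussed above. Substitute the actual device path. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
        const int iters = 100000;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < iters; i++) {
            int fd = open("/dev/coproc", O_RDWR);
            if (fd < 0) { perror("open"); return 1; }
            close(fd);
        }
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("open+close: %.0f ns per iteration\n", ns / iters);
        return 0;
    }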

1 Answer


My team has been working on this sort of thing for a couple of years. To enable the lowest latency between the CPU and the programmable logic, we memory-map the hardware into the application process so that it can communicate with the hardware directly. This eliminates the OS overhead after the initial connection.
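
As an illustration of that approach, user space can mmap the accelerator's register window through a device node (a UIO device is one common way to expose this on Zynq, though not the only one). The node name, map size, and register offsets below are assumptions for illustration, not our actual hardware layout:

    /* Sketch: map accelerator registers into the process and talk to the
     * hardware directly. /dev/uio0, the 4 KiB window, and the register
     * offsets are placeholders. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define REG_CTRL    0x00   /* hypothetical control register */
    #define REG_STATUS  0x04   /* hypothetical status register  */
    #define STATUS_DONE 0x1

    int main(void)
    {
        int fd = open("/dev/uio0", O_RDWR | O_SYNC);
        if (fd < 0) { perror("open"); return 1; }

        volatile uint32_t *regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                       MAP_SHARED, fd, 0);
        if (regs == MAP_FAILED) { perror("mmap"); return 1; }

        regs[REG_CTRL / 4] = 1;                        /* kick off an operation */
        while (!(regs[REG_STATUS / 4] & STATUS_DONE))  /* poll for completion   */
            ;                                          /* (or block on the irq) */

        munmap((void *)regs, 4096);
        close(fd);
        return 0;
    }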

Still, we find that CPU -> accelerator and back is at least 1 microsecond. This leads us to offload bigger chunks of work or use this path to configure data acquisition operations that write results directly to the system DRAM.

Depending on the mix of work, there are a variety of ways you can arrange for the accelerator to be shared.

  1. You can have a mutex protecting the hardware, so that each process using it has exclusive access (a sketch of this option follows the list).

  2. You can have a daemon with exclusive access, and have it multiplex requests and demultiplex responses.

  3. Your accelerator can provide multiple independent ports that can be used simultaneously by different processes. You need a way to assign the ports to processes and to reclaim them afterward.
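
As a sketch of option 1 (an assumption about how such a driver might be structured, not code from our project), a single mutex can serialize use of the accelerator: each write() runs one complete operation while holding the lock and stores the result in per-open state, and read() then returns that result without touching the hardware. All names here are illustrative:

    #include <linux/fs.h>
    #include <linux/module.h>
    #include <linux/mutex.h>
    #include <linux/slab.h>
    #include <linux/uaccess.h>

    #define COPROC_BUFSZ 4096                 /* illustrative block size */

    struct coproc_ctx {                       /* one per open() */
        u8 result[COPROC_BUFSZ];
        size_t result_len;
    };

    static DEFINE_MUTEX(coproc_hw_lock);      /* protects the hardware itself */

    /* Hypothetical helper: copy the request in from user space, drive the
     * fabric, block until it finishes, and fill out[]/out_len. Stubbed here. */
    static ssize_t coproc_run(const char __user *in, size_t in_len,
                              u8 *out, size_t *out_len)
    {
        return -ENOSYS;                       /* real version talks to the FPGA */
    }

    static int coproc_open(struct inode *inode, struct file *f)
    {
        f->private_data = kzalloc(sizeof(struct coproc_ctx), GFP_KERNEL);
        return f->private_data ? 0 : -ENOMEM;
    }

    static int coproc_release(struct inode *inode, struct file *f)
    {
        kfree(f->private_data);
        return 0;
    }

    static ssize_t coproc_write(struct file *f, const char __user *buf,
                                size_t len, loff_t *off)
    {
        struct coproc_ctx *ctx = f->private_data;
        ssize_t ret;

        if (len > COPROC_BUFSZ)
            return -EINVAL;

        if (mutex_lock_interruptible(&coproc_hw_lock))
            return -ERESTARTSYS;

        ret = coproc_run(buf, len, ctx->result, &ctx->result_len);

        mutex_unlock(&coproc_hw_lock);
        return ret < 0 ? ret : (ssize_t)len;
    }

    static ssize_t coproc_read(struct file *f, char __user *buf,
                               size_t len, loff_t *off)
    {
        struct coproc_ctx *ctx = f->private_data;
        size_t n = len < ctx->result_len ? len : ctx->result_len;

        if (copy_to_user(buf, ctx->result, n))
            return -EFAULT;
        return n;
    }

    static const struct file_operations coproc_fops = {
        .owner   = THIS_MODULE,
        .open    = coproc_open,
        .release = coproc_release,
        .write   = coproc_write,
        .read    = coproc_read,
    };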

If your accelerator has request and response queues, they can be accessed either by programmed I/O (memory-mapped hardware registers) or by shared-memory queues in system DRAM (with DMA to/from the programmable logic).
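
As an illustration of the shared-memory variant, here is a sketch of a single-producer/single-consumer descriptor ring in DRAM that the CPU fills and the fabric drains. The descriptor layout and head/tail convention are assumptions, not the format our hardware uses; a real design also needs cache handling (a non-cached mapping or explicit flush/invalidate) and a doorbell write to the fabric:

    #include <stdint.h>

    #define RING_DEPTH 64                     /* power of two */

    struct coproc_desc {
        uint64_t src_addr;                    /* DMA address of input buffer  */
        uint64_t dst_addr;                    /* DMA address of output buffer */
        uint32_t len;                         /* bytes to process             */
        uint32_t opcode;                      /* e.g. FILTER, FFT, ...        */
    };

    struct coproc_ring {
        volatile uint32_t head;               /* written by CPU (producer)    */
        volatile uint32_t tail;               /* written by fabric (consumer) */
        struct coproc_desc desc[RING_DEPTH];
    };

    /* Returns 0 on success, -1 if the ring is full. */
    static int ring_push(struct coproc_ring *r, const struct coproc_desc *d)
    {
        uint32_t head = r->head;
        uint32_t next = (head + 1) & (RING_DEPTH - 1);

        if (next == r->tail)                  /* full: consumer hasn't caught up */
            return -1;

        r->desc[head] = *d;
        __sync_synchronize();                 /* make the descriptor visible     */
        r->head = next;                       /* publish; fabric picks it up     */
        return 0;
    }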

See our FPGA 2015 paper for some more discussion of these approaches: http://www.connectal.org/connectal-fpga2015.pdf

Jamey Hicks