0

I come from the GPU world. When I submit a 1024x1024-pixel image to the GPU for processing, I know there are not 1,048,576 threads running in parallel on the GPU. If the wavefront size of the GPU is 64, then 64 threads truly run in parallel, and many of these wavefronts run in parallel too. I would say a GPU can truly run as many threads in parallel as it has stream processors, which ranges from a few hundred to around 10K. For a 1024x1024 image, a GPU with 10K threads must run 100 workload chunks of 10K threads each in series. (I am simplifying for the sake of the example. GPUs actually perform very complicated workload management, but I simplify a lot to illustrate what I mean.)

How many truly parallel threads can I run on an FPGA? I mean for real. Let's consider for a moment the SIMD registers from Intel's AVX technology. What is the largest SIMD register I can program inside an FPGA?

I leave the supporting/additional clock ticks of the device outside this question. Let's say I prepare 1 million threads to run in parallel. I am not running them in parallel yet; I am just preparing the data. For the example, I would even stall the execution flow of some threads until all of them are ready. All 1 million threads. Then I need to perform an AND operation, and I want as many AND operations as possible to happen in parallel during one single clock tick. At some point in the program I want one tick to run as many AND ops in parallel as possible. Can I run 1 million ANDs in parallel on an FPGA? For the sake of the example, let's assume the FPGA is large enough.

Pipi
  • 11
  • Welcome to Stack Overflow. Please read through the [Help Center > Asking](https://stackoverflow.com/help/on-topic) section, and ensure your question follows the guidelines, since that will give you most success getting answers in this forum. – Morten Zilmer Aug 26 '21 at 13:06
  • I’m voting to close this question because the question is about a general technology, and not programming specific. – Morten Zilmer Aug 26 '21 at 13:09
  • You can get insight into FPGA technology by looking at some of the many FPGA tutorials on the net. For a short answer: FPGAs are made of only flip-flops and gates with programmable connectivity and operation, so you can program the FPGA to be just as parallel as you want. So the question is like asking "How parallel are transistors"? – Morten Zilmer Aug 26 '21 at 13:16
  • @MortenZilmer thank you for your last answer! I don't have time to study every single technology on the internet. I needed to know only this to decide whether it is worth further reading, for my algorithm is highly parallel. In my own opinion, I consider it appropriate to ask if I can write self-modifying code in assembly before I start reading the thousands of pages of manuals from Intel, in the case that being able to write self-modifying code is my main concern. In my own opinion, what I did was appropriate. – Pipi Aug 26 '21 at 15:40
  • @MortenZilmer, and I am very sorry if I accidentally broke some rule from the Law of Jante. – Pipi Aug 26 '21 at 15:42
  • There is nothing wrong with asking, but if you want a useful answer, you need to ask in the right forum, and Stack Overflow is just not the right forum for this kind of question. Off the top of my head, I would suggest Reddit or Quora instead. – Morten Zilmer Aug 26 '21 at 19:26
  • FPGAs are very different from GPUs. AFAIK, FPGAs are clearly much more low-level, as you need to design the circuit that performs your computation, as opposed to GPUs, which are designed to perform numerical IEEE floating-point computations. There are tools that automatically generate the low-level representation for an FPGA, but there is still a need to optimize/update/configure the FPGA with the target circuit (it can be done automatically, but this is slow). So an FPGA is a bit like designing your own processor fitting your specific needs, while GPUs are specialized computing units that can execute user code directly. – Jérôme Richard Aug 27 '21 at 08:26
  • You'll get a 100% load on a GPU only in a rare ideal case. The moment you have any kind of memory access, it's not that parallel any more. With an FPGA you can get much higher levels of parallelism for tricky cases, since you can design your own memory architecture. If all you want is some kind of SPMD load, the level of parallelism is limited by FPGA resources (DSP slices, LUTs, etc.) and by the memory access pattern and its potential for parallelization. The former can be solved by picking a larger FPGA; the latter is a fundamental limitation. – SK-logic Sep 06 '21 at 08:51

1 Answer

0

FPGAs are effectively a collection of gates, flip-flops, memory, and interfaces that I will refer to collectively as resources. Depending on the vendor, there are various architectures and device sizes that provide different quantities of those resources at different price points. We are talking about devices that range from as small as a hundred gates or flip-flops with no memory up to devices with hundreds of thousands of flip-flops and megabits of memory.

Resource scale will determine your ultimate parallelization scale. You need to determine how many resources a single processing instance requires in terms of memory, flip-flops, gates, etc. The single-instance resource count can then be used as the denominator in a ratio against the overall resource count of a particular device. In practice, FPGAs become harder to synthesize the fuller they get, so that is an adjustment to consider. So far the calculation looks like this:

                            (total resources - reserved resources)
total_parallel_instances =  ______________________________________
                                single_instance_resource
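As a rough sketch of that ratio, here is the calculation with made-up numbers; the LUT counts below are placeholders, and you would substitute the figures from your target device's datasheet and your single-instance synthesis report:

```python
# All numbers are hypothetical, for illustration only.
total_luts    = 1_000_000  # LUTs available on the chosen device
reserved_luts = 100_000    # set aside for interfaces, control, routing margin
luts_per_unit = 50         # LUTs one processing instance consumes

total_parallel_instances = (total_luts - reserved_luts) // luts_per_unit
print(total_parallel_instances)  # 18000
```

In a real flow you would run the same ratio for each resource type (flip-flops, block RAM, DSP slices) and take the minimum, since the scarcest resource sets the limit.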

There are other limits relative to parallelization as well that will affect the ultimate answer about what level of parallelization can realistically be achieved.

You ask about the number of ANDs that can be run. If an FPGA has 100 logic blocks, and each block has 2 configurable gates that can be implemented as ANDs, then you can have 200 AND gates. If that isn't enough for a particular application, a larger device with more logic blocks can be selected. The real limiting factor here is how much you can spend on the device and how much board real estate you can afford. If cost and space are not a factor, you can have millions of ANDs if need be.
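To make the sizing concrete, here is that arithmetic, plus the count of such hypothetical logic blocks you would need for the one million parallel ANDs from the question (the 100-block, 2-gates-per-block device is the illustrative one from the paragraph above, not a real part):

```python
# Illustrative device: 100 logic blocks, 2 configurable gates per block.
logic_blocks    = 100
gates_per_block = 2
and_capacity    = logic_blocks * gates_per_block
print(and_capacity)  # 200 ANDs evaluated in one clock tick

# Blocks needed for the questioner's 1 million parallel ANDs
# (ceiling division, in case the count doesn't divide evenly).
needed_ands   = 1_000_000
needed_blocks = -(-needed_ands // gates_per_block)
print(needed_blocks)  # 500000
```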

Approaching your question as an FPGA architect, I would point out that the FPGA has to interface with your overall system. If it is part of an x86 system, the most traditional mechanism would be a PCIe interface. Your design could be different, but this is an important limiting factor. Your data rate in and out is limited by your interface, so depending on your design, your parallelism doesn't need to be greater than what the interface can support. Additionally, you have to factor in how long your processing takes.

Let's say a single instance of a processing mechanism takes 100 clocks to obtain a result, and the time it takes to move the data into the processor is 50 clocks. As a result, 2 bundles of data can be transferred in the time it takes to process one bundle. Discarding any other inefficiencies, only 2 processing instances are required to keep up with the interface. Adding a third processing instance in this situation would leave one instance sitting idle about 33% of the time.
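The instance-sizing rule in that example can be sketched as a one-liner; the clock counts are the assumed figures from the paragraph above, not measurements:

```python
import math

# Assumed figures from the example above.
processing_clocks = 100  # clocks for one instance to produce a result
transfer_clocks   = 50   # clocks to move one bundle of data in

# While one instance is busy, the interface can deliver
# processing_clocks / transfer_clocks more bundles, so that is
# the number of instances needed to keep the interface saturated.
instances_needed = math.ceil(processing_clocks / transfer_clocks)
print(instances_needed)  # 2
```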

Parallel mechanisms that work with packetized data may use DMA-based technology to move information into and out of an FPGA design. There is overhead associated with the DMA engine and software. That overhead will affect the interface's data rate relative to data size, and it is incurred each time a DMA transfer starts. Bundling up enough data to feed several parallel processing instances in a single transfer incurs that overhead only once, raising efficiency.
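The amortization effect can be seen with some assumed numbers (the per-transfer overhead and per-bundle transfer time below are invented for illustration):

```python
# Hypothetical costs, for illustration only.
dma_overhead_clocks = 200  # fixed cost incurred per DMA transfer
clocks_per_bundle   = 50   # payload transfer time for one data bundle

def amortized_cost(bundles_per_transfer):
    """Average clocks spent per bundle when several bundles
    share one DMA transfer's fixed overhead."""
    total = dma_overhead_clocks + bundles_per_transfer * clocks_per_bundle
    return total / bundles_per_transfer

print(amortized_cost(1))  # 250.0 clocks per bundle
print(amortized_cost(8))  # 75.0 clocks per bundle
```

With one bundle per transfer the fixed overhead dominates; batching eight bundles brings the per-bundle cost close to the raw transfer time.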

That all said, you will find that if the data you are moving is very small, the overhead is high enough, and the processing time is small enough, doing the job in software is actually faster. But when the processing time is long relative to the transfer overhead, that is when the parallelism scale will have more benefit.

Rich Maes
  • 1,204
  • 1
  • 12
  • 29