-1

I am trying to speed up the execution of my algorithm on an FPGA. I'm looking for a fixed-point math library with a 32.32 (64-bit) format, written in C, that would be easy to translate to OpenCL. Does anyone know a good library? I want to avoid 128-bit data types, since on OpenCL those are floating point, and I suspect it won't speed up my algorithm if I have to use floating point again. Any suggestion is appreciated. If there is a guide to creating my own library, I'm OK with that too, as long as it explains it simply enough haha.

Thanks

  • I think you will find that a fixed-point library is slower than just using the floating point hardware on the device. These days, floating point is well optimized, and integer units are not as strong on GPUs because most graphics algorithms use floating point. – Dithermaster Sep 22 '17 at 04:24
  • Yes, but on FPGAs that is not true, since they cannot beat GPUs in terms of the number of floating-point calculations. So I am thinking of testing fixed point to see how much faster it is than floating point. – Lorenzo Cucurachi Sep 25 '17 at 19:33
  • Good point. I'm not familiar with FPGA OpenCL implementations and was only speaking in terms of CPU or GPU. – Dithermaster Sep 26 '17 at 19:41

2 Answers

1

I have found GPUs great only for floats. I will give you some CUDA C++11 / C++14 tips:

-use the normalized float range [-1.0, +1.0] for greatest accuracy and store the normalizing value separately (an accumulated double),

-if the data has a high range anyway (division of big numbers ends in lossy normalization), normalize by subtracting the median (stored separately as a uint64_t) = big numbers will be stored with less accuracy. One can use a trimmed mean, e.g. 5%, instead of the median,

-sort and normalize periodically,

-in 2017, buy a new GTX 1080 Ti (best GFLOPS/USD and GFLOPS/W) or a used GTX 770,

-high-end FPGAs are great when used as preprocessing units after ADCs, or within embedded systems with strict low-power demands (typically network switches, media processing, e.g. video, real-time FFT devices, et cetera). Moreover, even the greatest models of these ultra-low-power computational devices rarely exceed a few hundred GFLOPS at $1500. That equals a brand-new, off-the-shelf, majority-of-problems-already-solved-on-the-NVIDIA-forum GT 730 4GB GDDR5 by Palit for $35,

-get the few-dozen-dollar book "CUDA by Example" by J. Sanders et al., the free YouTube course "Udacity Intro to Parallel Programming", and the great book "Professional CUDA C Programming" by J. Cheng et al. to become an intermediate CUDA C++11 programmer in three full-time months,

-research fixed-point arithmetic intended for older sequential CPUs yourself, and you will conclude that there are only limited libraries for cosine, square root and other basics. More complicated functions are problematic, and there is no big community support for fixing errors. In the end you will find there are no speed-ups over FPUs, or speed-ups smaller than an order of magnitude, for such a big effort (writing everything from scratch),

-buy a GPU of at least the Kepler microarchitecture (the popular GTX 670 onwards) for $50 from some not-well-educated teenager,

-install Ubuntu, and get GNU Octave and please-cite-GNU Parallel for the majority of non-GPU problem solving,

-use an FPGA to develop a high-end ASIC for mass production.
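The normalization tips above (scale into [-1.0, +1.0] for float work, keep the normalizing value separately in double precision) can be sketched in plain C. This is my own rough illustration with invented names, not code from any library:

```c
#include <stddef.h>
#include <math.h>

/* Normalizing value kept separately, at double precision. */
typedef struct {
    double scale;
} norm_info;

/* Scale x[] into [-1.0, +1.0] by its maximum absolute value,
   returning the scale so original values can be recovered. */
static norm_info normalize(float *x, size_t n)
{
    double max_abs = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double v = fabs((double)x[i]);
        if (v > max_abs) max_abs = v;
    }
    norm_info info = { max_abs > 0.0 ? max_abs : 1.0 };
    for (size_t i = 0; i < n; ++i)
        x[i] = (float)(x[i] / info.scale);   /* now in [-1.0, +1.0] */
    return info;
}

/* Recover an original value from its normalized form. */
static double denormalize(float v, norm_info info)
{
    return (double)v * info.scale;
}
```

The float work then happens entirely inside the well-conditioned [-1.0, +1.0] range, and only the final denormalization touches the double-precision scale.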

Post Scriptum: the user WhatsACreel on YouTube could write some fixed-point functions for you; write him an email with an honest offer. On his channel he explains the basics of fixed-point arithmetic.
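Since the question asks for a 32.32 format in C that avoids 128-bit types, here is a minimal untested sketch of what such a type might look like: a signed Q32.32 value in an int64_t, with multiplication emulated via 32-bit halves so the 128-bit intermediate product is never materialized. All names are my own invention, not an existing library:

```c
#include <stdint.h>

typedef int64_t fix32_32;   /* signed Q32.32: 32 integer bits, 32 fraction bits */

#define FIX_ONE ((fix32_32)1 << 32)

static inline fix32_32 fix_from_int(int32_t x) { return (fix32_32)x * FIX_ONE; }

/* Addition and subtraction are plain integer ops. */
static inline fix32_32 fix_add(fix32_32 a, fix32_32 b) { return a + b; }

/* Q32.32 * Q32.32: the full product is Q64.64 in 128 bits; the result is
   that product shifted right by 32. The 128-bit product is emulated with
   four 32x32->64 partial products, so this translates directly to OpenCL.
   Overflow past 64 bits wraps; INT64_MIN inputs are not handled. */
static inline fix32_32 fix_mul(fix32_32 a, fix32_32 b)
{
    int neg = (a < 0) ^ (b < 0);
    uint64_t ua = (uint64_t)(a < 0 ? -a : a);
    uint64_t ub = (uint64_t)(b < 0 ? -b : b);
    uint64_t a_hi = ua >> 32, a_lo = ua & 0xFFFFFFFFu;
    uint64_t b_hi = ub >> 32, b_lo = ub & 0xFFFFFFFFu;

    uint64_t hi_hi = a_hi * b_hi;   /* bits 64..127 of the full product */
    uint64_t hi_lo = a_hi * b_lo;   /* bits 32..95 */
    uint64_t lo_hi = a_lo * b_hi;   /* bits 32..95 */
    uint64_t lo_lo = a_lo * b_lo;   /* bits  0..63 */

    /* (full 128-bit product) >> 32, keeping the low 64 bits */
    uint64_t r = (hi_hi << 32) + hi_lo + lo_hi + (lo_lo >> 32);
    return neg ? -(fix32_32)r : (fix32_32)r;
}
```

Division, square root and the transcendentals are where the real work starts, which is exactly the limited-library problem described above.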

  • To be honest, I appreciate your answer, and yes, I am achieving better performance with CUDA on a GPU, but I need to experiment with FPGAs to see what I can achieve. How do you know fixed point isn't really that much faster? – Lorenzo Cucurachi Sep 26 '17 at 17:43
0

In spite of common misconceptions about FPGAs vs. GPUs, FPGAs have shown very impressive results. More information on FP16 and INT8 can be found here: https://www.altera.com/content/dam/altera-www/global/en_US/pdfs/literature/wp/wp-01269-accelerating-deep-learning-with-opencl-and-intel-stratix-10-fpgas.pdf

Although OpenCL is not a library-based approach for FPGAs, there are plenty of examples from Altera/Intel and Xilinx with different data types: https://www.altera.com/products/design-software/embedded-software-developers/opencl/developer-zone.html and https://github.com/Xilinx/SDAccel_Examples

More important than data width and types are the data-movement and data-reuse aspects of the algorithm, IMHO. The V100 got its performance boost over the P100 through clever scheduling, doing zero-copy with hardware assist, avoiding DRAM traffic, and doing tensor transposes in GPU hardware: https://devblogs.nvidia.com/tensor-core-ai-performance-milestones/

FPGAs are no different. To get apples-to-apples performance benchmarks, one has to learn these tricks and implement them on the FPGA in OpenCL or C (HLS) code.
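The data-reuse point can be illustrated even in plain C: a blocked (tiled) matrix multiply reuses each loaded value many times before moving on, which is the same trick OpenCL kernels play with local memory on an FPGA or GPU. A rough sketch of my own, not taken from the linked examples:

```c
#include <stddef.h>

/* Tiled matrix multiply, C = A * B, all n x n, row-major.
   The TILE-sized blocks stand in for OpenCL local memory: each loaded
   element of A is reused across a whole tile row of B before the next
   DRAM access, instead of being re-fetched per output element. */
#define TILE 16

static void matmul_tiled(const float *A, const float *B, float *C, size_t n)
{
    for (size_t i = 0; i < n * n; ++i) C[i] = 0.0f;
    for (size_t ii = 0; ii < n; ii += TILE)
        for (size_t kk = 0; kk < n; kk += TILE)
            for (size_t jj = 0; jj < n; jj += TILE)
                for (size_t i = ii; i < ii + TILE && i < n; ++i)
                    for (size_t k = kk; k < kk + TILE && k < n; ++k) {
                        float a = A[i * n + k];   /* loaded once, reused */
                        for (size_t j = jj; j < jj + TILE && j < n; ++j)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
```

The arithmetic is identical to the naive triple loop; only the memory-access order changes, which is where the speedup comes from on bandwidth-bound hardware.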

My Name