
I have millions of Eigen fp32 feature rows, each shaped like `Eigen::MatrixXf::Random(1, 512)`.

Matrix subtraction and `squaredNorm` are used for the calculations.

Can an FPGA make this faster than the CPU, and by what order of magnitude? Is there any FPGA on the market that I could evaluate?

  Eigen::MatrixXf feat, b_cmp = Eigen::MatrixXf::Random(1, 512);
  for (int i = r.begin(); i < r.end(); ++i) {
    auto distance = (feat.row(i) - b_cmp).squaredNorm();
    if (distance < threshold) {  // threshold: some float cutoff
      mutex1.lock();
      found_number.push_back(i);
      mutex1.unlock();
    }
  }
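For reference, the per-row kernel above is just a 512-element subtract-and-accumulate. A dependency-free sketch (plain C++, no Eigen or TBB; the names `squared_distance` and `scan` are illustrative, not from the original code) shows what each loop iteration computes:

```cpp
#include <cstddef>
#include <vector>

// Squared Euclidean distance between two length-n feature rows;
// equivalent to (feat.row(i) - b_cmp).squaredNorm() in the Eigen snippet.
float squared_distance(const float* a, const float* b, std::size_t n) {
    float acc = 0.0f;
    for (std::size_t k = 0; k < n; ++k) {
        float d = a[k] - b[k];
        acc += d * d;
    }
    return acc;
}

// Brute-force scan over a row-major buffer of rows*512 floats:
// collect the indices of all rows closer to `query` than `threshold`.
std::vector<int> scan(const std::vector<float>& feat,
                      const float* query, float threshold) {
    const std::size_t dim = 512;
    std::vector<int> found;
    for (std::size_t i = 0; i * dim < feat.size(); ++i) {
        if (squared_distance(&feat[i * dim], query, dim) < threshold)
            found.push_back(static_cast<int>(i));
    }
    return found;
}
```

With row-major storage each row is one contiguous, sequential read, so the scan is limited by memory bandwidth rather than arithmetic.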
AILORD
  • Why go for an FPGA instead of a GPU? – Homer512 Aug 16 '22 at 06:16
  • For hardware recommendations try https://hardwarerecs.stackexchange.com/ – chtz Aug 16 '22 at 07:42
  • Hardware questions are off-topic. For programming questions make sure to provide a [mre] (please read that link). – chtz Aug 16 '22 at 07:46
  • You can likely increase the speed, if you are able to store `feat` as a row-major matrix. Not sure why you need the mutexes -- that loop alone does not involve any multi-threading. – chtz Aug 16 '22 at 07:57
  • @chtz TBB is involved above; that's why the mutexes. The DB size is greater than the GPU memory (100 GB). – AILORD Aug 16 '22 at 11:59
  • @Homer512 Any example of how to do it with a GPU? Especially for big DB sizes. – AILORD Aug 16 '22 at 12:00
  • You can probably adapt something like this paper: https://vincentfpgarcia.github.io/data/Garcia_2010_ICIP.pdf (source code linked in document). Many other papers exist. Just search on Google Scholar https://scholar.google.com/scholar?q=gpu+search+high+dimensional+space As for the large size, just overlap computation with copying the next batch to the GPU: https://developer.nvidia.com/blog/how-overlap-data-transfers-cuda-cc/ – Homer512 Aug 16 '22 at 12:29
  • If you store your matrix row-major (`Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic, Eigen::RowMajor>`) your loop will be limited by memory throughput (I assume for any `feat` matrix not fitting into L3 cache) -- I doubt that FPGA or GPU will help you there. And for such large DBs you may consider a different format, maybe something like a k-d-tree. W.r.t. multi-threading, I would consider having separate `found_number` vectors per thread and join them at the end (that way, you will also have a more deterministic order, in case that matters). – chtz Aug 16 '22 at 21:55
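The per-thread-vector idea from the comments can be sketched as follows. This uses plain `std::thread` standing in for the TBB range, and the name `parallel_scan`, the dimension handling, and the 4-thread split are illustrative assumptions, not the asker's code:

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Each thread scans a contiguous slice of rows and appends matches to its
// own result vector -- no mutex needed -- and the slices are concatenated
// in thread order, which also makes the output ordering deterministic.
std::vector<int> parallel_scan(const std::vector<float>& feat, // row-major, rows*dim
                               const std::vector<float>& query,
                               float threshold, unsigned num_threads = 4) {
    const std::size_t dim = query.size();
    const std::size_t rows = feat.size() / dim;
    std::vector<std::vector<int>> per_thread(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = rows * t / num_threads;
            const std::size_t end   = rows * (t + 1) / num_threads;
            for (std::size_t i = begin; i < end; ++i) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < dim; ++k) {
                    float d = feat[i * dim + k] - query[k];
                    acc += d * d;
                }
                if (acc < threshold)
                    per_thread[t].push_back(static_cast<int>(i));
            }
        });
    }
    std::vector<int> found;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers[t].join();  // join before reading that thread's results
        found.insert(found.end(), per_thread[t].begin(), per_thread[t].end());
    }
    return found;
}
```

With TBB, the same shape falls out of `tbb::parallel_for` plus a per-task container (e.g. `tbb::combinable` or `tbb::enumerable_thread_specific`), avoiding the lock entirely.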

0 Answers