0

I'm trying to convert this scalar code:

struct CICDecimator {
    static constexpr int64_t scale = ((int64_t)1) << 32;
    int mStages;
    int mFactor;
    float mGainCorrection;
    int64_t *pIntegrators;
    int64_t *pCombs;

    CICDecimator(int stages = 4, int factor = 8) {
        assert(stages > 0);
        assert(factor > 0);

        mStages = stages;
        pIntegrators = new int64_t[mStages + 1]{};
        pCombs = new int64_t[mStages]{};

        if (mFactor != factor) {
            mFactor = factor;
            mGainCorrection = 1.0f / (float)(pow(mFactor, mStages));
        }
    }
    ~CICDecimator() {
        delete[] pIntegrators;
        delete[] pCombs;
    }

    float process(const float *buffer) {
        for (int i = 0; i < mFactor; i++) {
            pIntegrators[0] = buffer[i] * scale;

            for (int j = 1; j <= mStages; j++) {
                pIntegrators[j] += pIntegrators[j - 1];
            }
        }

        int64_t s = pIntegrators[mStages];
        for (int i = 0; i < mStages; i++) {
            int64_t t = s;
            s -= pCombs[i];
            pCombs[i] = t;
        }

        return mGainCorrection * (s / (float)scale);
    }
};

to the SSE/SIMD version (built with -march=nocona), trying to gain performances on process() function (which is called heavy, within an audio application).

The main problem is that buffer's input will be m128 (4 x float) values. And it seems CICDecimator need 64 bit integer for precision to work proper (so, only two slot).

Here's my actual code:

struct CICDecimatorV4 {
    const __m128 scale4 = _mm_set1_ps((float)(((int64_t)1) << 32));
    int mStages;
    int mFactor;
    __m128 mGainCorrection4;
    __m128d *pIntegrators4_1;
    __m128d *pIntegrators4_2;
    __m128d *pCombs4_1;
    __m128d *pCombs4_2;

    CICDecimatorV4(int stages = 4, int factor = 8) {
        assert(stages > 0);
        assert(factor > 0);

        mStages = stages;
        pIntegrators4_1 = new __m128d[mStages + 1]{};
        pIntegrators4_2 = new __m128d[mStages + 1]{};
        pCombs4_1 = new __m128d[mStages]{};
        pCombs4_2 = new __m128d[mStages]{};

        if (mFactor != factor) {
            mFactor = factor;
            mGainCorrection4 = _mm_div_ps(_mm_set1_ps(1.0f), _mm_set1_ps((float)pow(mFactor, mStages)));
        }
    }
    ~CICDecimatorV4() {
        delete[] pIntegrators4_1;
        delete[] pIntegrators4_2;
        delete[] pCombs4_1;
        delete[] pCombs4_2;
    }

    __m128 process(const __m128 *buffer) {
        for (int i = 0; i < mFactor; i++) {
            // float to double
            __m128i bufferInt = _mm_cvttps_epi32(_mm_mul_ps(buffer[i], scale4));
            pIntegrators4_1[0] = _mm_cvtepi32_pd(bufferInt);
            pIntegrators4_2[0] = _mm_cvtepi32_pd(_mm_srli_si128(bufferInt, 8));

            for (int j = 1; j <= mStages; j++) {
                pIntegrators4_1[j] += pIntegrators4_1[j - 1];
                pIntegrators4_2[j] += pIntegrators4_2[j - 1];
            }
        }

        __m128d sD1 = pIntegrators4_1[mStages];
        __m128d sD2 = pIntegrators4_2[mStages];
        for (int i = 0; i < mStages; i++) {
            __m128d t1 = sD1;
            __m128d t2 = sD2;

            sD1 -= pCombs4_1[i];
            sD2 -= pCombs4_2[i];

            pCombs4_1[i] = t1;
            pCombs4_2[i] = t2;
        }

        // double to float
        __m128 sF1 = _mm_cvtpd_ps(sD1);
        __m128 sF2 = _mm_cvtpd_ps(sD2);
        __m128 sF; // <<<< TODO: i need to merge the two float on a single one

        return _mm_mul_ps(mGainCorrection4, _mm_div_ps(sF, scale4));
    }
};

I converted int64_t to m128d (two slots both), split the input m128 to two m128d and duplicate each calculation for keep vector and double precision calculation.

Questions:

  1. is this approch correct?
  2. how can I convert two m128d to a single m128? (first on lower section, second on upper part) (i.e. the <<<< TODO part)
  3. can be the whole improved? (:P)

Thanks

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
markzzz
  • 47,390
  • 120
  • 299
  • 507
  • 2. Convert `m128d` to `m128` via `CVTPD2PS` and merge two `m128` to one via `SHUFPS` – MikeCAT Apr 15 '21 at 15:32
  • @MikeCAT: this way? `__m128 sF = _mm_shuffle_ps(sF1, sF2, _MM_SHUFFLE(1, 0, 1, 0));` – markzzz Apr 15 '21 at 15:41
  • 2
    You probably need `_mm_movelh_ps` instruction – Soonts Apr 15 '21 at 15:57
  • Integer version of the same thing: [How to efficiently convert from two \_\_m128d to one \_\_m128i in MSVC?](https://stackoverflow.com/q/39503088). Use `ps` instead of `epi32`, and `_mm_movelh_ps` instead of `_mm_unpacklo_epi64`. Oh, and there is an exact duplicate after all in my `site:stackoverflow.com convert __m128d to __m128` search results: [How to convert two \_pd into one \_ps?](https://stackoverflow.com/q/54517648). – Peter Cordes Apr 15 '21 at 22:33
  • Are you sure your loop logic makes sense at all? Did you check the values in the `__m128d` vectors with a debugger to see if packing them into a `__m128` is the actual problem or not? The title question is a duplicate. If you want to turn this into a debugging question for the vectorization problem in general, include a [mcve] with debugging results, like actual input and output / temporary values for tiny test cases. – Peter Cordes Apr 15 '21 at 22:46
  • 1
    @PeterCordes lol the funny thing is that was my own question :( long time I didn't work with that stuff hehe – markzzz Apr 16 '21 at 07:03
  • @PeterCordes I could use ps instead of epi32, but how can I store low/high part of float to two m128d? (the other way question). don't see any fancy intrinsincs on SSE. Also, from original code, I need to "truncate" at some point, that's why I pass through epi32... – markzzz Apr 16 '21 at 07:05
  • You mean when you set `pIntegrators4_1[0]` and `pIntegrators4_2[0]`? Use `movhlps` to bring the high half of a `__m128` down for the 2nd `_mm_cvtps_pd`. i.e. `v = _mm_movehl_ps(v,v)`. I was wondering why you were truncating to integer and back to FP again, but I assumed that was intentional for the rounding. https://agner.org/optimize/ has a table of data-movement instructions for different kinds of data movement, in the SIMD chapter of his asm optimization guide. Helpful for finding instructions to move data within a register, for example. – Peter Cordes Apr 16 '21 at 07:11
  • @PeterCordes: found why it crash. double can't manage as int64 value, so i need to cast one m128 to two m128i (and than back to m128). Try to investigate if this is possible with SSE, otherwise I'll open a new question. – markzzz Apr 16 '21 at 09:25
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231203/discussion-between-markzzz-and-peter-cordes). – markzzz Apr 16 '21 at 09:34

0 Answers0