
I have to implement bubble sort using AVX instructions, with single-precision floats as input. Can anyone help me find the best implementation?

I wrote a bubble sort version for SSE3:

global sort32

sort32: start

    mov eax, [ebp+8]        ; float* x
    mov ebx, [ebp+12]       ; int n

    call    sort

    stop

    ; --------------------------------------------------
    ; Insert your sorting algorithm here
    ; --------------------------------------------------
    ; eax = vector start address
    ; ebx = vector length
    ; --------------------------------------------------

sort:   
    mov edi, ebx    ; outer loop counter
    sub edi, 4

ciclo_esterno:
    mov esi, 0      ; inner loop counter

ciclo_interno:
    movups  xmm0, [eax+esi*4]
    jmp     verifica

; check whether the 4-element block is already sorted
verifica:
    movaps  xmm4, xmm0
    shufps  xmm4, xmm4, 10010000b
    cmpleps xmm4, xmm0
    movmskps edx, xmm4
    cmp     edx,    15
    je  incremento  

    movaps  xmm1, xmm0
    movhlps xmm1, xmm0

    movaps  xmm4, xmm0  ; copy xmm0, then compare (min/max)
    minps   xmm0, xmm1
    maxps   xmm1, xmm4

    shufps  xmm1, xmm1, 11100001b   ; swap the maxima and compare again

    movaps  xmm4, xmm0  ; copy xmm0, then compare (min/max)
    minps   xmm0, xmm1
    maxps   xmm1, xmm4

    movaps  xmm4, xmm0
    shufps  xmm4, xmm4, 11100001b   ; compare the pair of minima

    cmpltps xmm4, xmm0
    movmskps edx, xmm4
    cmp     edx, 2
    je  cmp_max
    shufps  xmm0, xmm0, 11100001b   ; the pair is out of order, so swap it

cmp_max:
    movaps  xmm4, xmm1
    shufps  xmm4, xmm4, 11100001b   ; compare the pair of maxima

    cmpltps xmm4, xmm1
    movmskps edx, xmm4
    cmp edx, 2
    je  unisci
    shufps  xmm1, xmm1, 11100001b   ; the pair is out of order, so swap it

unisci:
    movlhps xmm0, xmm1
    movups  [eax+esi*4], xmm0

incremento: 
    inc esi
    cmp esi, edi
    jg aggiorna_edi
    jmp ciclo_interno

aggiorna_edi:
    dec edi
    cmp edi, 0
    jl endl
    jmp ciclo_esterno   

endl:   ret
Paul R
Frank

1 Answer

Most sorting algorithms do not lend themselves to SIMD implementation. You might want to consider using a sorting network instead, which is relatively simple to implement with SIMD for small numbers of elements. For larger sorts you can use the sorting network as the inner "kernel" of a larger non-SIMD sort algorithm.
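
To make the idea concrete, here is a minimal scalar sketch in C of a 4-element sorting network. It is only an illustration, not the code from the question: the names `cmp_exchange` and `sort4_network` are made up for this example, and it assumes the input contains no NaNs. Each compare-exchange is the step that a SIMD version would typically express as a min/max pair (e.g. `MINPS`/`MAXPS`) instead of a branch.

```c
#include <assert.h>
#include <stddef.h>

/* Compare-exchange: afterwards *a <= *b.
   In a SIMD version this would usually be a min/max pair. */
static void cmp_exchange(float *a, float *b)
{
    float lo = (*a < *b) ? *a : *b;
    float hi = (*a < *b) ? *b : *a;
    *a = lo;
    *b = hi;
}

/* Sorts 4 floats ascending with a fixed sequence of 5 comparators.
   The sequence never depends on the data, which is what makes
   sorting networks a good fit for SIMD.
   (Hypothetical helper name; assumes no NaNs in v.) */
void sort4_network(float v[4])
{
    assert(v != NULL);
    cmp_exchange(&v[0], &v[1]);
    cmp_exchange(&v[2], &v[3]);
    cmp_exchange(&v[0], &v[2]);
    cmp_exchange(&v[1], &v[3]);
    cmp_exchange(&v[1], &v[2]);
}
```

Calling `sort4_network` on `{3, 1, 4, 2}` leaves the array as `{1, 2, 3, 4}`. A larger sort could call a kernel like this on small blocks and merge the results with ordinary scalar code.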

Paul R
  • But for my problem I need to implement this type of algorithm. I wrote an SSE3 version. This is my code, and it works: http://pastebin.com/EimcJdQg. Now I have to implement it using AVX. – Frank Jul 01 '13 at 15:37
  • Did you *measure* the performance of your SSE code? Was it any faster than scalar code? – Paul R Jul 01 '13 at 17:04
  • The 32-bit SSE3 version sorts 100000 random numbers in 23 seconds. This AVX version, http://pastebin.com/bmQtNKrq, takes 33 seconds. Damn. I have to improve this performance. – Frank Jul 02 '13 at 16:58
  • Well, bubble sort is such a poor sorting algorithm in the first place (O(n^2)) that no amount of code optimisation will make up for this. You really ought to look at better sorting algorithms (see the suggestions above). – Paul R Jul 02 '13 at 18:13
  • I know this, but I have to do this for a small project, and we have to implement this algorithm in the best way possible. So, any help with my code? Many thanks. – Frank Jul 02 '13 at 18:58
  • If you can use SSSE3/SSE4 then you can make your existing SSE code a bit more efficient. For AVX/AVX2 the code will be very similar, but with double width vectors. – Paul R Jul 02 '13 at 19:10
  • Using shuffle instructions? – huseyin tugrul buyukisik Jul 02 '13 at 19:47
  • Yes - also look at the permute instructions. – Paul R Jul 02 '13 at 20:27
  • Thanks for the answers. The problem is that I can't find a better solution even in SSE3. I'm using shuffle instructions in both versions (SSE3 and AVX), and I think that to improve the AVX algorithm I need to know how to compare the numbers within a vector. Let me explain: an AVX vector holds 8 single-precision elements, but my code works with 4 elements at a time because I don't know how to work directly with all 8 numbers. Do you have any idea? – Frank Jul 02 '13 at 21:11
  • SSSE3 (not SSE3) has `PSHUFB` (`_mm_shuffle_epi8`) which might be useful and could reduce the number of branches. – Paul R Jul 02 '13 at 21:23
  • For AVX (can you also assume AVX2, i.e. Haswell?) you can do much the same thing: for most instructions the 8 elements are actually 2 x 4-element "lanes". – Paul R Jul 02 '13 at 21:24
  • So which permute operation can I use to solve the sorting problem in a YMM register? – Frank Jul 03 '13 at 08:18
  • You will want to use one of the few which work across lanes, e.g. `_mm256_permutevar8x32_ps` (`VPERMPS`), but note that this is AVX2 only (which is why I asked earlier whether you are limited to AVX only or whether you can assume AVX2, i.e. Haswell). – Paul R Jul 03 '13 at 08:57
  • I can't use AVX2. :/ Let me try to pin down my problem. My code is based on comparisons: I put the 2 high elements of ymm0 into the low part of ymm1, then I put the max in ymm1 and the min in ymm0. The problem is moving the ymm0 elements into ymm1: in SSE I have `movhlps`, which doesn't work the same way in AVX, so I have to use `vshufps`, which is heavier. So how can I use the permute operations? – Frank Jul 03 '13 at 09:33
  • You're going to have a lot of problems doing this efficiently with AVX only. All I can suggest is that you implement your current SSE routine treating each half of the YMM register separately, i.e. process 2 x 128 bit vectors at once using the same approach as for SSE. The branches could be a problem though, so you'll end up with some redundancy. – Paul R Jul 03 '13 at 09:39
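
The suggestion in the last comment (treat each 128-bit half of the YMM register separately) can be modelled in plain C. This is a rough scalar sketch under stated assumptions, not AVX code and not the code from either pastebin link: `sort_two_lanes`, `LANES` and `LANE_WIDTH` are hypothetical names, the 8-float array stands in for one YMM register, and the input is assumed to contain no NaNs. Each comparator step is applied to both halves before the next step starts, which is roughly how a single lane-wise 256-bit instruction would behave.

```c
#include <assert.h>
#include <stddef.h>

enum { LANES = 2, LANE_WIDTH = 4 };

/* Sorts each 4-float half of an 8-float block independently.
   The block models one YMM register seen as two 128-bit lanes.
   No element ever crosses from one lane to the other, and there
   is no data-dependent branch: every step always runs, even if a
   lane is already sorted (the redundancy mentioned above). */
void sort_two_lanes(float v[8])
{
    /* 5-comparator sorting network for 4 elements, as index pairs. */
    static const size_t net[5][2] = {
        {0, 1}, {2, 3}, {0, 2}, {1, 3}, {1, 2}
    };

    assert(v != NULL);
    for (size_t step = 0; step < 5; ++step) {
        /* Same comparator applied to both lanes. */
        for (size_t lane = 0; lane < LANES; ++lane) {
            float *a = &v[lane * LANE_WIDTH + net[step][0]];
            float *b = &v[lane * LANE_WIDTH + net[step][1]];
            float lo = (*a < *b) ? *a : *b;
            float hi = (*a < *b) ? *b : *a;
            *a = lo;
            *b = hi;
        }
    }
}
```

For the input `{8, 7, 6, 5, 4, 3, 2, 1}` this yields `{5, 6, 7, 8, 1, 2, 3, 4}`: each half is sorted, but the two halves still have to be merged. That merge is the cross-lane step that is awkward with AVX alone.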