The Cooley-Tukey algorithm can operate on a variety of DFT lengths which can be expressed as N = N_1*N_2. The algorithm recursively expresses a DFT of length N into N_1 smaller DFTs of length N_2.
As you note, the fastest is generally the radix-2 factorization, which recursively breaks a DFT of length N into 2 smaller DFTs of length N/2, running in O(NlogN).
However, the actual performance will depend on hardware and implementation. For example, if we are considering the cuFFT with a thread warp size of 32 then DFTs that have a length of some multiple of 32 would be optimal (note: just an example, I'm not aware of the actual optimizations that exist under the hood of the cuFFT.)
Short answer: the underlying code is optimized for any prime factorization up to 7 based on the Cooley-Tukey radix-n algorithm.
http://mathworld.wolfram.com/FastFourierTransform.html
https://en.wikipedia.org/wiki/Cooley-Tukey_FFT_algorithm