I do not think they use Cooley-Tuckey algorithm because its index permutation phase makes it not very convenient for shared-memory architectures. Additionally this algorithm works with power-of-two memory strides which is also not good for memory coalescing. Most likely they use some formulation of Stockham self-sorting FFT: for example Bailey's algorithm.
What concerns the implementation, you are right, usually one splits a large FFT into several smaller ones which fit perfectly within one thread block. In my work, I used 512- or 1024-point FFTs (completely unrolled of course) per thread block with 128 threads. Typically, you do not work with a classical radix-2 algorithm on the GPU due to large amount of data transfers required. Instead, one chooses radix-8 or even radix-16 algorithm so that each thread performs one large "butterfly" at a time. For example implementations, you can also visit Vasily Volkov page, or check this "classic" paper.