My questions are about warps and warp scheduling. I'm using NVIDIA Fermi terminology here. My observations are below; are they correct?
A. Threads in the same warp execute the same instruction. Each warp includes 32 threads.
According to the Fermi whitepaper: "Fermi’s dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs."
From this, I think a warp (32 threads) is issued in two halves, since the 32 cores are split into two groups of 16. Each scheduler issues half of a warp to 16 cores in a cycle, so in total the two schedulers issue two warp-halves to two 16-core scheduling groups per cycle. In other words, one warp needs to be issued twice, half by half, on the Fermi architecture. If a warp contains only SFU operations, then it needs to be issued 8 times (32/4), since there are only 4 SFUs in an SM.
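To make the arithmetic in A concrete, here is a minimal sketch of my reading (plain Python, not CUDA). The unit counts are the ones from the whitepaper quote; the names `UNITS` and `issue_passes` are my own, and the "one pass per group of units" model is my assumption, not something the whitepaper states.

```python
import math

WARP_SIZE = 32  # threads per warp

# Units one issued instruction is sent to, per the whitepaper quote.
UNITS = {"core": 16, "load_store": 16, "sfu": 4}

def issue_passes(unit_kind: str) -> int:
    """Passes needed to push all 32 threads of a warp through one unit group
    (assumed model: each pass covers as many threads as there are units)."""
    return math.ceil(WARP_SIZE / UNITS[unit_kind])

for kind in UNITS:
    print(kind, issue_passes(kind))
```

Under that assumption, a warp takes 2 passes on the cores or load/store units and 8 passes on the SFUs.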
B. When a large number of threads is launched (say a 1-D block of 320 threads), consecutive threads are automatically grouped into 10 warps of 32 threads each. Therefore, if all threads are doing the same work, they execute exactly the same instructions, and all warps always carry the same instruction in this case.
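The grouping I have in mind for B can be sketched as follows (plain Python; the helper names are my own). For a 1-D block, I am assuming the warp index is the thread index divided by the warp size, which in a kernel would correspond to `threadIdx.x / warpSize`.

```python
WARP_SIZE = 32  # threads per warp

def warp_id(thread_idx: int) -> int:
    """Warp that a thread of a 1-D block falls into (consecutive grouping)."""
    return thread_idx // WARP_SIZE

def lane_id(thread_idx: int) -> int:
    """Position of the thread inside its warp."""
    return thread_idx % WARP_SIZE

def num_warps(block_size: int) -> int:
    """Warps needed for a block; a partial last warp still counts as one."""
    return (block_size + WARP_SIZE - 1) // WARP_SIZE

print(num_warps(320))            # 320 threads -> 10 warps
print(warp_id(31), warp_id(32))  # last thread of warp 0, first thread of warp 1
```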
Questions: Q1. Which part handles grouping threads into warps: software or hardware? If it is hardware, is it the warp scheduler? And how is the hardware warp scheduler implemented, and how does it work?
Q2. If I have 64 threads, where threads 0-15 and 32-47 execute one instruction while threads 16-31 and 48-63 execute another, is the scheduler smart enough to group nonconsecutive threads (with the same instruction) into the same warp (i.e., to group threads 0-15 and 32-47 into one warp, and threads 16-31 and 48-63 into another)?
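To spell out the Q2 scenario, here is a small sketch (plain Python; `instruction_for` and the labels "A" and "B" are hypothetical stand-ins for the two instructions). It assumes the consecutive grouping from observation B and no regrouping by the hardware; whether any regrouping happens is exactly what I am asking.

```python
WARP_SIZE = 32  # threads per warp

def instruction_for(thread_idx: int) -> str:
    """Hypothetical workload: threads 0-15 and 32-47 run "A", the rest run "B"."""
    return "A" if (thread_idx % 32) < 16 else "B"

def instructions_per_warp(n_threads: int) -> dict:
    """Set of distinct instructions inside each warp under consecutive grouping."""
    warps = {}
    for t in range(n_threads):
        warps.setdefault(t // WARP_SIZE, set()).add(instruction_for(t))
    return warps

for warp, instrs in instructions_per_warp(64).items():
    print(warp, sorted(instrs))
```

Under that assumption, both warps end up containing both instructions, which is why I wonder whether the threads can be regrouped.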
Q3. What's the point of having a warp size (32) larger than the scheduling group size (16 cores)? (This is a hardware question.) In this case (Fermi), a warp will be issued twice (over two cycles) anyway. If a warp were 16 wide, two warps would simply be issued instead (also over two cycles), which seems equivalent to the previous case. I wonder whether this organization is motivated by performance.
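The comparison behind Q3 can be sketched like this (plain Python; `total_passes` is my own helper, and it only counts passes over one 16-core group, ignoring any other cost a real scheduler might have):

```python
import math

GROUP_SIZE = 16  # cores per scheduling group

def total_passes(n_threads: int, warp_size: int) -> int:
    """Passes over one 16-core group needed to run one instruction for all threads."""
    warps = math.ceil(n_threads / warp_size)
    passes_per_warp = math.ceil(warp_size / GROUP_SIZE)
    return warps * passes_per_warp

print(total_passes(64, 32))  # 2 warps x 2 passes each
print(total_passes(64, 16))  # 4 warps x 1 pass each
```

In this simple count the two warp sizes come out the same, which is what prompts the question.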
The only reasons I can think of are that threads in the same warp are guaranteed to be synchronized, which can sometimes be useful, or that other resources such as registers and memory are organized on a warp-size basis. I'm not sure whether this is correct.