Yes, in caching mode a single 128-byte transaction will be generated (as seen from the L1 cache level). In uncached mode, four 32-byte transactions will be generated (as seen from the L2 cache level; it's still a single 128-byte request coming from the warp, due to coalescing). In the case you describe, the four 32-byte transactions are not any slower for a fully coalesced access, regardless of cached or uncached mode. The memory controller (on a given GPU) should generate the same transactions to satisfy the warp's request in either case.

Since the memory controller is composed of a number of "partitions" (up to 6), each of which has a 64-bit wide path, multiple memory transactions (possibly spread across multiple partitions) will ultimately be used to satisfy either request (4x32-byte or 1x128-byte). The specific number of transactions and their organization across partitions may vary from GPU to GPU. (This isn't part of your question, but a GPU with DDR-pumped memory will return 16 bytes per partition per memory transaction, and one with QDR-pumped memory will return 32 bytes per partition per memory transaction.) This isn't specific to CUDA 5, either.

You might want to review one of NVIDIA's webinars on this material, in particular "CUDA Optimization : Memory Bandwidth Limited Kernels". Even if you don't want to watch the video, a quick review of the slides will remind you of the various differences between so-called "cached" and "uncached" accesses (this refers to L1), and also give you the compiler switches needed to try each case.
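For reference, the switches in question are the `-dlcm` options passed to `ptxas` through `nvcc` (the file names below are just placeholders for your own build):

```shell
# "Cached" mode (the default): global loads go through L1,
# producing 128-byte transactions at the L1 level.
nvcc -Xptxas -dlcm=ca -o app_cached kernel.cu

# "Uncached" mode: global loads bypass L1 and are serviced from L2
# in 32-byte transactions.
nvcc -Xptxas -dlcm=cg -o app_uncached kernel.cu
```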
Another reason to review the slides is that they will remind you of the circumstances under which you might want to try "uncached" mode. In particular, if you have a scattered (uncoalesced) access pattern coming from your warps, uncached mode may yield an improvement, because there is less "wastage" when requesting 32-byte quantities from memory to satisfy a single thread's request, as compared to 128-byte quantities. However, in response to your final question, it's fairly difficult to be analytical about it, because presumably your code is a mix of ordered and disordered access patterns. Since uncached mode is turned on via a compiler switch, the suggestion given in the slides is simply to "try your code both ways" and see which runs faster. In my experience, running in uncached mode rarely yields a perf improvement.
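To make the two access patterns concrete, here's a sketch (the kernel names and the `stride` parameter are made up for illustration; assumes 4-byte `float` elements):

```cuda
// Fully coalesced: adjacent threads in a warp read adjacent 4-byte words,
// so the warp's request is satisfied by one 128-byte line (cached mode)
// or four 32-byte segments (uncached mode) -- same bytes moved either way.
__global__ void coalesced_copy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Scattered: each thread reads a word far from its neighbors'. In cached
// mode each 4-byte read drags in a full 128-byte line; in uncached mode
// only a 32-byte segment is fetched, so the "wastage" per thread is
// roughly 8x instead of 32x.
__global__ void strided_read(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[((size_t)i * stride) % n];
}
```

Profiling both kernels compiled with `-dlcm=ca` and `-dlcm=cg` is one way to see the effect the slides describe on your own GPU.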
EDIT: Sorry I had the link and title for the wrong presentation. Fixed slide/video link and webinar title.