Although both are the same in terms of performance, keep in mind that the first one is not recommended.
Returning from a thread inside a kernel can cause unexpected behaviour in the rest of your code.
By unexpected behaviour I mean any problem related to the warp, the minimum unit in which threads are grouped and scheduled together. For example, if you have an if / else block in your kernel, threads of the same warp may take different paths. This situation is known as thread divergence, and in the normal case it results in some threads remaining idle while others execute their branch's instructions.
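To make the two styles concrete, here is a minimal sketch, assuming the comparison is between an early `return` for out-of-range threads and wrapping the body in an `if`. The kernel names, the scaling operation and the host helper `scaleOnDevice` are hypothetical, and error checking is omitted for brevity. Both kernels produce the same result; the difference is whether surplus threads are still alive at the end of the kernel.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Style 1 (hypothetical name): surplus threads leave the kernel early.
__global__ void scaleEarlyReturn(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;                 // threads past the end exit here
    data[i] *= factor;
}

// Style 2 (hypothetical name): every thread stays, only the work is guarded.
__global__ void scaleGuarded(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;      // threads past the end just skip this
    // All threads of the block are still alive here, so a __syncthreads()
    // added at this point later would be reached by every thread.
}

// Hypothetical host helper: runs one of the two kernels over `values`.
// Error checking is omitted for brevity.
std::vector<float> scaleOnDevice(std::vector<float> values, float factor, bool earlyReturn)
{
    int n = static_cast<int>(values.size());
    assert(n > 0);              // a launch with zero blocks is invalid
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, values.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;   // last block may be partly idle
    if (earlyReturn)
        scaleEarlyReturn<<<blocks, threads>>>(d, n, factor);
    else
        scaleGuarded<<<blocks, threads>>>(d, n, factor);
    // cudaMemcpy on the default stream waits for the kernel to finish.
    cudaMemcpy(values.data(), d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    return values;
}
```

With 200 elements and 128 threads per block, the second block has 56 threads with no element to process; in the first kernel they exit, in the second they simply skip the multiplication.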
From the book CUDA by Example, Chapter 5, "Thread Cooperation":
But in the case of __syncthreads(), the result is somewhat tragic. The CUDA Architecture guarantees that no thread will advance to an instruction beyond the __syncthreads() until every thread in the block has executed the __syncthreads()
So the issue is mainly related to thread synchronization within a kernel. You can find a very good question and answer about this topic here: Can I use __syncthreads() after having dropped threads?
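Here is a sketch of the synchronization case, assuming a kernel that reverses each block-sized tile of an array through shared memory. The kernel `reverseTiles`, the tile size `kBlock` and the host helper are hypothetical. Threads whose index lies past the end of the array do not `return`; they skip the load but still execute `__syncthreads()`, so every thread of the block reaches the barrier as the quoted guarantee requires. Placing `if (i >= n) return;` before the barrier instead would make it a barrier inside divergent code, which the CUDA C++ Programming Guide warns is likely to hang or produce unintended side effects.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;   // assumed tile size = threads per block

// Reverses every kBlock-sized tile of `in` into `out`; the last tile may be partial.
__global__ void reverseTiles(const int *in, int *out, int n)
{
    __shared__ int tile[kBlock];
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    int i = base + t;

    // No early `return` for i >= n: those threads must still reach the barrier.
    if (i < n)
        tile[t] = in[i];

    __syncthreads();          // executed by every thread of the block

    int count = min(kBlock, n - base);   // valid elements in this tile
    if (t < count)
        out[base + t] = tile[count - 1 - t];
}

// Hypothetical host helper; error checking omitted for brevity.
std::vector<int> reverseTilesOnDevice(const std::vector<int> &host)
{
    int n = static_cast<int>(host.size());
    assert(n > 0);                           // a launch with zero blocks is invalid
    std::vector<int> result(n);
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    int blocks = (n + kBlock - 1) / kBlock;  // the last block may be partly unused
    reverseTiles<<<blocks, kBlock>>>(dIn, dOut, n);
    cudaMemcpy(result.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

With 300 elements and 256 threads per block, 212 threads of the second block have nothing to load, yet they still take part in the barrier, which keeps the kernel well defined.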
As a final note, I have also used this bad practice myself and no problem appeared, but there is no guarantee that problems won't arise in the future. It is something I would not recommend.