It sounds like you just took some serial code and compiled it, thinking it would work.
But assuming you actually have parallel code, here are some things to check:
- Target the architecture your card actually has. Under Properties -> CUDA C/C++ -> Device -> Code Generation, make sure you have the correct value. For my card I have compute_35,sm_35. If your card is Maxwell-based, you can use compute_50,sm_50.
- You can change your optimization level under **CUDA C/C++ -> Optimization**.
- Make sure you are not compiling with debug on. Device debug information (`-G`) turns off most device-code optimizations.
- If all of these fail, run the Nsight analysis tools (or the Visual Profiler) on your application to see where the problems are. If you are using shared memory, check that you don't have bank conflicts, reduce warp divergence, etc. The Visual Profiler is pretty good about telling you what is wrong.
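
For the first point in the list, if you are not sure which value your card needs, you can ask the CUDA runtime for its compute capability. The snippet below is only a minimal sketch: `codeGenerationValue` and `printCodeGenerationValues` are names I made up for illustration, while `cudaGetDeviceCount` and `cudaGetDeviceProperties` are the real runtime calls. I am assuming you build it with nvcc as a `.cu` file.

```cuda
#include <cstdio>
#include <string>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): formats a compute capability
// as a Code Generation value, e.g. 3.5 becomes "compute_35,sm_35".
std::string codeGenerationValue(int major, int minor) {
    const std::string cc = std::to_string(major) + std::to_string(minor);
    return "compute_" + cc + ",sm_" + cc;
}

// Hypothetical helper: prints a suggested Code Generation value for every
// CUDA device in the system. Returns the device count, or -1 on error.
int printCodeGenerationValues() {
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            std::fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                         cudaGetErrorString(err));
            return -1;
        }
        // prop.major and prop.minor hold the compute capability of the device.
        std::printf("Device %d: %s -> %s\n", dev, prop.name,
                    codeGenerationValue(prop.major, prop.minor).c_str());
    }
    return count;
}
```

Call `printCodeGenerationValues()` from your `main` and copy the printed value into the Code Generation property. Keep in mind that your installed toolkit may not support every architecture, so treat the printed value as a starting point.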
You should also check out this GTC talk on optimizations [link to pdf] (given by my old professor). It covers some basic optimizations that you can perform to get your code up to speed.
The talks from the last few years of GTC can be found here [link]. They include updated optimization advice, talks about different tools, and so forth.