Tree search operations are not simple to implement in CUDA. There are some papers on the subject.
One of them describes a rather simple implementation (not quite massively parallel, in my opinion):
- "Accelerating Large Graph Algorithms on the GPU Using CUDA"
Pawan Harish and P. J. Narayanan
The difficulty comes from the fact that tree operations generally involve decision making, and different branches are taken depending on those decisions. Threads that take different branches cannot easily proceed in lockstep, so massively parallelizing the operations without overlapping or redundant work is quite hard.
Some approaches use stack or queue implementations to traverse trees.
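
To give a rough idea of what a queue-based traversal can look like, here is a minimal sketch of a level-by-level (BFS-style) walk that computes the depth of every node. It is only an illustration under stated assumptions, not the method of any particular paper: the tree is assumed to be stored as CSR-style child lists, and the names `expand_level`, `tree_depths`, `child_offsets` and `children` are made up for this example. Because every node in a tree has exactly one parent, each child is appended to the next queue exactly once, so no visited check is needed; a general graph would need one. Error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <utility>
#include <vector>

// One thread per node in the current frontier queue: label the node's
// children with the next depth and append them to the next queue.
__global__ void expand_level(const int* child_offsets, const int* children,
                             const int* frontier, int frontier_size,
                             int* next_frontier, int* next_size,
                             int* depth, int level) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= frontier_size) return;
    int node = frontier[i];
    for (int e = child_offsets[node]; e < child_offsets[node + 1]; ++e) {
        int child = children[e];
        depth[child] = level + 1;
        // Reserve a unique slot in the next queue.
        int slot = atomicAdd(next_size, 1);
        next_frontier[slot] = child;
    }
}

// Hypothetical host helper: returns the depth of every node (root = 0).
// Assumes the input really is a tree rooted at `root`.
std::vector<int> tree_depths(const std::vector<int>& child_offsets,
                             const std::vector<int>& children, int root) {
    int n = static_cast<int>(child_offsets.size()) - 1;
    std::vector<int> depth(n, -1);
    depth[root] = 0;

    int *d_off, *d_children, *d_cur, *d_next, *d_next_size, *d_depth;
    cudaMalloc(&d_off, (n + 1) * sizeof(int));
    cudaMalloc(&d_children, (children.size() + 1) * sizeof(int));
    cudaMalloc(&d_cur, n * sizeof(int));
    cudaMalloc(&d_next, n * sizeof(int));
    cudaMalloc(&d_next_size, sizeof(int));
    cudaMalloc(&d_depth, n * sizeof(int));

    cudaMemcpy(d_off, child_offsets.data(), (n + 1) * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_children, children.data(), children.size() * sizeof(int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(d_depth, depth.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_cur, &root, sizeof(int), cudaMemcpyHostToDevice);

    int frontier_size = 1;
    const int threads = 256;
    for (int level = 0; frontier_size > 0; ++level) {
        cudaMemset(d_next_size, 0, sizeof(int));
        int blocks = (frontier_size + threads - 1) / threads;
        expand_level<<<blocks, threads>>>(d_off, d_children, d_cur,
                                          frontier_size, d_next, d_next_size,
                                          d_depth, level);
        // Blocking copy: waits for the kernel, then reads the next queue size.
        cudaMemcpy(&frontier_size, d_next_size, sizeof(int),
                   cudaMemcpyDeviceToHost);
        std::swap(d_cur, d_next);
    }

    cudaMemcpy(depth.data(), d_depth, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_off); cudaFree(d_children); cudaFree(d_cur);
    cudaFree(d_next); cudaFree(d_next_size); cudaFree(d_depth);
    return depth;
}
```

Note that the parallelism here is limited by the width of each level: near the root only a handful of threads have work, and a node with many children keeps one thread busy while its neighbours idle. That is one concrete form of the difficulty described above.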
You may find a similar question here:
Error: BFS on CUDA Synchronization