Well, for the example you give of GetPixel, it is slow because the actual work is done by a kernel-mode driver. In that driver it takes locks and runs a number of validation checks to confirm that the device context you passed really is a DC and to make sure it isn't changed partway through the function. It then copies an area into a new bitmap in memory, reads the pixel you want from that copy, and deallocates the bitmap afterwards.
So you have a switch into kernel mode, locks, validation, a memory allocation, a copy, a free, and then another mode switch back to user land, all of which take time. Finding a way to do what GetPixel does inside your own program will save you tens of thousands of clock cycles.
But another API call may well cost no more than a few memory reads and writes, so it depends very much on which call into the OS you make.
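As a rough sketch of what doing GetPixel's job in your own program could look like: copy the bitmap into your own memory once (on Windows, for instance, with a single GetDIBits call asking for 32-bit BI_RGB data), then read pixels with plain array indexing. The Frame struct, FRAME_INVALID and frame_get_pixel below are made-up names for illustration, and the layout (top-down rows, bytes in blue, green, red, unused order) is an assumption you would have to match to however you actually fetch the bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical snapshot of a bitmap that was copied into process memory
 * ONCE, up front. Assumed layout: top-down rows, 4 bytes per pixel stored
 * as blue, green, red, unused. */
typedef struct {
    const uint8_t *pixels; /* first byte of the top row */
    int width;             /* in pixels */
    int height;            /* in pixels */
    size_t stride;         /* bytes per row, at least width * 4 */
} Frame;

/* Returned for out-of-range coordinates (same value as CLR_INVALID). */
#define FRAME_INVALID 0xFFFFFFFFu

/* Reads one pixel with a bounds check and a few memory reads: no mode
 * switch, no locks, no allocation. The result is packed as 0x00BBGGRR,
 * the same layout as a COLORREF. */
uint32_t frame_get_pixel(const Frame *f, int x, int y)
{
    assert(f != NULL && f->pixels != NULL);

    if (x < 0 || y < 0 || x >= f->width || y >= f->height)
        return FRAME_INVALID;

    const uint8_t *p = f->pixels + (size_t)y * f->stride + (size_t)x * 4;

    /* p[0] = blue, p[1] = green, p[2] = red */
    return (uint32_t)p[2] | ((uint32_t)p[1] << 8) | ((uint32_t)p[0] << 16);
}
```

The expensive part (getting the bits out of the OS) now happens once per bitmap instead of once per pixel, so the saving only shows up if you read many pixels from the same snapshot. If the image changes, the snapshot has to be refreshed.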