In D3D9, the driver architecture is such that resources have to be validated every time they are used. This adds overhead to many API calls, and is part of the reason you should optimize to do more work with fewer calls.
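As a rough illustration of "do more with fewer calls", here is a minimal C++ sketch contrasting one draw call per quad with a single batched draw. The function names and the assumption that all quads already share one vertex buffer, one index buffer, and the same render state are hypothetical; the point is only that every Draw*Primitive call pays the per-call validation cost.

```cpp
#include <d3d9.h>

// Assumes the index buffer holds indices for all quads laid out consecutively
// (6 indices per quad) and all quads share the same buffers and state.
void DrawQuadsBatched(IDirect3DDevice9* device, UINT quadCount)
{
    // One call covers every quad: 4 vertices and 2 triangles per quad.
    device->DrawIndexedPrimitive(
        D3DPT_TRIANGLELIST,
        0,                  // BaseVertexIndex
        0,                  // MinVertexIndex
        quadCount * 4,      // NumVertices referenced
        0,                  // StartIndex
        quadCount * 2);     // PrimitiveCount (triangles)
}

void DrawQuadsNaive(IDirect3DDevice9* device, UINT quadCount)
{
    // quadCount separate calls, each validated by the runtime/driver.
    // Reuses the first quad's 6 indices and offsets into the vertex buffer.
    for (UINT i = 0; i < quadCount; ++i)
        device->DrawIndexedPrimitive(D3DPT_TRIANGLELIST, i * 4, 0, 4, 0, 2);
}
```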
In addition, on older Windows platforms (e.g. Windows XP) the D3D driver lived entirely in kernel mode, so API calls triggered a user-mode to kernel-mode transition (this is not the case on Windows Vista, 7, or 8, which have a user-mode front end, much like OpenGL).
In D3D10, resources are validated only when they are created, most likely because D3D10 is layered on top of WDDM, which moved the D3D runtime from fully kernel-mode to largely user-mode. Under WDDM, if the D3D runtime crashes it will not take down the whole system (BSOD), so per-call validation is not as important. You do not have to be nearly as paranoid about these things when you're running in user mode.
Now, as for the performance difference between 8-bit integer and 16-bit fp, that is actually to be expected. It is not so much because one is integer and the other is FP (GPUs are great with FP), but because one is twice the size of the other. GPUs have a lot of memory bandwidth, but you can still improve performance simply by using the smallest data type that does the job.
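To make the size difference concrete, here is a hedged D3D9 sketch of two hypothetical vertex declarations for the same layout: one stores a color as four normalized 8-bit channels (D3DDECLTYPE_UBYTE4N, 4 bytes), the other as four half-floats (D3DDECLTYPE_FLOAT16_4, 8 bytes). The attribute choice is just an assumption for illustration; the same bandwidth argument applies to texture formats.

```cpp
#include <d3d9.h>

// 8-bit color: position (12 bytes) + UBYTE4N color (4 bytes) = 16 bytes per vertex.
const D3DVERTEXELEMENT9 declByteColor[] =
{
    { 0,  0, D3DDECLTYPE_FLOAT3,  D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 0, 12, D3DDECLTYPE_UBYTE4N, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR,    0 },
    D3DDECL_END()
};

// 16-bit fp color: position (12 bytes) + FLOAT16_4 color (8 bytes) = 20 bytes per vertex.
const D3DVERTEXELEMENT9 declHalfColor[] =
{
    { 0,  0, D3DDECLTYPE_FLOAT3,    D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_POSITION, 0 },
    { 0, 12, D3DDECLTYPE_FLOAT16_4, D3DDECLMETHOD_DEFAULT, D3DDECLUSAGE_COLOR,    0 },
    D3DDECL_END()
};

// Either array would be passed to IDirect3DDevice9::CreateVertexDeclaration();
// the 8-bit version simply moves half as many bytes per color fetch.
```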