1) Yes, there is (or at least can be) a performance difference, depending on the actual numbers and on the hardware you are running on!
GPUs (usually) execute threads in so-called "waves". All threads in a wave run in a SIMD-like fashion: they always execute the same operation at the same time. The exact number of threads per wave is vendor-specific, but is usually 32 (all NVIDIA GPUs I know of) or 64 (most AMD GPUs).
A single group of threads can be distributed over multiple waves, but a single wave can only execute threads of the same group. Therefore, if your number of threads per group is not a multiple of the hardware's wave size, some lanes in a wave are "idling" (they actually execute the same operations as the other ones, but aren't allowed to write to memory), so you are "losing" performance that you would get with a better thread count. For example, a group of 48 threads on 32-wide hardware still occupies two full waves (64 lanes), so 16 lanes do no useful work.
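As a side note: if you can target Shader Model 6, the shader can query the actual wave width at runtime via the WaveGetLaneCount() intrinsic. A minimal sketch (the buffer name is just for illustration):

```hlsl
// Writes the hardware wave size to a buffer so the CPU can read it back.
// Requires Shader Model 6.0 (wave intrinsics).
RWStructuredBuffer<uint> WaveSizeOut : register(u0);

[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    if (id.x == 0)
        WaveSizeOut[0] = WaveGetLaneCount(); // 32 on NVIDIA, 64 on GCN-era AMD, etc.
}
```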
2) You would most likely select a thread count that suits your hardware (64 is a good default, since it is also a multiple of 32), and use branching to mark threads as "inactive" that fall outside of your matrix (you can pass the size of the matrix/data to the shader via a constant buffer). Since these inactive threads aren't doing anything at all, the hardware can simply mask them as "read-only" (similar to how they would be handled if the number of threads per group were smaller than the wave size), which is quite cheap. If all threads in a wave are marked inactive, the hardware can even choose to skip the work of that wave completely, which would be optimal.
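A minimal HLSL sketch of that approach (the buffer and constant names are just for illustration, assuming a row-major float matrix):

```hlsl
cbuffer MatrixInfo : register(b0)
{
    uint Width;   // number of columns
    uint Height;  // number of rows
};

RWStructuredBuffer<float> Matrix : register(u0);

[numthreads(8, 8, 1)] // 64 threads per group, a multiple of both 32 and 64
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Threads outside the matrix simply do nothing.
    if (id.x >= Width || id.y >= Height)
        return;

    Matrix[id.y * Width + id.x] *= 2.0f; // whatever your actual operation is
}
```

On the CPU side you would then dispatch enough groups to cover the whole matrix, rounding up, e.g. `context->Dispatch((Width + 7) / 8, (Height + 7) / 8, 1)` in D3D11.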
You could also use padding to make sure that your matrix/data is always a multiple of the number of threads per group, e.g. with zeroes or the identity matrix or whatever fits your operation. However, whether this can be done depends on the application, and I would assume that branching is just as fast, if not faster, in most cases.
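For completeness, with padded data the bounds check disappears entirely. A sketch under the assumption that the buffer was allocated with the rounded-up size (`PaddedWidth = (Width + 7) / 8 * 8`, and the same for the height):

```hlsl
cbuffer MatrixInfo : register(b0)
{
    uint PaddedWidth; // must be a multiple of 8 here
};

RWStructuredBuffer<float> PaddedMatrix : register(u0);

[numthreads(8, 8, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Every thread is in bounds by construction, so no branch is needed.
    // Operating on the padding elements (e.g. zeroes) is harmless.
    PaddedMatrix[id.y * PaddedWidth + id.x] *= 2.0f;
}
```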