BTW, operand data embedded right into an instruction is called "immediate" data.
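For example, in NASM syntax (comments show the resulting machine-code bytes; `bits 16` so these match the 8086 encodings):

```nasm
bits 16
mov  ax, 0x1234    ; B8 34 12  opcode, then the imm16 stored little-endian
add  bx, 8         ; 83 C3 08  small immediates can use a sign-extended imm8 form
```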
That's not how modern CPUs fetch code, but having a data bus narrower than the longest instruction is not actually a problem.
The 8086, for example, had to deal with instruction encodings wider than its 16-bit data bus, without any L1 cache to hide that effect.
As I understand it, 8086 just keeps reading words (16 bits) into a decode buffer until the decoder sees a whole instruction at once. If there's a leftover byte, it's moved to the front of the decode buffer. Fetch of the next insn actually happens in parallel with execution of the just-decoded one, but code fetch was still the major bottleneck on 8086.
So the CPU just needs a buffer as large as the largest allowed instruction (excluding prefixes). That's 6 bytes for 8086, and this is exactly the size of 8086's prefetch buffer.
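For illustration, one of the maximal 6-byte encodings is a 16-bit immediate store through a 16-bit displacement (NASM syntax; encoding in the comment):

```nasm
bits 16
; opcode + ModRM + disp16 + imm16 = 1+1+2+2 = 6 bytes: fills the whole queue
mov  word [bx+0x1234], 0x5678   ; C7 87 34 12 78 56
```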
The "until the decoder sees a whole instruction" is a simplification: 8086 decodes prefixes separately, and "remembers" them as modifiers. 8086 lacks the 15-byte max total insn length limitation of later CPUs, so you could fill a 64k CS segment with repeated prefixes on one instruction).
Modern CPUs (like the Intel P6 and SnB families) fetch code from L1 I-cache in chunks of at least 16 bytes, and actually decode multiple instructions in parallel. @Harold's answer nicely covers the rest of your question.
See also Agner Fog's microarch guide, and other links from the x86 tag wiki to learn more about how modern x86 CPUs work, in detail.
Also, David Kanter's Sandy Bridge writeup has details of the front-end for that microarchitecture family.
