As stated already, you could probably get a speedup from implementing your
vector math using SSE instructions (be aware of the effects discussed
here - also for the other approach). This approach would allow the code
stay concise and maintainable.
I assume, however, your question is about "packet traversal" (or something
like it), in other words to process multiple scalar values each of a
different ray:
In principle it should be possible deferring the shading to another pass.
The SIMD packet could be repopulated with a new ray once the bare marching
pass terminates and the temporary result is stored as input for the shading
pass. This will allow to parallelize a certain, case-dependent percentage
of your code exploting all four SIMD lanes.
Tiling the image and indexing the rays within it in Morton-order might be
a good idea too in order to avoid cache pressure (unless your geometry is
strictly procedural).
You won't know whether it pays off unless you try. My guess is, that if it
does, the amount of speedup might not be worth the complication of the code
for just four lanes.
Have you considered using an SIMT architecture such as a programmable GPU?
A somewhat up-to-date programmable graphics board allows you to perform
raymarching at interactive rates (see it happen in your browser here).