Strictly speaking, only step 4 is implemented by VoiceXML; the other steps are handled by the platform or by external code. VoiceXML is the standards-based mechanism for implementing step 4, but if all you need is limited audio output and simple input, it may be overkill, depending on the solutions available to you.
The following is just one example of how you might solve your problem, and it is fairly hypothetical, since I don't know anything about your environment or constraints.
On most VoiceXML platforms, your VoiceXML application is executed when a call is received. If this is a servlet/ASP-based solution, you can perform steps 2 and 3, then generate and return the VoiceXML that plays the menu, gathers the input, and moves to the next step. If this is a static VoiceXML 2.1 solution, you can use a data element to make an HTTP request to a system that can perform these actions. That system will need to return XML that the JavaScript/ECMAScript in your VoiceXML application can parse in order to provide the correct audio output and input processing.
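To make the dynamically generated variant more concrete, here is a minimal sketch of server-side code that builds a VoiceXML menu document. It is written in Python only for illustration; a servlet or ASP page would do the same thing. The function name `build_menu_vxml`, the prompt text, and the target URLs are all hypothetical, and the choices would come from whatever your own steps 2 and 3 produce.

```python
from xml.sax.saxutils import escape, quoteattr


def build_menu_vxml(prompt_text, choices):
    """Build a VoiceXML 2.1 document that plays a prompt and takes one DTMF key.

    choices is a list of (digit, next_url) pairs. Both values are
    hypothetical here; in practice they come from your own lookup code.
    """
    lines = [
        '<?xml version="1.0" encoding="UTF-8"?>',
        '<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">',
        '  <menu id="main">',
        # Restrict the menu to keypad input only (no speech recognition).
        '    <property name="inputmodes" value="dtmf"/>',
        "    <prompt>" + escape(prompt_text) + "</prompt>",
    ]
    for digit, next_url in choices:
        # quoteattr adds the surrounding quotes and escapes &, <, > and quotes.
        lines.append(
            "    <choice dtmf=%s next=%s/>" % (quoteattr(digit), quoteattr(next_url))
        )
    lines += [
        # Replay the prompt if the caller is silent or presses an unknown key.
        "    <noinput><reprompt/></noinput>",
        "    <nomatch><reprompt/></nomatch>",
        "  </menu>",
        "</vxml>",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_menu_vxml(
        "For sales, press 1. For support, press 2.",
        [("1", "sales.vxml"), ("2", "support.vxml")],
    ))
```

Your web layer would return this string as the HTTP response body to the VoiceXML platform. Building the markup by string concatenation keeps the sketch short; a templating engine or XML library is probably a better fit for anything larger.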
Since you are asking about VoiceXML, I'm assuming your challenge is the telephony aspect of the problem. Unless you already have a system available, choosing and activating an on-premises or hosted solution is far more complicated than the call flow code involved. Depending on your requirements, solutions range from a single-line analog modem that supports audio output and DTMF input to massively scaled on-premises and hosted platforms that handle tens of thousands of concurrent calls and implement VoiceXML as well as a wide range of other call flow technologies.