WebRTC is often described as peer to peer, but there are some lesser-known caveats. A direct peer-to-peer connection is not always possible because of how networks are deployed in practice, most prominently symmetric NATs (network address translators), which are common on mobile networks and some 'badly behaved' networks.
On most other networks, WebRTC does not need to send data through your server: the peers use the STUN protocol to discover their public address and port, then use hole punching to actually transmit data directly. You still need a server to set up signalling, because signalling is not part of WebRTC. For signalling you can use SIP, WebSockets, etc.
That said, as a fallback when p2p is not possible, you can route the traffic through a server. WebRTC supports this approach through TURN servers. ICE is used to pick the best available path (p2p using the addresses discovered via STUN, or relaying data through a TURN server). Note that the latter is not peer to peer, and relaying the data requires a high-bandwidth TURN server, which can get very expensive.
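To make the STUN/TURN fallback concrete, here is a minimal sketch. The interfaces are declared locally (they only mirror the shape of the browser's `RTCConfiguration`), the host names and credentials are placeholders I made up, and `buildIceConfig`, `candidateType` and `isRelayed` are hypothetical helper names. The second half reads the `typ` field of an ICE candidate string, which tells you whether a candidate is a local address (`host`), a STUN-discovered public address (`srflx`) or a TURN relay (`relay`).

```typescript
// Mirrors the shape of the browser's RTCConfiguration; declared locally so
// the sketch compiles without DOM typings.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface IceConfig {
  iceServers: IceServer[];
}

// Placeholder hosts and credentials: substitute your own STUN/TURN deployment.
function buildIceConfig(turnUser: string, turnPassword: string): IceConfig {
  return {
    iceServers: [
      // STUN: lets a peer learn its public address and port.
      { urls: "stun:stun.example.org:3478" },
      // TURN: relay used only when no direct path can be found.
      { urls: "turn:turn.example.org:3478", username: turnUser, credential: turnPassword },
    ],
  };
}

type CandidateType = "host" | "srflx" | "prflx" | "relay";

// Extracts the token that follows "typ" in an ICE candidate line.
function candidateType(candidate: string): CandidateType | null {
  const tokens = candidate.trim().split(/\s+/);
  const i = tokens.indexOf("typ");
  if (i === -1) return null;
  const t = tokens[i + 1];
  if (t === "host" || t === "srflx" || t === "prflx" || t === "relay") return t;
  return null;
}

// True when the candidate is a TURN relay, i.e. not a direct p2p path.
function isRelayed(candidate: string): boolean {
  return candidateType(candidate) === "relay";
}
```

In a browser you would pass the object from `buildIceConfig` to `new RTCPeerConnection(...)` and feed `event.candidate.candidate` from the `icecandidate` event into `candidateType`.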
Now, let me address some incorrect assumptions in your points:
1. You can handle it as you have stated.
2. This step will be accomplished by the TURN server. WebRTC does not use WebSockets internally. Media is carried as SRTP (Secure RTP, with the keys negotiated over DTLS), and data channels use SCTP over DTLS. The transport underneath is normally UDP, with TCP as a fallback when firewall traversal requires it. So you cannot run WebRTC's media over WebSockets; that is a completely different approach.
3. Same as point 2.
4. On-the-fly conversion is normally neither done nor recommended (the conversion lag would defeat the real-time nature of the session). Any such conversion should be done in step 1.
Before the session is initiated, the clients exchange SDP (Session Description Protocol) messages in the signalling phase, which tell both clients the codecs to use for audio and video.
5. Once again, the thing to note is that after the session is initialized, whether p2p or through a TURN server, the data should flow uninterrupted to both the clients. It's the essence of WebRTC.
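Regarding point 4, the codecs show up in the SDP as `a=rtpmap` lines. As a rough illustration, here is a small sketch that lists them from an SDP string. `listCodecs` and the `Codec` type are names I made up, and the sample SDP in the usage note is a trimmed, hand-written fragment rather than a real browser offer.

```typescript
interface Codec {
  payloadType: number;
  name: string;
  clockRate: number;
}

// Collects every "a=rtpmap:<pt> <name>/<clock rate>[/<channels>]" line.
function listCodecs(sdp: string): Codec[] {
  const codecs: Codec[] = [];
  for (const line of sdp.split(/\r?\n/)) {
    const m = /^a=rtpmap:(\d+) ([^\/\s]+)\/(\d+)/.exec(line.trim());
    if (m) {
      codecs.push({
        payloadType: Number(m[1]),
        name: m[2],
        clockRate: Number(m[3]),
      });
    }
  }
  return codecs;
}

// Hand-written sample fragment, not a complete session description.
const sampleSdp = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");
```

Calling `listCodecs(sampleSdp)` yields three entries (opus, PCMU, VP8). In a browser you could run the same function on `offer.sdp` after `createOffer()` to see what the peer is proposing.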
If you want something else, try WebSockets. They need nothing more than WebSocket support on the client and server side. A WebSocket connection uses the usual TCP/IP stack and starts as an ordinary HTTP request; an Upgrade request to the server then switches the connection from HTTP to the WebSocket protocol at the application layer. This allows bidirectional flow of data between server and clients, and you are freer to do computations on the data.
A likely use case is to use WebSockets for signalling before initiating the WebRTC session between clients.
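For reference, this is roughly what the client's upgrade request looks like on the wire. The sketch just builds the text of the handshake; `buildUpgradeRequest` is a hypothetical helper, and the host, path and key in the usage note are illustrative values. A real client library generates a fresh random key for every connection and does all of this for you.

```typescript
// Builds the HTTP request a WebSocket client sends to ask for the upgrade.
// `key` must be a base64-encoded 16-byte random value chosen by the client.
function buildUpgradeRequest(host: string, path: string, key: string): string {
  return [
    `GET ${path} HTTP/1.1`,
    `Host: ${host}`,
    "Upgrade: websocket",
    "Connection: Upgrade",
    `Sec-WebSocket-Key: ${key}`,
    "Sec-WebSocket-Version: 13",
    "", // blank line terminates the header block
    "",
  ].join("\r\n");
}
```

For example, `buildUpgradeRequest("server.example.com", "/chat", "dGhlIHNhbXBsZSBub25jZQ==")` produces the request headers; if the server agrees, it answers with `101 Switching Protocols` and both sides speak the WebSocket protocol on the same TCP connection from then on.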
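A signalling server really only has to pass offers, answers and ICE candidates from one client to the other. Here is a minimal, transport-agnostic sketch of that relay logic. `SignallingRelay`, the message shape and the `Send` callback are all assumptions of mine; in a real server each `send` would wrap something like `socket.send(JSON.stringify(msg))` on that client's WebSocket.

```typescript
type SignalKind = "offer" | "answer" | "candidate";

interface SignalMessage {
  kind: SignalKind;
  from: string;
  to: string;
  payload: string; // SDP text or a serialized ICE candidate
}

type Send = (msg: SignalMessage) => void;

class SignallingRelay {
  // Prototype-less object so peer ids never collide with built-in properties.
  private peers: { [id: string]: Send } = Object.create(null);

  // Register a connected client together with a way to reach it.
  join(id: string, send: Send): void {
    this.peers[id] = send;
  }

  leave(id: string): void {
    delete this.peers[id];
  }

  // Forwards the message untouched; returns false if the recipient is absent.
  relay(msg: SignalMessage): boolean {
    const send = this.peers[msg.to];
    if (!send) return false;
    send(msg);
    return true;
  }
}
```

The server never inspects the payload here: it only routes it. Once both clients have exchanged offer, answer and candidates this way, the media itself flows over WebRTC and no longer touches the signalling server.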
P.S. Because of my low reputation, I can't post more than 2 links. Please use Wikipedia to look up any terms that are unclear to you.