Robust VAD is a non-trivial problem, and there are many approaches.
The approach you take depends on factors such as:
- the specifics of your application and how it will be used
- what sort of assumptions you can make about the audio you will be processing (what types of background noise or non-voice audio you can expect)
- whether or not your system needs to operate in real-time
A simple approach might be to extract a "bag of features" (e.g. f0, noisiness, magnitudes of the first 10 partials) from each audio frame after noise reduction, and to train a machine-learning classifier (an SVM would suffice) on a wide selection of voice and non-voice exemplars.
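Here is a minimal sketch of that idea, assuming NumPy and scikit-learn; the feature set, the `frame_features`/`train_vad` names, and the pitch search range are illustrative choices, not a prescribed recipe:

```python
import numpy as np
from sklearn.svm import SVC

def frame_features(frame, sr, n_partials=10):
    """Crude per-frame features: f0 via autocorrelation, spectral flatness
    as a noisiness measure, and magnitudes at the first few harmonics."""
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(windowed), 1.0 / sr)

    # f0 estimate: autocorrelation peak within a rough 60-400 Hz pitch range
    ac = np.correlate(windowed, windowed, mode="full")[len(windowed) - 1:]
    lo, hi = int(sr / 400), int(sr / 60)
    f0 = sr / (lo + np.argmax(ac[lo:hi])) if hi < len(ac) else 0.0

    # spectral flatness: geometric mean / arithmetic mean of the spectrum
    eps = 1e-10
    flatness = np.exp(np.mean(np.log(spectrum + eps))) / (np.mean(spectrum) + eps)

    # magnitudes at the first n_partials harmonics of the f0 estimate
    partials = [spectrum[np.argmin(np.abs(freqs - f0 * k))]
                for k in range(1, n_partials + 1)]

    return np.concatenate(([f0, flatness], partials))

def train_vad(frames, labels, sr):
    """frames: list of denoised audio frames; labels: 1 = voice, 0 = non-voice,
    taken from your own labelled voice / non-voice exemplars."""
    X = np.array([frame_features(f, sr) for f in frames])
    clf = SVC(kernel="rbf")
    clf.fit(X, labels)
    return clf
```

At runtime you would call `frame_features` on each incoming frame and feed the result to the trained classifier's `predict`.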
However, it is probably best not to treat VAD as a simple framewise audio classification problem, but rather to take time-varying aspects of the audio into account; this will give you a better estimate of where speech segments begin and end. For this you could use an envelope follower or spectral flux, set a high and a low threshold on the resulting envelope, and use those (for example) to control a gate on the audio stream, as sketched below.
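A minimal sketch of that second idea, assuming NumPy and audio already sliced into frames; `spectral_flux`, `envelope_follower`, and `hysteresis_gate` are hypothetical names, and the attack/release times and thresholds would need tuning to your material and noise floor:

```python
import numpy as np

def spectral_flux(frames):
    """Per-frame spectral flux: summed positive change in the magnitude
    spectrum between consecutive frames. Rises sharply at speech onsets.
    `frames` is a 2-D array of shape (n_frames, frame_length)."""
    mags = np.abs(np.fft.rfft(frames * np.hanning(frames.shape[1]), axis=1))
    diff = np.diff(mags, axis=0, prepend=mags[:1])
    return np.sum(np.maximum(diff, 0.0), axis=1)

def envelope_follower(x, sr, attack=0.005, release=0.050):
    """One-pole attack/release follower on the rectified signal (an
    alternative detection signal to spectral flux)."""
    a_att = np.exp(-1.0 / (attack * sr))
    a_rel = np.exp(-1.0 / (release * sr))
    env = np.zeros(len(x))
    prev = 0.0
    for i, v in enumerate(np.abs(x)):
        coeff = a_att if v > prev else a_rel
        prev = coeff * prev + (1.0 - coeff) * v
        env[i] = prev
    return env

def hysteresis_gate(envelope, high, low):
    """Open the gate when the envelope crosses `high`; close it only when it
    falls back below `low`. The two thresholds prevent the gate chattering
    around a single threshold."""
    active = np.zeros(len(envelope), dtype=bool)
    is_open = False
    for i, v in enumerate(envelope):
        if not is_open and v > high:
            is_open = True
        elif is_open and v < low:
            is_open = False
        active[i] = is_open
    return active
```

The boolean output of `hysteresis_gate` then marks which frames (or samples) to pass through as speech; smoothing it further, e.g. with a minimum hold time, reduces spurious short segments.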