Much speech recognition software is built on Hidden Markov Models (HMMs). The approach assumes that a speech signal, viewed on a short enough timescale (say, ten milliseconds), can be reasonably approximated as a stationary process, that is, a process whose statistical properties do not change over time. The speech is divided into 10 ms fragments, each fragment is mapped to a vector of real numbers known as cepstral coefficients, and these vectors are then matched to phonemes. This is a very high-level overview of a typical speech recognition system.
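To make the framing step concrete, here is a minimal sketch using only the standard library. It assumes non-overlapping 10 ms frames, a plain real cepstrum (the inverse DFT of the log magnitude spectrum), and a synthetic sine tone standing in for speech. The function names and the 8 kHz sample rate are made up for illustration. Production recognizers typically use overlapping windows and mel-scaled coefficients, so treat this as a toy version of the idea and not as what any particular engine does.

```python
import cmath
import math


def split_into_frames(samples, sample_rate, frame_ms=10):
    """Cut a sample sequence into consecutive, non-overlapping frames."""
    frame_len = int(sample_rate * frame_ms / 1000)
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def real_cepstrum(frame, n_coeffs=12):
    """Return the first n_coeffs real-cepstrum values of one frame."""
    n = len(frame)
    # Naive O(n^2) DFT: fine for one short frame, far too slow for real use.
    spectrum = [
        sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        for k in range(n)
    ]
    # Log magnitude; the small constant avoids log(0).
    log_mag = [math.log(abs(s) + 1e-10) for s in spectrum]
    # The inverse DFT of the log magnitude is the cepstrum.
    return [
        (sum(v * cmath.exp(2j * math.pi * k * q / n)
             for k, v in enumerate(log_mag)) / n).real
        for q in range(n_coeffs)
    ]


sample_rate = 8000  # assumed sample rate in Hz
# One second of a 440 Hz tone as a stand-in for recorded speech
signal = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]

frames = split_into_frames(signal, sample_rate)
features = [real_cepstrum(frame) for frame in frames]
print(len(frames), len(frames[0]), len(features[0]))
```

Each frame ends up as one short vector of real numbers. Sequences of such vectors are what an HMM-based recognizer would then score against its phoneme models.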
Now, coming back to your requirement: a little research would have brought you to libraries like SpeechRecognition.
Using SpeechRecognition is as simple as this (taken from its source code and tried on my computer):
import speech_recognition as sr
from os import path

AUDIO_FILE = path.join(path.dirname(path.realpath(__file__)), "english.wav")

r = sr.Recognizer()
with sr.AudioFile(AUDIO_FILE) as source:
    audio = r.record(source)  # read the entire audio file

try:
    print("Sphinx thinks you said " + r.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("Sphinx could not understand audio")
except sr.RequestError as e:
    print("Sphinx error; {0}".format(e))
And voila, it works, in ten lines of code, thanks to amazing people developing these :)
Edit - You need to have PocketSphinx set up for this code to work.
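If you are not sure whether everything is in place, a quick check like the one below can help. It is only a sketch: `missing_packages` is a helper name I made up, and it assumes the two packages are importable as `speech_recognition` and `pocketsphinx` (installed with pip as SpeechRecognition and pocketsphinx).

```python
import importlib.util


def missing_packages(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(["speech_recognition", "pocketsphinx"])
if missing:
    print("Not installed: " + ", ".join(missing))
else:
    print("Both packages are available")
```

Anything it reports as missing needs to be installed before the recognition snippet above will run.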