Speech recognition with python-telegram-bot without downloading an audio file

Question

I'm developing a Telegram bot in which the user sends a voice message, and the bot transcribes it and sends back what was said as text. For that I am using the python-telegram-bot library and the speech_recognition library with the Google engine. My problem is that the voice messages sent by users are .mp3, but in order to transcribe them I need to convert them to .wav, and to do that I have to download the file sent to the bot. Is there a way to avoid that? I understand this is neither efficient nor safe: with many active users at once, writing to the same file names will cause race conditions, and the files take up a lot of space.


import subprocess

import speech_recognition as sr
from telegram.ext import MessageHandler, Filters


def voice_handler(update, context):
    bot = context.bot
    file = bot.getFile(update.message.voice.file_id)
    file.download('voice.mp3')
    filename = "voice.wav"
    
    # convert mp3 to wav file
    subprocess.call(['ffmpeg', '-y', '-i', 'voice.mp3',
                         'voice.wav'])

    # initialize the recognizer
    r = sr.Recognizer()
    
    # open the file
    with sr.AudioFile(filename) as source:
    
        # listen for the data (load audio to memory)
        audio_data = r.record(source)
        # recognize (convert from speech to text)
        text = r.recognize_google(audio_data, language='ar-AR')
        
        
def main() -> None:
    updater.dispatcher.add_handler(MessageHandler(Filters.voice, voice_handler))

Some functions accept a `file-like` object instead of a `filename`, and you can use `io.BytesIO` to create a `file-like` object in memory, then write to it and read from it like a normal file. — furas, Jun 24 '22 at 13:57
You run the external program `ffmpeg`, so you may have to save the audio to a file, because `ffmpeg` can't work with a Python object. Alternatively, you can check whether `ffmpeg` can read a stream from `stdin` and send the result to `stdout`. — furas, Jun 24 '22 at 13:59
Funnily enough, `Speech Recognition` uses `Google Speech-To-Text` and has to be given `wav`, but the `Speech-To-Text API` also works with `mp3`. Maybe you could use the `Speech-To-Text API` directly, but this requires registering your own application with Google to get an `API Key`: [Speech-To-Text](https://cloud.google.com/speech-to-text) — furas, Jun 24 '22 at 14:12
Here is the [source code](https://github.com/Uberi/speech_recognition/blob/master/speech_recognition/__init__.py#L858) in which `Speech Recognition` uses `Google Speech-To-Text`. Funnily enough, this code converts `wav` to `flac` before sending. But you could modify it to send `mp3` directly (without using `AudioData` and without converting to `flac`). — furas, Jun 24 '22 at 14:37
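
To illustrate the `stdin`/`stdout` idea from the comments, here is a minimal sketch, assuming `ffmpeg` is installed and on the PATH. The helper names `ffmpeg_pipe_command` and `convert_to_wav` are made up for this example; `pipe:0` and `pipe:1` are ffmpeg's names for stdin and stdout.

```python
import subprocess


def ffmpeg_pipe_command(input_format="mp3"):
    """Build an ffmpeg command that reads stdin and writes WAV to stdout."""
    return [
        "ffmpeg",
        "-loglevel", "error",  # only report errors on stderr
        "-f", input_format,    # declare the input format (stdin has no file extension)
        "-i", "pipe:0",        # read the input from stdin
        "-f", "wav",           # produce WAV ...
        "pipe:1",              # ... and write it to stdout
    ]


def convert_to_wav(audio_bytes, input_format="mp3", run=subprocess.run):
    """Convert audio bytes to WAV bytes without touching the disk.

    `run` is injectable only so the function can be exercised without ffmpeg.
    """
    result = run(
        ffmpeg_pipe_command(input_format),
        input=audio_bytes,        # sent to ffmpeg's stdin
        stdout=subprocess.PIPE,   # capture the converted audio
        stderr=subprocess.PIPE,
        check=True,               # raise CalledProcessError if ffmpeg fails
    )
    return result.stdout
```

The returned bytes could then be wrapped in `io.BytesIO(...)` and passed to `sr.AudioFile`, which accepts a file-like object as well as a filename. One caveat: since stdout is not seekable, ffmpeg cannot go back and fill in the length fields of the WAV header, so it is worth checking that the reader accepts the result before relying on this.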

score 0 · Answer 1 · answered Jun 25 '22 at 06:20

As pointed out in the comments, one option is to download the file to memory rather than to disk. If that does not work for you, you can give the file a unique name each time, e.g. based on the user's user_id or a UUID, which prevents files from being overwritten.
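
A minimal sketch of the unique-name idea, using only the standard library. The helper name `per_request_paths` is hypothetical; it puts each request's files in their own temporary directory, which is deleted when the block exits, so concurrent users never share a path and nothing accumulates on disk.

```python
import contextlib
import os
import tempfile
import uuid


@contextlib.contextmanager
def per_request_paths(user_id):
    """Yield unique (mp3_path, wav_path) inside a temp dir removed on exit."""
    with tempfile.TemporaryDirectory(prefix=f"voice_{user_id}_") as tmp_dir:
        stem = uuid.uuid4().hex  # random name, unique per call
        yield (
            os.path.join(tmp_dir, stem + ".mp3"),
            os.path.join(tmp_dir, stem + ".wav"),
        )
```

Inside the handler this might be used as `with per_request_paths(update.effective_user.id) as (mp3_path, wav_path): ...`, with the download, the ffmpeg call and the recognition all happening inside the `with` block.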

furas · Answer 2 · 2022-06-25T13:41:37.373

Funnily enough, Speech Recognition uses Google Speech-To-Text and has to be given wav, but the documentation for the Google Speech-To-Text API shows that it also works with mp3 and a few other formats. See all supported audio encodings

When I checked the source code of Speech Recognition, I saw that it takes wav but converts it to flac before sending it to Google Speech-To-Text.

You can try to use the Speech-To-Text API directly, but this may require registering your own application with Google to get an API key. See more: Speech-To-Text

EDIT:

I took the source code in which Speech Recognition uses Google Speech-To-Text, combined it with some code from the Google documentation, and created my own version which can send mp3 directly.

It uses the API key from Speech Recognition: 'AIzaSyBOti4mM-6x9WDnZIjIeyEU21OpBXqWBgw'

import requests
import base64

#filename = 'test/audio/audio2-hello-world-of-python.wav'
filename = 'test/audio/audio2-hello-world-of-python.mp3'

with open(filename, 'rb') as fh:
    file_data = fh.read()

# --- Google Speech-To-Text ---

data = {
  "audio": {
    # b64encode() returns bytes; JSON needs a str, so decode it
    "content": base64.b64encode(file_data).decode('ascii')
  },
  "config": {
    "enableAutomaticPunctuation": True,
#    "encoding": "LINEAR16",  # WAV
    "encoding": "MP3",        # MP3 
    "languageCode": "en-US",
    "model": "video",
  }
}

payload = {
    'key': 'AIzaSyBOti4mM-6x9WDnZIjIeyEU21OpBXqWBgw',
}

url = 'https://speech.googleapis.com/v1p1beta1/speech:recognize'
response = requests.post(url, params=payload, json=data)

#print(response.status_code)
#print(response.text)

data = response.json()
text = data['results'][0]['alternatives'][0]['transcript'] 
print(text)
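
Indexing `data['results'][0]` directly raises a `KeyError` when the response has no `results` key (for example on an error or when nothing was recognized), and it only looks at the first result. Below is a more defensive sketch; the helper name `extract_transcript` is made up, and it assumes the response shape used in the code above (a list of `results`, each with a list of `alternatives` that carry a `transcript`).

```python
def extract_transcript(response_json):
    """Join the top alternative of every result; return None if there is none."""
    parts = []
    for result in response_json.get("results", []):
        alternatives = result.get("alternatives") or []
        # The first alternative is treated as the preferred one, as in the code above.
        if alternatives and alternatives[0].get("transcript"):
            parts.append(alternatives[0]["transcript"].strip())
    return " ".join(parts) or None
```

With this helper, the last lines of the script could become `text = extract_transcript(response.json())` followed by a check for `None`.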

In this code I read the file from disk, but with io.BytesIO you can probably get the data from the bot without writing to disk.

file = bot.getFile(update.message.voice.file_id)
with io.BytesIO() as fh:
    file.download(out=fh)  # write into the in-memory buffer, not to a path on disk
    #fh.seek(0)  # move to the beginning of file
    #file_data = fh.read()
    file_data = fh.getvalue()
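
The commented-out `seek(0)` matters if you prefer `read()` over `getvalue()`. A small self-contained illustration follows; `fake_download` is a hypothetical stand-in for the bot's download call and simply writes some bytes into the buffer.

```python
import io


def fake_download(out):
    """Hypothetical stand-in for file.download(out=...): writes bytes into `out`."""
    out.write(b"ID3 fake audio payload")


with io.BytesIO() as fh:
    fake_download(fh)
    after_write = fh.read()     # position is at the end after writing, so this is empty
    everything = fh.getvalue()  # ignores the position and returns the whole buffer
    fh.seek(0)                  # rewind to the beginning
    after_seek = fh.read()      # now read() returns the whole content as well
```

Both `getvalue()` and `seek(0)` followed by `read()` give the same bytes; `getvalue()` is just the shorter route, and it has to be called before the `with` block closes the buffer.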

EDIT:

Minimal working bot code, which I tested with uploaded .mp3 files (not with voice messages):

import os
import telegram
from telegram.ext import Updater, MessageHandler, CommandHandler, Filters
import requests
import base64
import io

# --- functions ---

def speech_to_text(file_data, encoding='LINEAR16', lang='en-US'):
    
    data = {
      "audio": {
        # b64encode() returns bytes; JSON needs a str, so decode it
        "content": base64.b64encode(file_data).decode('ascii')
      },
      "config": {
        "enableAutomaticPunctuation": True,
    #    "encoding": "LINEAR16",  # WAV
    #    "encoding": "MP3",        # MP3
        "encoding": encoding, 
        "languageCode": lang,
        "model": "video",
      }
    }
    
    payload = {
        'key': 'AIzaSyBOti4mM-6x9WDnZIjIeyEU21OpBXqWBgw',
    }

    url = 'https://speech.googleapis.com/v1p1beta1/speech:recognize'
    response = requests.post(url, params=payload, json=data)
    #print('response:', response.text)

    try:
        data = response.json()
        return data['results'][0]['alternatives'][0]['transcript']
    except Exception as ex:
        print('Exception:', ex)
        print('response:', response.text)
        #return None
    
    #return None
        
# --- init ---

TOKEN = os.getenv('TELEGRAM_TOKEN')

bot = telegram.Bot(TOKEN)

updater = Updater(token=TOKEN, use_context=True)
dispatcher = updater.dispatcher

# --- commands ---

# - upload audio file -

def translate_audio(update, context):
    print('translate_audio')

    with io.BytesIO() as fh:
        #context.bot.get_file(update.message.voice.file_id).download(out=fh)
        context.bot.get_file(update.message.audio.file_id).download(out=fh)
        file_data = fh.getvalue()
        
    text = speech_to_text(file_data, 'MP3')
    if not text:
        text = "I don't understand this file"

    update.message.reply_text(text)

dispatcher.add_handler(MessageHandler(Filters.audio, translate_audio))

# - record voice -

def translate_voice(update, context):
    print('translate_voice')

    with io.BytesIO() as fh:
        context.bot.get_file(update.message.voice.file_id).download(out=fh)
        #context.bot.get_file(update.message.audio.file_id).download(out=fh)
        file_data = fh.getvalue()
        
    text = speech_to_text(file_data, 'MP3')
    if not text:
        text = "I don't understand this file"

    update.message.reply_text(text)

dispatcher.add_handler(MessageHandler(Filters.voice, translate_voice))

# --- start ---

print('starting ...')    
updater.start_polling()
updater.idle()
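
Both handlers above pass `'MP3'` unconditionally, and the code was only tested with uploaded .mp3 files. Recorded Telegram voice messages are normally OGG/Opus rather than MP3, so the encoding may need to depend on the file. Here is a hedged sketch that picks the encoding from the message's `mime_type`; the mapping and the helper name `guess_encoding` are assumptions for illustration, not something tested against the API.

```python
# Assumed mapping from MIME type to Speech-To-Text encoding names.
MIME_TO_ENCODING = {
    "audio/mpeg": "MP3",
    "audio/mp3": "MP3",
    "audio/ogg": "OGG_OPUS",   # assumes the Ogg container holds Opus audio
    "audio/flac": "FLAC",
    "audio/wav": "LINEAR16",
    "audio/x-wav": "LINEAR16",
}


def guess_encoding(mime_type, default="MP3"):
    """Return an encoding name for a MIME type, falling back to `default`."""
    if not mime_type:
        return default
    # Drop parameters such as "; codecs=opus" and normalize the case.
    base = mime_type.split(";")[0].strip().lower()
    return MIME_TO_ENCODING.get(base, default)
```

In `translate_voice` this could be used as `speech_to_text(file_data, guess_encoding(update.message.voice.mime_type))`. Note that for `OGG_OPUS` the request config generally also needs a `sampleRateHertz` value, which `speech_to_text` as written does not send, so that function would need an extra parameter.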
