
Using Twilio Media Streams, I want to transcribe outgoing calls. To transcribe the audio in real time, I am using the Deepgram transcription API. I was unsure about the audio format the Twilio stream returns and the format the Deepgram transcription API expects.

I decoded the Twilio stream payload and converted it into a WAV file, which I then sent to the Deepgram API. However, Deepgram's API returns a JSON object with an error.


const WebSocket = require('ws');
const { WaveFile } = require('wavefile');

// Open the Deepgram websocket connection
const deepgram = new WebSocket('wss://api.deepgram.com/v1/listen', {
  headers: {
    Authorization: `Token c5a8a4337xxxxxxxxxx38e56456a52557a5`,
  },
});

// Accumulates decoded audio chunks until there is enough to send
const chunks = [];

// This condition is true when the message from Twilio is a media event
if (msg.event === 'media') {
  if (deepgram.readyState === WebSocket.OPEN) {
    const twilioData = msg.media.payload;

    // Build a wav file from scratch since the payload comes in as raw data
    let wav = new WaveFile();

    // Twilio uses mu-law, so we have to encode for that
    wav.fromScratch(1, 8000, '8m', Buffer.from(twilioData, 'base64'));

    // This library has a handy method to decode mu-law straight to 16-bit PCM
    wav.fromMuLaw();

    // Get the raw audio data in base64
    const twilio64Encoded = wav.toDataURI().split('base64,')[1];

    // Create our audio buffer
    const twilioAudioBuffer = Buffer.from(twilio64Encoded, 'base64');

    // Send data starting at byte 44 to strip the wav header so the model sees only audio data
    chunks.push(twilioAudioBuffer.slice(44));

    // We have to chunk data b/c Twilio sends audio durations of ~20ms and AAI needs a min of 100ms
    const audioBuffer = Buffer.concat(chunks);
    const encodedAudio = audioBuffer.toString('base64');
    deepgram.send(encodedAudio);
  }
}

Response received from the Deepgram API:

{
  type: 'Error',
  variant: 'SchemaError',
  description: 'Could not deserialize last text message: expected value at line 1 column 1',
  message: 'KAAoACgAGADo/9j/yP/I/9j/+P8YADgASAA4ABgA+P/I/7j/uP/I/+j/GABIAFgASAAoAOj/uP+o/6j/yP/o/ygAWABoAFgAGADY/6j/iP+I/7j/CABIAHgAeABYACgA2P+o/4j/mP+4/wgASACEAIQAaAAYALj/bP9M/2z/uP8oAHgApACUAGgA+P+Y/1z/TP9s/7j/OACEAMQAxACEABgAmP88/wz/PP+o/zgApADkANQAeADo/2z/HP8c/2z/2P94ANQA5AC0AEgAqP8s//z+HP98/xgAtAAUASQB1ABIAIj/DP/M/uz+fP8oANQANAE0AcQAKABc/9z+rP78/oj/SADkACQBBAGEANj/PP/c/uz+TP/Y/5QABAEkAeQAWACo...
}

1 Answer


The reason you are getting this error is that Deepgram's WebSocket send() accepts a string | Buffer. A string is interpreted as a JSON control message of the form {type: ...}, e.g. {type: "KeepAlive"}, but you are sending base64-encoded audio as text. You can pass a Buffer into send() straight from the Twilio stream payload without converting it to WAV at all.

See below:

if (msg.event === 'media') {
  if (deepgram.readyState === WebSocket.OPEN) {
    const twilioData = msg.media.payload;
    deepgram.send(Buffer.from(twilioData, 'base64'));
  }
}
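
Text frames, by contrast, are reserved for JSON control messages. As a rough sketch (the 5-second interval is my choice, not from the question), Deepgram's documented KeepAlive message can be sent over the same raw WebSocket to stop the connection from closing during silence:

// Text frames must be JSON control messages, not audio
setInterval(() => {
  if (deepgram.readyState === WebSocket.OPEN) {
    deepgram.send(JSON.stringify({ type: 'KeepAlive' }));
  }
}, 5000);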

Make sure to configure your Deepgram instance to accept mu-law data at 8000 Hz. The code below is taken from the Deepgram docs: https://developers.deepgram.com/docs/getting-started-with-live-streaming-audio

const { Deepgram } = require('@deepgram/sdk');

// Initialize the Deepgram SDK
const deepgram = new Deepgram(deepgramApiKey);

// Create a websocket connection to Deepgram
// In this example, punctuation is turned on, interim results are turned off, and language is set to US English.
const deepgramLive = deepgram.transcription.live({
  punctuate: true,
  interim_results: false,
  language: 'en-US',
  model: 'nova',
  encoding: 'mulaw',  // IMPORTANT: set encoding
  sample_rate: 8000,  // IMPORTANT: set sample rate
});
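
If you would rather keep the raw WebSocket from the question instead of the SDK, the same settings can be passed as query parameters on the /v1/listen URL. A minimal sketch, assuming the encoding, sample_rate, and channels parameters from Deepgram's streaming API docs:

const WebSocket = require('ws');

// Raw-WebSocket equivalent: declare the audio format up front in the URL
const deepgram = new WebSocket(
  'wss://api.deepgram.com/v1/listen?encoding=mulaw&sample_rate=8000&channels=1',
  { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } }
);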

Add a listener to the deepgramLive instance and you are on track. All of the code is available at the link above.
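
For example, a minimal sketch of wiring up the listener with the v2 SDK (event names follow the linked Deepgram guide; the transcript path assumes the standard live-streaming response shape):

// Forward Twilio audio once the Deepgram socket is open
deepgramLive.addListener('open', () => console.log('Deepgram connection open'));

// Each message is a JSON string containing the transcription results
deepgramLive.addListener('transcriptReceived', (message) => {
  const data = JSON.parse(message);
  const transcript = data.channel?.alternatives?.[0]?.transcript;
  if (transcript) {
    console.log(transcript);
  }
});

// Inside the Twilio 'media' handler, send the raw mu-law payload directly:
// deepgramLive.send(Buffer.from(msg.media.payload, 'base64'));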
