I am having a problem converting an audio file to text with Google Speech-to-Text. I can download the file from Twilio, but when I pass that audio file to Google Speech-to-Text, I get a zero-length response. If I first convert the downloaded file with VLC media player and then pass it to Google Speech-to-Text, I get the correct output. Please help; I have been stuck on this for about a week.

After getting the response from Twilio, I save it to a file with a .wav extension:

InputStream in = new URL(jsonObject.get("redirect_to").toString()).openStream();
Files.copy(in, Paths.get("src/main/resources/mp.wav"), StandardCopyOption.REPLACE_EXISTING);

Below is the Google Speech-to-Text code:

Path path = Paths.get("src/main/resources/mp.wav");
        byte[] content = Files.readAllBytes(path);
        ByteString audioBytes = ByteString.copyFrom(content);

        try (SpeechClient speech = SpeechClient.create()) {
            RecognitionConfig recConfig =
                    RecognitionConfig.newBuilder()
                            .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                            .setLanguageCode("en-US")
                            .setSampleRateHertz(44100)
                            .setModel("default")
                            .setAudioChannelCount(2)
                            .build();


            RecognitionAudio recognitionAudio = RecognitionAudio.newBuilder().setContent(audioBytes).build();

            OperationFuture<LongRunningRecognizeResponse, LongRunningRecognizeMetadata> response =
                    speech.longRunningRecognizeAsync(recConfig, recognitionAudio);

            while (!response.isDone()) {
                System.out.println("Waiting for response...");
                Thread.sleep(10000);
            }

            List<SpeechRecognitionResult> results = response.get().getResultsList();

            for (SpeechRecognitionResult result : results) {
                // There can be several alternative transcripts for a given chunk of speech. Just use the
                // first (most likely) one here.
                SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
                System.out.printf("Transcription: %s%n", alternative.getTranscript());
            }

        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        }
Usama
  • Can you try the troubleshooting steps given in this GCP [doc](https://cloud.google.com/speech-to-text/docs/troubleshooting#returns_an_empty_response)? The zero-length output could be because the audio file encoding didn't match the one specified in `RecognitionConfig`. Can you also provide more details on the conversion you are performing with VLC media player? – Kabilan Mohanraj Mar 12 '22 at 06:55
  • Thanks. I will look into the encoding. For VLC, I use the following procedure: first I open the file in VLC (Convert/Save option), then I change the profile to Audio - CD, which converts to a .wav file, and then I save the file. I think it adds some information to the file's metadata, because the saved file has a different size from the one I downloaded. The downloaded file was 43 KB and the file converted with VLC is around 900 KB. It is a 9 to 10 second audio file. – Usama Mar 12 '22 at 07:41
  • If possible, can you share a sample file that you downloaded from Twilio (before conversion using VLC) so that I can test it on my side? – Kabilan Mohanraj Mar 12 '22 at 08:26
  • Here is the link https://drive.google.com/file/d/1JyJqZ7IT3ippjgWaaC_MuG6yeoJkzjzS/view?usp=sharing – Usama Mar 14 '22 at 04:54
  • I was able to reproduce the empty output. The issue seems to be with the audio file downloaded from Twilio. Can you try downloading the file from Twilio again but with the extension `.opus` instead of `.wav`? The downloaded file name would be `mp.opus`. Please share that file also. – Kabilan Mohanraj Mar 15 '22 at 08:08
  • I converted the `.wav` file to `.opus` and got this result: `Transcription: hello hello hello hello`, which is the expected output. So, there could be an encoding issue with the audio file. – Kabilan Mohanraj Mar 15 '22 at 08:10
  • Thank you for the suggestion. When I investigated the downloaded file, I found that although I can change the file's extension to .wav in Java, an online tool that inspects the file and its data still reports the file type as .mka. Since Google Speech-to-Text does not support .mka files, I think that is causing this issue. – Usama Mar 16 '22 at 04:37
  • If you [take the URL of the recording from Twilio and add a `.mp3` extension to the URL, you can download an `audio/mpeg` file](https://www.twilio.com/docs/voice/api/recording#mp3). Would that work better with the API? – philnash Mar 16 '22 at 04:52
  • @Usama You are right. When I ran `ffprobe`, the format showed as `matroska`, which corresponds to .mka. Can you try @philnash's suggestion? – Kabilan Mohanraj Mar 16 '22 at 06:30
  • Thanks @philnash, I will try it and let you know. – Usama Mar 17 '22 at 07:26
  • @Usama you can take a look at my answer and let me know if it helps. – Kabilan Mohanraj Mar 18 '22 at 15:08
  • @Usama If my answer addressed your requirement, consider upvoting and accepting. If not, let me know so that the answer can be improved. Accepting an answer will help other community members with their research as well :) – Kabilan Mohanraj Mar 21 '22 at 06:27

1 Answer

As @philnash has suggested, by appending a .mp3 extension to the recording URL, the MP3 version of the recording can be downloaded from Twilio. The same applies to the '.wav' extension as well.

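// Append ".mp3" (or ".wav") to the recording URL to request that format from Twilio.
InputStream in = new URL(jsonObject.get("redirect_to").toString() + ".mp3").openStream();
// Save under a file name whose extension matches the requested format.
Files.copy(in, Paths.get("src/main/resources/mp.mp3"), StandardCopyOption.REPLACE_EXISTING);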

I tested this out with a sample Twilio recording and the ffprobe results are below.

Downloaded .wav file

Input #0, **wav**, from 'from-twilio-change-extension.wav':
  Duration: 00:00:14.60, bitrate: 128 kb/s
    Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 8000 Hz, 1 channels, s16, 128 kb/s

Downloaded .mp3 file

Input #0, **mp3**, from 'from-twilio-change-extension.mp3':
  Duration: 00:00:14.68, start: 0.000000, bitrate: 32 kb/s
    Stream #0:0: Audio: mp3, 22050 Hz, mono, fltp, 32 kb/s
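
A similar check can be approximated from Java without ffprobe by looking at the first few bytes of the downloaded file. This is only a rough sketch: the class name AudioContainerSniffer and its method names are made up for illustration, it recognizes just a handful of well-known signatures (the EBML header shared by Matroska and WebM, RIFF/WAVE, OggS, and ID3 or an MPEG frame sync for MP3), and it assumes Java 11+ for readNBytes. It does not replace ffprobe for anything beyond a quick sanity check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: guesses the container of an audio file from its leading bytes.
public class AudioContainerSniffer {

    /** Returns a best-effort container name, or "unknown" if no signature matches. */
    public static String detect(byte[] header) {
        // EBML magic number, used by Matroska (.mka/.mkv) and WebM.
        if (startsWith(header, 0, 0x1A, 0x45, 0xDF, 0xA3)) return "matroska/webm";
        // RIFF container whose form type is WAVE.
        if (startsWith(header, 0, 'R', 'I', 'F', 'F')
                && startsWith(header, 8, 'W', 'A', 'V', 'E')) return "wav";
        if (startsWith(header, 0, 'O', 'g', 'g', 'S')) return "ogg";
        // MP3 files usually start with an ID3v2 tag or directly with a frame sync.
        if (startsWith(header, 0, 'I', 'D', '3')) return "mp3";
        if (header.length >= 2
                && (header[0] & 0xFF) == 0xFF
                && (header[1] & 0xE0) == 0xE0) return "mp3";
        return "unknown";
    }

    /** Reads the first 12 bytes of the file and classifies them. */
    public static String detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return detect(in.readNBytes(12));
        }
    }

    private static boolean startsWith(byte[] data, int offset, int... magic) {
        if (data.length < offset + magic.length) return false;
        for (int i = 0; i < magic.length; i++) {
            if ((data[offset + i] & 0xFF) != magic[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: AudioContainerSniffer <file>");
            return;
        }
        System.out.println(detect(Paths.get(args[0])));
    }
}
```

Running it against the file downloaded without an extension on the URL should report the EBML signature if the file is Matroska, as the ffprobe result in the comments indicated.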

As for the audio encodings supported by the Speech-to-Text API, both WAV and MP3 are supported, but MP3 is a Beta feature available only in version v1p1beta1. So, the client library imports will look like com.google.cloud.speech.v1p1beta1.Packages.... The audio encoding in RecognitionConfig has to match the encoding of the audio file used: for a .wav file, use RecognitionConfig.AudioEncoding.LINEAR16, and for a .mp3 file, use RecognitionConfig.AudioEncoding.MP3. The sample rate and channel count should match the file as well. Per the ffprobe output above, the Twilio .wav is 8000 Hz with 1 channel, not the 44100 Hz and 2 channels set in the question's config.
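
To find out which values to put into RecognitionConfig for a .wav file, the header can be read with the JDK's javax.sound.sampled package. The following is a minimal sketch, assuming the file really is a RIFF/WAVE file; WavFormatProbe is a hypothetical helper name, not part of the Speech-to-Text client library or any other library.

```java
import java.io.File;
import java.io.IOException;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioSystem;
import javax.sound.sampled.UnsupportedAudioFileException;

// Hypothetical helper: reads the format fields from a WAV header using only the JDK.
public class WavFormatProbe {

    /** Returns the audio format declared in the file header. */
    public static AudioFormat probe(File wav)
            throws IOException, UnsupportedAudioFileException {
        return AudioSystem.getAudioFileFormat(wav).getFormat();
    }

    /** True for 16-bit signed little-endian PCM, which is what LINEAR16 expects. */
    public static boolean isLinear16(AudioFormat format) {
        return AudioFormat.Encoding.PCM_SIGNED.equals(format.getEncoding())
                && format.getSampleSizeInBits() == 16
                && !format.isBigEndian();
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: WavFormatProbe <file.wav>");
            return;
        }
        AudioFormat format = probe(new File(args[0]));
        // These are the values the sample rate and channel count settings should agree with.
        System.out.println("sample rate (Hz): " + (int) format.getSampleRate());
        System.out.println("channels:         " + format.getChannels());
        System.out.println("LINEAR16:         " + isLinear16(format));
    }
}
```

If the file is not actually a WAV file (for example a Matroska file saved with a .wav name), getAudioFileFormat throws UnsupportedAudioFileException, which is itself a useful signal that the download is not in the expected format.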


An alternative would be to use the FFmpeg tool to convert audio files into one of the codecs that Speech-to-Text recognizes. More information about the usage of the tool can be found in the FFmpeg documentation. In your scenario, the .mka to .wav/.mp3 conversion can be done from the Java code as shown below.

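String[] ffmpegCommand = {"ffmpeg", "-i", "/full/path/to/inputFile.mka", "/full/path/to/outputFile.wav"};

ProcessBuilder pb = new ProcessBuilder(ffmpegCommand);
pb.inheritIO(); // forward ffmpeg's console output to this process
Process process = pb.start();
// Wait for the conversion to finish before reading the output file.
int exitCode = process.waitFor();
if (exitCode != 0) {
    throw new IOException("ffmpeg exited with code " + exitCode);
}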
Kabilan Mohanraj