Has anyone tried using Alibaba Cloud SDKs to build a real-time video call app? When I asked support, they said the video call service is not available on the international Alibaba Cloud, only on the Chinese one. They also mentioned that I could try building it with their SDKs. I'm currently asking them which SDKs they mean.

If anyone has experience in this field or with these technologies, please help me figure out whether it is worth building this on Alibaba Cloud or going with another cloud service, since Alibaba Cloud does not support multi-cloud.

It would be much appreciated, thanks!

Related documents from the China-based Alibaba Cloud:

Speech to text from audio data in RTC [Windows]

Speech to text from audio data in RTC [Android]

Real-time speech recognition

Alibaba Cloud Machine Translation

1 Answer

The good news: there are many potential providers and options for cobbling something together.

The bad news: this problem is not easy, and even the products from the top research and product teams are not very robust.

You can find the list of all self-serve machine translation API providers at modelfront.com/compare. Most of those same providers also offer speech recognition APIs, and speech recognition is also available on many devices.

But, depending on your scenario, you may be better off using a speech-to-speech approach (vs. gluing together multiple systems), and even a local model (vs. an external API), for three reasons: quality, latency, and the interaction of the two, which is that users don't want to wait for the full sentence, but also don't like translated text flickering as new words come in.
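To make the flickering trade-off concrete, here is a minimal sketch of one simple way to stabilise the displayed text: only show words that two consecutive partial translations agree on, and never change a word once it is shown. This is an illustrative heuristic under my own assumptions, not the method of any provider or paper mentioned here; the names `StableDisplay` and `stable_prefix` are hypothetical, and real systems would work on subword tokens rather than whitespace-split words.

```python
from typing import List


def stable_prefix(previous: List[str], current: List[str]) -> List[str]:
    """Return the leading words that both hypotheses share."""
    prefix: List[str] = []
    for old, new in zip(previous, current):
        if old != new:
            break
        prefix.append(new)
    return prefix


class StableDisplay:
    """Hypothetical helper: shows only words confirmed by two
    consecutive partial translations, and never rewrites shown words."""

    def __init__(self) -> None:
        self.previous: List[str] = []   # last hypothesis received
        self.committed: List[str] = []  # words already on screen

    def update(self, hypothesis: str) -> str:
        tokens = hypothesis.split()
        agreed = stable_prefix(self.previous, tokens)
        # Append-only: extend the screen text, never edit what is shown.
        if len(agreed) > len(self.committed):
            self.committed = self.committed + agreed[len(self.committed):]
        self.previous = tokens
        return " ".join(self.committed)


if __name__ == "__main__":
    display = StableDisplay()
    for partial in [
        "the cat",
        "the cat sat",
        "the cat sits on",
        "the cat sits on the mat",
    ]:
        # Each partial is a fresh re-translation of the audio so far.
        print(repr(display.update(partial)))
```

The cost is visible in the output: the screen lags one update behind the translator, and a word that was committed too early stays wrong. That is the quality versus latency interaction in miniature.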

If you search r/machinetranslation for speech OR simultaneous OR interpreting, you'll find:

  • a launch announcement for "interpreter mode" from Google Assistant

  • a Baidu announcement on a quality improvement

  • two articles from Mattia di Gangi at FBK

  • the flickering paper from Google (Re-translation versus Streaming for Simultaneous Translation)

  • the Translatotron article and paper from Google

  • a landscape survey from Apple

  • the NeurST toolkit GitHub repo from ByteDance (TikTok)

There was a keynote from Baidu Research on this at WMT 2019, and recently a bit more on flickering from Google, but both focussed on their own products, not offerings for external developers.

Adam Bittlingmayer
  • Thanks for the descriptive answer. I am now planning to go with an open-source media server like Jitsi and combine it with an end-to-end translation service like Media Translation from Google or Speech Translation from Microsoft. I'm leaning towards Microsoft Speech Translation, as I mainly need translation for Chinese, Japanese and English. But I don't know whether it is feasible to combine **Jitsi (jigasi)** with Microsoft Speech Translation to create a real-time video call application for web and Android. I would appreciate your thoughts on this. Thank you. – Pisethpanha Chhean Feb 01 '21 at 08:01
  • Bad news: Microsoft Azure is not available in my country (Cambodia), so I can't test the end-to-end speech translation service. Do you know of any similar services? The main languages are Chinese, Japanese and English. Thank you in advance. – Pisethpanha Chhean Feb 02 '21 at 10:01
  • My advice would be to just decouple your business and/or account location from your physical location, from the perspective of the IaaS providers. Get a credit card or whatever in a region they support, with the help of a friend or whatever you need to do, or you're going to hit this again and again. – Adam Bittlingmayer Feb 03 '21 at 12:06