I'm building a tool that can automatically join a Google Meet session, record the audio, and generate real-time, speaker-aware notes. The tool should identify each speaker and accurately associate what they say with their name.
Is there an official Google API available for this purpose, or are there any other recommended approaches to achieve this functionality?
I tried implementing this with Google Cloud Speech-to-Text, but as far as I could tell the service needed the meeting to be pre-recorded before it could transcribe the audio. The speaker recognition was also not satisfactory, because it does not give us the actual speaker names. I have also tried scraping the Google Meet captions, but that does not seem to be a reliable solution. What I want is something like webkitSpeechRecognition, but with speaker identification.
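
To make the gap concrete, here is a minimal sketch of the output I am after. It assumes the transcription step returns each word together with an anonymous integer speaker tag. The `Word` type, the `to_notes` function, and the tag-to-name mapping are hypothetical names used only for illustration and are not part of any Google API. Getting that mapping from tag to real participant name is the piece I am missing.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    speaker_tag: int  # anonymous label from diarization (assumed shape)


def to_notes(words, names):
    """Group consecutive words by speaker tag and label each turn.

    `names` maps a speaker tag to a display name (hypothetical input);
    unknown tags fall back to a generic "Speaker N" label.
    """
    turns = []
    current_tag = None
    current_words = []
    for w in words:
        # A change of tag closes the current turn and starts a new one.
        if current_words and w.speaker_tag != current_tag:
            turns.append((current_tag, " ".join(current_words)))
            current_words = []
        current_tag = w.speaker_tag
        current_words.append(w.text)
    if current_words:
        turns.append((current_tag, " ".join(current_words)))

    notes = []
    for tag, text in turns:
        label = names.get(tag, "Speaker " + str(tag))
        notes.append(label + ": " + text)
    return notes
```

With only anonymous tags, every line falls back to "Speaker N", which is why I am looking for a way to obtain the actual participant names from the meeting.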