I'm trying to determine the fluency of a speaker using the Google Speech-to-Text API.
So far I have found that the API (betav1) can report how long each word took to speak (its start time and end time).
Wikipedia describes fluency as follows:
Oral fluency or speaking fluency is a measurement both of production and reception of speech, as a fluent speaker must be able to understand and respond to others in conversation. Spoken language is typically characterized by seemingly non-fluent qualities (e.g., fragmentation, pauses, false starts, hesitation, repetition) because of ‘task stress.’ How orally fluent one is can therefore be understood in terms of perception, and whether these qualities of speech can be perceived as expected and natural (i.e., fluent) or unusual and problematic (i.e., non-fluent)
I can see that pauses, repetitions, etc. can be derived from the per-word timings the API returns. But judging them against a norm is difficult, as I can't find any standard values.
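To make this concrete, here is a rough sketch of the kind of raw features I have in mind. It assumes the word timings have already been pulled out of the API response into plain `(word, start_sec, end_sec)` tuples. That layout, the function name `fluency_features`, and the 0.5-second pause threshold are all my own placeholders, not anything defined by the API or by a fluency standard.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (text, start time in seconds, end time in seconds).
Word = Tuple[str, float, float]


def fluency_features(words: List[Word], pause_threshold: float = 0.5) -> Dict[str, float]:
    """Compute simple timing features from per-word timestamps.

    pause_threshold is an arbitrary cut-off for a "long" pause,
    not a standard value.
    """
    if not words:
        return {
            "words_per_minute": 0.0,
            "long_pauses": 0,
            "mean_pause_sec": 0.0,
            "immediate_repetitions": 0,
        }

    pairs = list(zip(words, words[1:]))

    # Silence between the end of one word and the start of the next.
    gaps = [max(0.0, nxt[1] - cur[2]) for cur, nxt in pairs]

    # Count gaps at or above the (assumed) threshold.
    long_pauses = sum(1 for gap in gaps if gap >= pause_threshold)

    # Count a word immediately followed by the same word, case-insensitively.
    repetitions = sum(1 for cur, nxt in pairs if cur[0].lower() == nxt[0].lower())

    # Speaking rate over the span from the first word's start to the last word's end.
    duration = words[-1][2] - words[0][1]
    words_per_minute = len(words) / duration * 60.0 if duration > 0 else 0.0

    return {
        "words_per_minute": words_per_minute,
        "long_pauses": long_pauses,
        "mean_pause_sec": sum(gaps) / len(gaps) if gaps else 0.0,
        "immediate_repetitions": repetitions,
    }


if __name__ == "__main__":
    sample = [("I", 0.0, 0.2), ("I", 0.3, 0.5), ("went", 1.5, 1.9), ("home", 2.0, 2.5)]
    print(fluency_features(sample))
```

This only yields raw numbers (speaking rate, pause counts, immediate repetitions); it says nothing about what counts as fluent, which is exactly the part I'm missing.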
Is there an established approach to this? Can anyone give me a guideline for detecting fluency with the Google API (or any other valid approach using open-source speech libraries or external software)?
It's completely fine if I am going in entirely the wrong direction; I just need a proper guideline for implementing this feature.