Is Speech-to-Text-to-Translation an Impossible Dream?

Question

Theoretically, one could use the microphone of a laptop, tablet, or phone to capture spoken words, convert them to text on the screen and then, by calling an API such as Google Translate, see "a" (not "the" - hardly ever, anyway) rough "draft" of a translation of those words (say, from English to Spanish or from Spanish to English).
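To make the idea concrete, here is a minimal sketch of the capture-to-draft pipeline described above. Everything in it is an assumption for illustration: `DraftPipeline`, the `Recognizer` and `Translator` signatures, and the `fake_*` stand-ins are hypothetical names, not the API of Google Translate or of any real speech library. The source and target languages are passed explicitly rather than auto-detected.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical signatures; neither corresponds to a real library call.
Recognizer = Callable[[bytes, str], str]      # (audio, language) -> transcript
Translator = Callable[[str, str, str], str]   # (text, source, target) -> draft


@dataclass
class DraftPipeline:
    recognize: Recognizer
    translate: Translator
    source: str = "en"  # stated explicitly instead of relying on auto-detect
    target: str = "es"

    def run(self, audio: bytes) -> Tuple[str, str]:
        """Return (transcript, rough draft translation) for one utterance."""
        transcript = self.recognize(audio, self.source)
        draft = self.translate(transcript, self.source, self.target)
        return transcript, draft

    def swapped(self) -> "DraftPipeline":
        """Return the same pipeline with the direction reversed."""
        return DraftPipeline(self.recognize, self.translate, self.target, self.source)


# Stand-ins so the sketch runs without a microphone or network access.
def fake_recognize(audio: bytes, language: str) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


pipeline = DraftPipeline(fake_recognize, fake_translate)
print(pipeline.run(b"hello"))
print(pipeline.swapped().run(b"hola"))
```

In a real tool the two stand-ins would be replaced by an actual recognizer and an actual translation service; the interpreter would flip direction with something like `swapped()` whenever the speaker changes.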

I was thinking this would be useful in a courtroom - as a sort of "hands-free memo pad" for court interpreters.

Theoretically simple, but is it feasible? I see several potential problems:

The software would have to be told which is the target language and which is the source language. Otherwise, there might be a delay and sometimes it would even draw a wrong conclusion, if the device was left to its own devices (auto-detect).

Background noises and voices would have to be filtered out.

The translation (attempt) would only be valid once the speaker had finished their sentence - and how would the software know that? By length of the pauses? Some people pause within a sentence for a long time; some people barely pause between sentences, so...how would that work?

People who do not speak clearly, or who speak with hard-to-understand accents, would trip it up.
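The pause question can be illustrated with a small sketch. It assumes the audio has already been reduced to one loudness value per 20 ms frame; the function name `split_on_pauses`, the thresholds, and the frame size are all made up for the example and are not taken from any particular speech toolkit.

```python
from typing import List, Tuple


def split_on_pauses(energies: List[float], frame_ms: int = 20,
                    silence_threshold: float = 0.1,
                    min_pause_ms: int = 600) -> List[Tuple[int, int]]:
    """Return (start_ms, end_ms) spans of speech separated by long pauses."""
    segments = []
    start = None        # frame index where the current segment began
    last_voiced = None  # index of the most recent voiced frame
    for i, energy in enumerate(energies):
        if energy >= silence_threshold:
            if start is None:
                start = i
            last_voiced = i
        elif start is not None and (i - last_voiced) * frame_ms >= min_pause_ms:
            # Silence has lasted long enough: close the segment.
            segments.append((start * frame_ms, (last_voiced + 1) * frame_ms))
            start = None
    if start is not None:
        # Audio ended while someone was still speaking (or pausing briefly).
        segments.append((start * frame_ms, (last_voiced + 1) * frame_ms))
    return segments


# 200 ms of speech, 800 ms of silence, then speech with a 200 ms hesitation.
frames = [1.0] * 10 + [0.0] * 40 + [1.0] * 5 + [0.0] * 10 + [1.0] * 5
print(split_on_pauses(frames))                    # hesitation kept in one span
print(split_on_pauses(frames, min_pause_ms=200))  # hesitation splits the span
```

The two calls show the trade-off raised above: a long threshold delays the translation, while a short one cuts a sentence in half whenever the speaker hesitates.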

And this is not even mentioning (except here, obliquely) that context is often misconstrued by the robot underlord translators.

My intuition is that if Abraham Lincoln and Martin Luther King were speaking at the same time (which, even in a courtroom, does happen at times), the software would come up with something like this:

For score and seven years ago I am happy to join with you to day. Our fathers brought fourth on this continent, a new nation, in what will go down in history as the greatest conceived in Liberty, and. Dedicated to the perspiration that demonstration for freedom in all men are created equal. The history of our nation.

...and then be translated something like so:

Por puntuación y hace siete años que estoy encantado de unirme a ustedes hoy. Nuestros padres trajeron cuarto en este continente, una nueva nación, en lo que va a pasar a la historia como el mayor concebida en la libertad, y. Dedicada a la transpiración que la demostración por la libertad en todos los hombres son creados iguales. La historia de nuestra nación.

What I'm saying, I guess, is that humans "rock" when it comes to this sort of thing - at least compared to machines (software) in their current degree of sophistication, but do we, or will we, "rock" enough to overcome this problem? Is there a way to surmount these hurdles, at least to a sufficient extent for such a program to be worth the trouble to use? Perfection would be unattainable; matching human skill would also be, I believe, an unreachable goal, especially because of the context factor. Nevertheless: can Speech-to-Text-to-Context-to-Translation be done even relatively well and, if so, how?

The [Google Android Translate App](https://play.google.com/store/apps/details?id=com.google.android.apps.translate&hl=en) will already partly do this. It's made for words or phrases, not dictation, but it's totally feasible. – Icemanind, Jun 10 '15 at 22:02
I checked that out, but based on the comments, my fears/expectations about it seem to be confirmed. Should we be proud of our human superiority, or let down because our brains have yet to figure out how our ears and eyes work? – B. Clay Shannon-B. Crow Raven, Jun 11 '15 at 15:57

score 1 · Accepted Answer · answered Jun 10 '15 at 21:45

I believe it's possible and it can be done relatively well:

The device should be able to understand the context, at least partially, from the data supplied by all kinds of sensors and by its memory. These would need to be finely tuned to give a good result, but isn't that something people do all the time? We evaluate context based on what we see and feel and where we are, as well as on what we've seen, what we've felt and where we've been. A smart device should be able to reproduce that.

The device should be able to guess where a sentence starts and ends based on everything it knows about the given language. People do the same.
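As a toy illustration of using knowledge of the language to guess where a sentence ends, the sketch below combines the pause length with the last word heard. The word list, the thresholds, and the name `looks_finished` are assumptions invented for this example; a real system would more likely rely on a trained language model than on a hand-written list.

```python
# Illustrative and far from complete: English words that rarely end a sentence.
DANGLING_WORDS = {"a", "an", "the", "and", "or", "but", "of", "to", "in",
                  "that", "because"}


def looks_finished(transcript: str, pause_ms: int,
                   base_pause_ms: int = 600,
                   dangling_pause_ms: int = 2000) -> bool:
    """Guess whether the speaker has finished the current sentence."""
    words = transcript.lower().split()
    if not words:
        return False
    last = words[-1].strip(".,;:!?\"'")
    # If the last word usually expects a continuation, wait much longer.
    required = dangling_pause_ms if last in DANGLING_WORDS else base_pause_ms
    return pause_ms >= required


print(looks_finished("all men are created equal", 700))
print(looks_finished("conceived in Liberty, and", 700))
```

With the same 700 ms pause, the first transcript is treated as complete while the second, which ends on "and", is held back until a much longer silence.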

If the device had the same sensors, knowledge and memory that people do, then it could theoretically do the same.

Even a blink of an eye can give a lot of context. I think it all boils down to the complexity and range of the data the device accepts and uses to translate the text correctly. The more it knows, the better it does.

Emil A.

Yes, the "knowledge" part is the rub. People are way more advanced than software is. Software epically fails at sight, and my guess is that it's pretty bad at hearing, too. What I mean is, it doesn't *understand* what it sees, and it doesn't *understand* what it hears. I'd be glad to be proven wrong, but I'm not holding my breath (blue is not my favorite color). – B. Clay Shannon-B. Crow Raven Jun 10 '15 at 21:49
Full understanding is only possible with a human brain directly at work. Can a machine detect when humor is being used, when a person is quoting a famous line from a book or movie, or when someone is referencing a shared experience between the interlocutors? The list goes on. Machines can replace humans in very specific and simple operations, but where true complex cognition is involved: never. – B. Clay Shannon-B. Crow Raven Jun 11 '15 at 15:38
I don't really agree, but I'm marking it as the answer, anyway, as many probably agree and, besides, it's basically an opinion. – B. Clay Shannon-B. Crow Raven Jun 15 '15 at 22:51
