
I'm relatively new to programming (one year working as an intern, and finishing my degree), and I might be biting off more than I can chew. This is also my first post here (yay!), so let me explain the problem thoroughly:

I'm currently using the Google Speech-to-Text API to get transcripts of my interviews, which are conducted in English, Spanish, and Portuguese.

The English transcription and translation are perfect; I don't need to update or fix anything. But the Spanish and Portuguese transcripts lack punctuation and speaker diarization/labeling, since those features are not available in the Google Speech-to-Text API for these languages, and they contain grammatical errors, such as repetition and tangled speaker words and utterances (especially when it comes to questions).

I'm using JavaScript and Node.js, and I'm not sure which packages, methods, APIs, or libraries I should use.

So to fix this, I landed on two concepts that I would like to integrate into my service:

Correctness - Eliminates grammar, spelling, and punctuation mistakes and ensures word choices sound natural and fluent.

Clarity - Makes every sentence concise and easy to follow and rewrites hard-to-read sentences.

All without losing the keywords needed for qualitative/quantitative research.

I got some suggestions, but I don't know which one to use: implement an ABNF grammar, use a pre-trained NLP model for those languages, or rely on other paid services (which would add expenses and might not be great for my growth as a developer). These terms and concepts are completely new to me, and I got lost in the documentation trying to determine the best way to deal with it.
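As a lightweight first step before reaching for a full NLP pipeline, Node's built-in `Intl.Segmenter` (available in Node.js 16+) can split already-punctuated Spanish or Portuguese text into sentences with no external dependencies. Note the caveat: it cannot restore punctuation that the transcript is missing; for that a pre-trained punctuation-restoration model is still needed. A minimal sketch:

```javascript
// Split a transcript into sentences using the built-in Intl.Segmenter.
// Locale-aware, no external dependencies (Node.js 16+).
function splitSentences(text, locale = 'es') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  return [...segmenter.segment(text)]
    .map((s) => s.segment.trim())
    .filter(Boolean);
}

// Example: three punctuated Spanish sentences become three segments.
console.log(splitSentences('Buenos días. ¿Cómo estás? Muy bien, gracias.'));
```

This only solves the segmentation half of the problem, but it is free and works for all three languages.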

That being said:

  • How do I make use of ABNF / NLP?
  • Is there a way to do speaker labeling without access to the audio channels, using ABNF/EBNF, NLP, Document AI, or some logical grammatical timing?
  • How do I standardize the text clustering/segmentation so punctuation can be recognized, using grammar or the audio frequency offset?
  • What packages would help me?
  • What other ways could/should I resolve this?

I know this sounds like overkill (and it probably is), but I got really interested; it would help me a lot at the company I work for, and give me some problem-solving experience.

I'd be grateful if anyone could help sort any of these problems out.

When the conversation/interview is between two people with back-and-forth dialogue (question and answer), this is not a problem at all; it's just a matter of:

const wordsInfo = result.alternatives[0].words;

wordsInfo.forEach(a =>
  console.log(` word: ${a.word}, speakerTag: ${a.speakerTag}`)
);

It is a solution, but not for all cases, and I'm still getting used to the college's architecture. It's a project that hasn't been updated in a long time, and I'm probably using an outdated version of the Google Speech-to-Text API.

So I've found a way to get the result time and clustering using xlsx and docx:

formatResultTime(resultEndTime) {
  // resultEndTime arrives as a string like "96.500s"
  const resultInSeconds = parseFloat(resultEndTime.replace('s', ''));
  const pad = (n) => String(n).padStart(2, '0');
  const second = Math.trunc(resultInSeconds % 60);
  const minute = Math.trunc(resultInSeconds / 60) % 60; // % 60, or minutes overflow past 59
  const hour = Math.trunc(resultInSeconds / 3600);
  return `${pad(hour)}:${pad(minute)}:${pad(second)}`;
}

getTxts(results) {
  return results.reduce((acc, { alts, endTime }) => {
    // Collect only the non-empty transcripts from the alternatives.
    const paragraphs = alts.reduce((list, alternative) => {
      if (alternative.transcript) {
        list.push(alternative.transcript);
      }
      return list;
    }, []);

    acc.push({
      text: paragraphs.join(' '),
      endTime: this.formatResultTime(endTime),
    });

    return acc;
  }, []);
}

But the company's QA team asks for fixes: word errors/identification, separate paragraphs, and better timestamps. Here is an example of my output:


"
[00:00:16]  Good morning, Anita.
[00:00:19]  Good morning, Lucas.
[00:00:21]  How's it going?
[00:00:31]  I have a feeling that something is going to happen today.[Should break here] Why do you think that? [Should break here*] I don't know. It's just my gut feeling.
[00:00:34]  Don't worry, it's going to be okay.
[00:00:36]  I hope so.
[00:00:43]                   // [Empty time stamp]
[00:00:54]  Good morning. How may I help you? [Should break here] I'm a guest calling from room 703. My TV remote is not working.
[00:00:57]  Could you please describe your problem in detail?
[00:01:12]  I haven't been able to use the control since last night, every time I want to change the channel. I have to run back and forth and press the, but this makes me very upset. Please get someone to fix it right away.
[00:01:20]  I'm sorry for the inconvenience. I will send the technician up to you right away.  [Should break here] Alright, thank you.
[00:01:34]  Excuse me. Hello, sir. How may I help you? [Should break here] I'm a guest of room 615. My room is right next to an elevator.
[00:01:42]  Yes, I remember. Is there something wrong last night? [Should break here] I kept hearing loud. [ Incorrect punctuation ] Talking nearby.
[00:01:55]  Not only that, but the sound of the elevator moving is also annoying me. Then I don't understand what's wrong with you. Out to the staff kept moving furniture all day.
[00:02:02]  The noise of all these things is very disturbing to me, makes it impossible to sleep.
[00:02:04]  I am sorry to hear that.
[00:02:23]  But yesterday, the staff of the hotel did not move the furniture. Maybe they were just moving the luggage for guests.[Should break here] I don't need to know. I want to change the room immediately. [Should break here] I'm so sorry for the bad experience that you went through, but there are no rooms available now.
"

And how it should be:

"
[00:00:16] [Lucas]: Good morning, Anita.
[00:00:19] [Anita]: Good morning, Lucas.
"

OR

"
[00:00:16] [Speaker0]: Good morning, Anita.
[00:00:19] [Speaker 1]: Good morning, Lucas.
"

How do I make it follow a pattern? (These problems usually happen in Portuguese and Spanish.) I speak all three languages to some degree, so answers in any of them are fine, although the community's general preference is English.
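For the target pattern itself, once each utterance carries a start time and a speaker tag (e.g. from the word-level info), the line format is a small pure function. A sketch, where the utterance shape `{ startTime, speakerTag, transcript }` is my assumption, not something the API returns directly:

```javascript
// Render a "[HH:MM:SS] [Speaker N]: text" line from one utterance.
// startTime is the API-style seconds string, e.g. "16.500s".
function formatUtterance({ startTime, speakerTag, transcript }) {
  const s = parseFloat(startTime.replace('s', ''));
  const pad = (n) => String(n).padStart(2, '0');
  const clock =
    `${pad(Math.trunc(s / 3600))}:${pad(Math.trunc(s / 60) % 60)}:${pad(Math.trunc(s % 60))}`;
  return `[${clock}] [Speaker ${speakerTag}]: ${transcript}`;
}

console.log(
  formatUtterance({
    startTime: '16s',
    speakerTag: 1,
    transcript: 'Good morning, Anita.',
  })
);
// → [00:00:16] [Speaker 1]: Good morning, Anita.
```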

ADDITIONAL INFORMATION: browsing around, I've found some Grammarly threads that might be useful given the necessary expertise (not my case). Here they are (link to the Grammarly site):

The 14 Punctuation Marks

Comma, Period/Full Stop, Colon, Ellipsis, Semicolon, Apostrophe, Hyphen, Dash, Quotation Marks, Question Mark, Exclamation Point, Slash, Parentheses and Brackets

Punctuation Rules

Comma Splice, Comma Before And, Comma Before Too, Comma After Question Mark, Commas in Dates, Oxford Comma, Quotation Marks in Titles, Quotation Marks Around a Word, Quotation Marks in Dialogue, Capitalization in Quotes, Semicolon vs. Colon vs. Dash, Capitalization After Colons

Grammar and Mechanics

Grammar Checker, Spell Check, Parts of Speech, Contractions, Verb Tenses, Subject-Verb Agreement, Syntax, Clauses, Sentence Fragments, Run-On Sentences, Capitalization Rules, Abbreviations, Common Grammar Mistakes

Any package or API library for text RPC, real-time transcription, or checking the resulting text would help.

Edit 1:

I've found a way to format the text better: update the Google Speech-to-Text client to V1 and make use of the channel tags and channel labels of the video/audio file. The logic looks like this:


formatResultTime(resultEndTime) {
  const resultInSeconds = parseFloat(resultEndTime.replace('s', ''));
  const pad = (n) => String(n).padStart(2, '0');
  const second = Math.trunc(resultInSeconds % 60);
  const minute = Math.trunc(resultInSeconds / 60) % 60;
  const hour = Math.trunc(resultInSeconds / 3600);
  return `${pad(hour)}:${pad(minute)}:${pad(second)}`;
}

extractTexts(results) {
  const resultsWithBestAlternative = results
    .filter((result) => result.channelTag && result.resultEndTime)
    .reduce((acc, curr) => {
      // Skip results whose best transcript we have already kept.
      if (
        acc.find(
          ({ alternatives }) =>
            alternatives.transcript === curr.alternatives[0].transcript
        )
      ) {
        return acc;
      }

      return [
        ...acc,
        {
          resultEndTime: curr.resultEndTime,
          channelTag: curr.channelTag,
          alternatives: curr.alternatives[0],
        },
      ];
    }, []);

  return resultsWithBestAlternative.map((result) => ({
    l1: `[${this.formatResultTime(result.resultEndTime)}] [P:${result.channelTag}]`,
    l2: `${result.alternatives.transcript}`,
  }));
}

And the output is great, but since I'm converting MP3/MP4 files into .wav, the channels and labels can get mixed up, so an utterance isn't broken where it should be and the same text gets repeated for both channels.
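As a stopgap for the duplicated channels, near-identical transcripts can be collapsed by comparing word overlap instead of requiring exact string equality (the two duplicated paragraphs below differ in a few words, so the strict `===` check in `extractTexts` misses them). A sketch using a simple Jaccard word-overlap score; the 0.8 threshold is an arbitrary assumption to tune:

```javascript
// Fraction of shared words between two transcripts (Jaccard index).
function wordOverlap(a, b) {
  const setA = new Set(a.toLowerCase().split(/\s+/));
  const setB = new Set(b.toLowerCase().split(/\s+/));
  const shared = [...setA].filter((w) => setB.has(w)).length;
  return shared / (setA.size + setB.size - shared);
}

// Drop results whose transcript overlaps an already-kept one by >= threshold.
function dropDuplicateChannels(results, threshold = 0.8) {
  const kept = [];
  for (const result of results) {
    const isDuplicate = kept.some(
      (k) => wordOverlap(k.transcript, result.transcript) >= threshold
    );
    if (!isDuplicate) kept.push(result);
  }
  return kept;
}
```

This is only a text-side patch; splitting the channels properly before transcription remains the real fix.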

"
[00:00:36] [P:1]  Oi Marcia Dominica. Você é da onde?

[00:00:45] [P:1]  Eu sou da sillováquia. E você está aonde? Agora? Agora está aqui em São Paulo no Brasil.

[00:01:17] [P:1]  Em São Paulo que lugar de São Paulo São Paulo é grande, né? É São Paulo é grande cidade de São Paulo, hã. Você conhece Vila Olímpia? Ah sim. Aham. Eu moro aqui em Vila Olímpia. Ah é um outubro vai fazer 2 anos já e vamos lá e como é que apareceu o português na sua vida. Vamos Aqui começou. Não quando eu cheguei aqui eu já tinha base dá para fazer assim dá para falar assim já tinha base já tava entendendo um pouquinho.

[00:01:17] [P:2]  Em São Paulo que lugar de São Paulo São Paulo é grande, né? É São Paulo é grande cidade de São Paulo, hã. Você conhece Vila Olímpia? Ah sim. Aham. Eu moro aqui Vila, Olímpia. Ah é um outubro vai fazer 2 anos já e vamos lá e como é que apareceu o português na sua vida. Como que começou Não quando eu cheguei aqui eu já tinha base dá para fazer assim dá para falar assim já tinha base já tava entendendo um pouquinho.

[00:02:18] [P:2]  Hã, porque eh eh meu marido é brasileiro e a gente tem um filho Nathan, ele tem 8 anos agora então Eh, eh é muito bom e eu sempre Tava escutando o meu marido falar no português com ele. Não não tem muito tempo para falar porque se ele vai pras como trabalhando mas no fim de de dia sabe coisinhas pequenas, você quer comer? Eu tô com fome. Que que você fez essa perguntinha. Então já tinha a base, mas quando eu cheguei aqui eu falei nossa, eu não sei nada tem que aprender mais e eu achei não teve a pandemia na pandemia e então tudo as coisas eu queria entrar na escola, mas eh tava tudo fechado, eu falei nossa como eu vou aprender não tem amigos não tem trabalho no meu marido não fica em casa trabalhando o dia inteiro, não tem ninguém para praticar.
"

So I'm looking into FFmpeg for splitting the channels of the original media and converting it into a multichannel .wav file based on the number of speakers. I lack knowledge of audio manipulation, but I'm studying it as I post this.

and it looks like this:

// FFmpeg step: convert .mp4/.mp3 to .wav (fragment of a larger method)
  try {
    await this.createVideoAndJobFolder(
      `${pathToTmp}${video}`,
      `${pathToTmp}${video}/${id}`
    );
    await writeFile(origin, Body);

    spawnSync(
      'ffmpeg',
      [
        '-i',
        origin,
        '-f',
        'wav',
        `${pathToTmp}${videoOriginDotWav}`,
        '-y',
        '-hide_banner',
      ],
      {
        stdio: 'inherit',
      }
    );

    const mediaFileBuffer = await readFile(
      `${pathToTmp}${videoOriginDotWav}`
    );

    await this.s3Svc
      .putObject({
        Bucket: settings.AWS_BUCKET,
        Key: destination,
        ContentType: 'audio/x-wav',
        Body: mediaFileBuffer, // readFile already returns a Buffer
      })
      .promise();

    return videoOriginDotWav;
  } catch (e) {
    console.error('Error**', e.stack);
    return false;
  }
}

If anyone can point me to a way to separate the audio waves, differentiate them, and tune the paragraph and speaker recognition (> 35 dB?), I'd appreciate it. I've seen plenty of content on audio isolation and frequency-based coding, and I think the key is to understand FFmpeg filters such as:

  1. amerge (merge multiple audio inputs) ***
  2. dialoguenhance (dialogue enhancer)
  3. speechnorm (speech normalizer) ***
  4. apsyclip (psychoacoustic clipper)
  5. automatic speech recognition (external, not an FFmpeg filter itself)
  6. channelsplit ***
  7. join (join multiple input streams) ***
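A hedged sketch of the channelsplit step: build the FFmpeg arguments in a small pure function (easy to test), then pass them to `spawnSync` as in the conversion snippet above. The `channel_layout=stereo` value is an assumption and must match the actual layout of the .wav file:

```javascript
// Build FFmpeg arguments that split a stereo .wav into one mono file
// per channel (channelsplit), normalizing each channel for speech
// (speechnorm) along the way.
function buildChannelSplitArgs(input, leftOut, rightOut) {
  return [
    '-i', input,
    '-filter_complex',
    // Split [0:a] into [l]/[r], then normalize each branch for speech.
    '[0:a]channelsplit=channel_layout=stereo[l][r];' +
      '[l]speechnorm[left];[r]speechnorm[right]',
    '-map', '[left]', leftOut,
    '-map', '[right]', rightOut,
    '-y', '-hide_banner',
  ];
}

// Usage, with the same spawnSync pattern as the conversion code:
// const { spawnSync } = require('child_process');
// spawnSync('ffmpeg',
//   buildChannelSplitArgs('in.wav', 'left.wav', 'right.wav'),
//   { stdio: 'inherit' });
```

With one mono file per speaker, each can be sent to the API separately (or as a multichannel file with channel recognition enabled), which should remove the mixed-up channel tags.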
