
I'm testing out Google Video Intelligence speech-to-text for transcribing podcast episodes with multiple speakers.

I've extracted an example and published that to a gist: output.json.

jq '.response.annotationResults[].speechTranscriptions[].alternatives[] | {startTime: .words[0].startTime, segment: .transcript}' output.json

The command above prints the startTime of each segment along with the segment itself (full output in jq-output.json):

{
  "startTime": "6.400s",
  "segment": "Hi, my name is Melinda Smith from Noble works. ...snip"
}
{
  "startTime": "30s",
  "segment": " Any Graham as a tool for personal and organizational ...snip"
}

What I'm aiming for is to have the speakerTag for each segment included in my jq output.

This is where I'm stuck. To start, each entry in .alternatives[] contains .transcript (a string with that segment's text), .confidence, and .words[] (an array with each word of that segment and the time it was spoken).

That part of the JSON is how I get the first part of the output. Then, after it has gone through each segment of the transcript, there is one last .alternatives[] entry at the bottom, containing (again) every word of the entire transcript, one at a time, along with its startTime, endTime, and speakerTag.

Here's a simplified example (as valid JSON) of what I mean:

{ "response": { "annotationResults": [ { "speechTranscriptions": [
  { "alternatives": [ {
      "transcript": "Example transcript segment",
      "words": [
        { "word": "Example",    "startTime": "0s" },
        { "word": "transcript", "startTime": "1s" },
        { "word": "segment",    "startTime": "2s" }
      ] } ] },
  { "alternatives": [ {
      "transcript": "Another transcript segment",
      "words": [
        { "word": "Another",    "startTime": "3s" },
        { "word": "transcript", "startTime": "4s" },
        { "word": "segment",    "startTime": "5s" }
      ] } ] },
  { "alternatives": [ {
      "words": [
        { "word": "Example",    "startTime": "0s", "speakerTag": 1 },
        { "word": "transcript", "startTime": "1s", "speakerTag": 1 },
        { "word": "segment",    "startTime": "2s", "speakerTag": 1 },
        { "word": "Another",    "startTime": "3s", "speakerTag": 2 },
        { "word": "transcript", "startTime": "4s", "speakerTag": 2 },
        { "word": "segment",    "startTime": "5s", "speakerTag": 2 }
      ] } ] }
] } ] } }

What I was thinking is to somehow go through jq-output.json and match each startTime with its corresponding speakerTag found in the original Video Intelligence API output.

.response.annotationResults[].speechTranscriptions[].alternatives[] | ( if .words[].speakerTag then {time: .words[].startTime, speaker: .words[].speakerTag} else empty end)

I tried a few variations of this, with the idea of printing out only the startTime and speakerTag, then matching the values in a next step. My problem was not understanding how to only print the startTime if it has a corresponding speakerTag.

As mentioned in the comments, it would be preferable to generate this result in one command, but I was just trying to break the problem down into parts I could attempt to understand.

  • Just include input/output examples and your failed attempts with a brief description of the problem. This is too long and broad – oguz ismail May 10 '20 at 13:59
  • Unfortunately, given the size and complexity of the JSON gist, it's not clear what you mean by the "bottom in the final .alternatives[] array". Since you seem to have a good understanding of the structure of the original JSON, it should be easy for you to provide a very succinct piece of JSON that captures the essence of the problem. Please also note that based on your description, it would almost certainly be best to perform the entire task using just one invocation of jq. – peak May 10 '20 at 21:50
  • Good point @peak. I added a simplified example in the question, and a corresponding [trimmed-output.json](https://gist.github.com/infominer33/712c25d9aee4c493b05f4055ac2ffb23) gist, for clarification. – InfoMiner May 11 '20 at 02:50

1 Answer


My problem was not understanding how to only print the startTime if it has a corresponding speakerTag.

This could be accomplished using the filter:

.response.annotationResults[].speechTranscriptions[].alternatives[].words[]
 | select(.speakerTag)
 | {time: .startTime, speaker: .speakerTag}
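
Against the simplified example in the question (and assuming the full response is in output.json), running this with jq -c emits one line per speaker-tagged word:

{"time":"0s","speaker":1}
{"time":"1s","speaker":1}
{"time":"2s","speaker":1}
{"time":"3s","speaker":2}
{"time":"4s","speaker":2}
{"time":"5s","speaker":2}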

So perhaps the following is a solution (or at least close to a solution) to the main problem:

.response.annotationResults[].speechTranscriptions[].alternatives[]
| (INDEX(.words[] | select(.speakerTag); .startTime) | map_values(.speakerTag)) as $dict
| {startTime: .words[0].startTime, segment: .transcript}
| . + {speaker: $dict[.startTime]}
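
Note that $dict above is rebuilt for each alternative, so it only "sees" that alternative's own words. If the speaker-tagged words live in a separate, final alternatives entry (as in the simplified example), one could instead build the lookup table once over the whole document. A sketch, assuming the startTime values match exactly between the segment words and the speaker-tagged words:

([.response.annotationResults[].speechTranscriptions[].alternatives[].words[]
   | select(.speakerTag)]
 | INDEX(.startTime)
 | map_values(.speakerTag)) as $dict
| .response.annotationResults[].speechTranscriptions[].alternatives[]
| select(.transcript)
| {startTime: .words[0].startTime, segment: .transcript}
| . + {speaker: $dict[.startTime]}

On the simplified example this would yield (with jq -c):

{"startTime":"0s","segment":"Example transcript segment","speaker":1}
{"startTime":"3s","segment":"Another transcript segment","speaker":2}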
– peak
  • I'm marking that as correct, because I'm realizing I didn't even form my question properly. Basically I should ignore the sections that aren't divided by speaker and just work with the second half, which is. Thanks again! – InfoMiner May 13 '20 at 10:15