
I've been using Google Cloud Video Intelligence for text detection, and now I want to use it for speech transcription as well. I added the SPEECH_TRANSCRIPTION feature alongside TEXT_DETECTION, but the response only contains results for one feature, the last one.

const videoIntelligence = require('@google-cloud/video-intelligence');
const video = new videoIntelligence.VideoIntelligenceServiceClient();

const gcsUri = 'gs://path-to-the-video-on-gcs';
const request = {
  inputUri: gcsUri,
  features: ['TEXT_DETECTION', 'SPEECH_TRANSCRIPTION'],
};

// Detects text and transcribes speech in the video
const [operation] = await video.annotateVideo(request);
const [operationResult] = await operation.promise();

const annotationResult = operationResult.annotationResults[0];
const textAnnotations = annotationResult.textAnnotations;
const speechTranscriptions = annotationResult.speechTranscriptions;

console.log(textAnnotations); // --> []
console.log(speechTranscriptions); // --> [{...}]

Is this a case where annotation is performed on only one feature at a time?

Chukwuma Nwaugha

2 Answers


Annotation will be performed for both features. Below is an example:

const videoIntelligence = require('@google-cloud/video-intelligence');
const client = new videoIntelligence.VideoIntelligenceServiceClient();
const gcsUri = 'gs://cloud-samples-data/video/JaneGoodall.mp4';

async function analyzeVideoTranscript() {
  const videoContext = {
    speechTranscriptionConfig: {
      languageCode: 'en-US',
      enableAutomaticPunctuation: true,
    },
  };

  const request = {
    inputUri: gcsUri,
    features: ['TEXT_DETECTION', 'SPEECH_TRANSCRIPTION'],
    videoContext: videoContext,
  };

  const [operation] = await client.annotateVideo(request);
  console.log('Waiting for operation to complete...');
  const results = await operation.promise();

  // Gets annotations for the video
  console.log('Result------------------->');
  console.log(results[0].annotationResults);

  results[0].annotationResults.forEach((annotationResult, i) => {
    console.log('annotation result no: ' + (i + 1) + ' =======================>');
    console.log('Speech:', annotationResult.speechTranscriptions);
    console.log('Text:', annotationResult.textAnnotations);
  });
}

analyzeVideoTranscript();

N.B.: I have found that the annotation results may not come back in the same order as the declared features, so you may want to adjust your code accordingly.
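For instance, here is a minimal sketch of an order-independent lookup; it assumes the `results` object from the example above and simply picks out whichever annotation result actually carries each feature's data:

// Sketch: select each feature's result by content rather than by array position.
const annotationResults = results[0].annotationResults;
const textResult = annotationResults.find(r => (r.textAnnotations || []).length > 0);
const speechResult = annotationResults.find(r => (r.speechTranscriptions || []).length > 0);
console.log('text annotations:', textResult ? textResult.textAnnotations.length : 0);
console.log('speech transcriptions:', speechResult ? speechResult.speechTranscriptions.length : 0);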

Edit:

You can check how many results you are getting by printing results[0].annotationResults.length. You should get two annotation results, one per declared feature; all you need to do is traverse the response.
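A quick sketch of that check, again assuming the `results` object from the example above:

// Sketch: confirm both features produced their own annotation result.
console.log('annotation results:', results[0].annotationResults.length); // expect 2
results[0].annotationResults.forEach((r, idx) => {
  console.log(idx, 'text:', (r.textAnnotations || []).length,
              'speech:', (r.speechTranscriptions || []).length);
});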

Here is the output of the main example above:

[screenshot of console output omitted]

kiran mathew
  • It doesn't work for me. I got the following result { "texts": [], "transcriptions": [{...}] } – Chukwuma Nwaugha Feb 20 '23 at 18:04
  • Hi @ChukwumaNwaugha, I have updated my answer with screenshots of the output. I am not sure why you are not getting text annotation results. – kiran mathew Feb 21 '23 at 14:29
  • Hi kiran mathew, thanks for following up. And yes, you're correct: I was retrieving only the first item in the array without considering that there could be other array items with data. The culprit was `const annotationResult = operationResult.annotationResults[0]` in my code. – Chukwuma Nwaugha Feb 22 '23 at 19:59

I think it has to do with the async call and the `...` spread operator. I tested this with all the features to be sure, and it worked for me.

const { VideoIntelligenceServiceClient } = require('@google-cloud/video-intelligence');

const gcsUri = 'gs://path/somefile';
const outputUri = 'gs://optional-path-to-save-check-bucket.json';
const videoClient = new VideoIntelligenceServiceClient({
  keyFilename: '/path_to_local/key/used/to_test_this.json'
});

const transcriptConfig = {
  languageCode: 'en-US',
  enableAutomaticPunctuation: true,
  enableSpeakerDiarization: true,
  enableWordConfidence: true,
  speechContexts: []
};

const videoContext = {
  speechTranscriptionConfig: transcriptConfig,
};

// Threw in all features to check myself
const request = {
  inputUri: gcsUri,
  outputUri: outputUri,
  features: [
    'OBJECT_TRACKING',
    'LABEL_DETECTION',
    'SHOT_CHANGE_DETECTION',
    'TEXT_DETECTION',
    'FACE_DETECTION',
    'PERSON_DETECTION',
    'LOGO_RECOGNITION',
    'EXPLICIT_CONTENT_DETECTION',
    'SPEECH_TRANSCRIPTION'
  ],
  videoContext: videoContext
};


async function detectTextAndSpeech() {
  // Detects text and speech in a video
  const [operation] = await videoClient.annotateVideo(request);
  const [operationResult] = await operation.promise();

  const textAnnotations = [];
  const speechTranscriptions = [];

  operationResult.annotationResults.forEach(annotationResult => {
    if (annotationResult.textAnnotations) {
      textAnnotations.push(...annotationResult.textAnnotations);
    }
    if (annotationResult.speechTranscriptions) {
      speechTranscriptions.push(...annotationResult.speechTranscriptions);
    }
  });

  console.log(textAnnotations);
  console.log(speechTranscriptions);
}

detectTextAndSpeech();
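If you also want the plain transcript text, a short sketch like the following can run at the end of detectTextAndSpeech above; the alternatives, transcript, and confidence fields are part of the speech transcription response:

// Sketch: print each transcript alternative from the collected results.
speechTranscriptions.forEach(transcription => {
  (transcription.alternatives || []).forEach(alternative => {
    console.log('Transcript:', alternative.transcript);
    console.log('Confidence:', alternative.confidence);
  });
});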
user1446988