
I'm adding some domain-specific words to the prebuilt models. The words are being recognized correctly, but they are not being capitalized as I specified in the transcription (I trained the model using audio + human-labeled transcripts).

There's no reference anywhere in the documentation to how this is processed, how to prepare the training data, or how much data is necessary to make this possible.

How do you specify that a word should be capitalized using the Azure Cognitive Service Speech Studio?


1 Answer


Check out the text normalization options. Text normalization is the ability to modify how the speech engine normalizes text. For example, when the user says, "I would like to order 2 4-piece chicken nuggets," it could be recognized as "two four piece" (default) or "2 four piece" (inverse text normalization, or ITN).

The following normalization rules are automatically applied to transcriptions:

  • Use lowercase letters.
  • Remove all punctuation except apostrophes within words.
  • Expand numbers into words/spoken form, such as dollar amounts.
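As an illustration of these rules (the sentence below is my own placeholder, not from the documentation), a human-labeled training line is normalized roughly like this:

  Human label:  I paid $50 for Contoso's software.
  Normalized:   i paid fifty dollars for contoso's software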

To select a different text normalization option, you will need to modify your integration code as shown below:

  // renderWebChat, createDirectLine and createCognitiveServicesSpeechServicesPonyfillFactory
  // are exposed by the Web Chat bundle as window.WebChat.
  const {
    renderWebChat,
    createDirectLine,
    createCognitiveServicesSpeechServicesPonyfillFactory
  } = window.WebChat;

  (async function () {
    renderWebChat({
      directLine: createDirectLine({
        secret: 'YOUR_DIRECT_LINE_SECRET'
      }),
      language: 'en-US',
      webSpeechPonyfillFactory: await createCognitiveServicesSpeechServicesPonyfillFactory({
        credentials: {
          region: 'YOUR_REGION',
          subscriptionKey: 'YOUR_SUBSCRIPTION_KEY'
        },
        // Controls which normalized form of the recognized text Web Chat uses.
        textNormalization: 'itn'
      })
    }, document.getElementById('webchat'));
  })();

Supported text normalization options are "display" (default), "itn", "lexical", and "maskeditn".
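To make the options concrete, the chicken nuggets utterance above would come back roughly as follows under each option (illustrative output based on the example sentence; exact service output may differ):

  lexical:    i would like to order two four piece chicken nuggets
  itn:        i would like to order 2 four piece chicken nuggets
  maskeditn:  i would like to order 2 four piece chicken nuggets
  display:    I would like to order 2 four-piece chicken nuggets.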

Once you create projects in Speech Studio, you can reference the assets you create in your applications through the REST APIs. For pronunciation assessment, set the parameter EnableMiscue to true. The following parameter can be included in the query string of the REST request:

Parameter: format (set it to detailed). Specifies the result format. Accepted values are simple and detailed. Simple results include RecognitionStatus, DisplayText, Offset, and Duration. Detailed responses include four different representations of display text. The default setting is simple.
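For example, a recognition request with format=detailed could look like the sketch below (the endpoint path follows the standard Speech-to-text REST API for short audio; the region, key, and file name are placeholders):

  // Sketch: Speech-to-text REST API for short audio, asking for the detailed format.
  // Requires Node 18+ for the global fetch; sample.wav is assumed to be 16 kHz mono PCM.
  const fs = require('fs');

  const region = 'YOUR_REGION';
  const key = 'YOUR_SUBSCRIPTION_KEY';
  const url = `https://${region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1`
    + '?language=en-US&format=detailed';

  (async function () {
    const response = await fetch(url, {
      method: 'POST',
      headers: {
        'Ocp-Apim-Subscription-Key': key,
        'Content-Type': 'audio/wav; codecs=audio/pcm; samplerate=16000',
        'Accept': 'application/json'
      },
      body: fs.readFileSync('sample.wav')
    });

    console.log(await response.json()); // RecognitionStatus, NBest, ...
  })();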

A typical response for detailed recognition (the service also expects audio data, which is not included in this sample):

{
  "RecognitionStatus": "Success",
  "Offset": "1236645672289",
  "Duration": "1236645672289",
  "NBest": [
    {
      "Confidence": 0.9052885,
      "Display": "What's the weather like?",
      "ITN": "what's the weather like",
      "Lexical": "what's the weather like",
      "MaskedITN": "what's the weather like"
    },
    {
      "Confidence": 0.92459863,
      "Display": "what is the weather like",
      "ITN": "what is the weather like",
      "Lexical": "what is the weather like",
      "MaskedITN": "what is the weather like"
    }
  ]
}

While training with speech data (audio + human-labeled transcripts), add your domain-specific words to the pronunciation file (pronunciation.txt), as shown below. (I could help further if you share your specific words.)

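The original screenshots are not reproduced here; as a rough sketch, a pronunciation file is a plain-text list with the display form and the spoken form on each line, separated by a tab (the entries below are illustrative placeholders, not the asker's words):

  3CPO    three c p o
  CNTK    c n t k
  IEEE    i triple e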

The response results are provided as JSON. Some of the fields you can refer to are:

Parameter: DisplayText

The recognized text after capitalization, punctuation, inverse text normalization (conversion of spoken text to shorter forms, such as 200 for "two hundred" or "Dr. Smith" for "doctor smith"), and profanity masking. Present only on success. When using the detailed format, DisplayText is provided as Display for each result in the NBest list.

The object in the NBest list can include:

Parameter: ITN

The inverse-text-normalized ("canonical") form of the recognized text, with phone numbers, numbers, abbreviations ("doctor smith" to "dr smith"), and other transformations applied.

Parameter: Display

The display form of the recognized text, with punctuation and capitalization added. This parameter is the same as DisplayText provided when format is set to simple.
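So, to get the capitalized form in your own code, read the Display field of the best NBest entry when using format=detailed (or DisplayText when using simple). A minimal sketch; the function name and variable names are mine:

  // Sketch: pick the capitalized/punctuated text out of a parsed recognition result.
  function getDisplayText(result) {
    if (result.RecognitionStatus !== 'Success') {
      return null;
    }
    if (result.DisplayText) {
      return result.DisplayText; // simple format
    }
    // Detailed format: choose the NBest hypothesis with the highest confidence.
    const best = [...result.NBest].sort(
      (a, b) => Number(b.Confidence) - Number(a.Confidence)
    )[0];
    return best.Display; // e.g. "What's the weather like?"
  }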

A typical response for recognition with pronunciation assessment:

{
  "RecognitionStatus": "Success",
  "Offset": "400000",
  "Duration": "11000000",
  "NBest": [
      {
        "Confidence" : "0.87",
        "Lexical" : "good morning",
        "ITN" : "good morning",
        "MaskedITN" : "good morning",
        "Display" : "Good morning.",
        "PronScore" : 84.4,
        "AccuracyScore" : 100.0,
        "FluencyScore" : 74.0,
        "CompletenessScore" : 100.0,
        "Words": [
            {
              "Word" : "Good",
              "AccuracyScore" : 100.0,
              "ErrorType" : "None",
              "Offset" : 500000,
              "Duration" : 2700000
            },
            {
              "Word" : "morning",
              "AccuracyScore" : 100.0,
              "ErrorType" : "None",
              "Offset" : 5300000,
              "Duration" : 900000
            }
        ]
      }
  ]
}
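The PronScore, AccuracyScore, FluencyScore, and CompletenessScore fields above come from pronunciation assessment. With the REST API, the assessment configuration, including the EnableMiscue flag mentioned earlier, is typically sent as a base64-encoded JSON value in a Pronunciation-Assessment request header; treat the field values below as illustrative assumptions:

  // Sketch: building a Pronunciation-Assessment header value (Node.js).
  const pronAssessmentConfig = {
    ReferenceText: 'good morning',  // text the speaker is expected to say
    GradingSystem: 'HundredMark',
    Granularity: 'Word',
    Dimension: 'Comprehensive',
    EnableMiscue: true              // flag inserted/omitted words as miscues
  };

  const pronAssessmentHeader = Buffer
    .from(JSON.stringify(pronAssessmentConfig), 'utf8')
    .toString('base64');

  // Add it to the recognition request alongside the subscription key header:
  // headers: { ..., 'Pronunciation-Assessment': pronAssessmentHeader }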

Refer: Speech-to-text REST API v3.0, Evaluate and improve Custom Speech accuracy, Train and deploy a Custom Speech model, and Text normalization for US English.
