
I am using the azure-ai-textanalytics library (version 5.2.7) for detecting PII in some text content that I have. According to the Azure documentation, the maximum number of characters allowed when using asynchronous processing is 125,000: https://learn.microsoft.com/en-us/azure/cognitive-services/language-service/concepts/data-limits

Using the azure library below is how I am constructing the asynchronous client:

 private static TextAnalyticsAsyncClient createTextClient() {
        if (textAnalyticsClient == null) {
            textAnalyticsClient = new TextAnalyticsClientBuilder()
                    .credential(new AzureKeyCredential(AzureKeyVaultConnector.readKeyValue("languageResourceKey")))
                    .endpoint(AzureKeyVaultConnector.readKeyValue("languageResourceEndPoint"))
                    .buildAsyncClient();
        }
        return textAnalyticsClient;
    }

And I submit the documents to process using the below line:

 RecognizePiiEntitiesResultCollection piiEntityCollection = createTextClient().recognizePiiEntitiesBatch(documents, "en", requestOptions).block();

When I test with a string of around 7,000 characters, I get the error below:

A document within the request was too large to be processed. Limit document size to: 5120 text elements. For additional details on the data limitations see https://aka.ms/text-analytics-data-limits

Why is it still limiting my document size to 5,120 characters? Shouldn't the limit be 125,000 since I am using the async client? Any help is appreciated.

I would love to achieve this with the azure-ai-textanalytics library rather than making direct HTTP calls.

acsam

1 Answer


Based on the scenario you have given, I reproduced the code and output using the sample async request code provided in the documentation.

The issue is that the 125,000-character limit applies to the total number of characters across all documents submitted in an asynchronous request, not to a single document. The maximum for a single document in asynchronous processing is still 5,120 characters.

For PII detection, the maximum number of documents per request is 5, as per the documentation you shared. So the total character limit per PII detection request is 5 × 5,120 = 25,600.

For example, if your request contains 2 documents within the 5,120-character limit and 3 documents exceeding it, the output will only contain results for the 2 documents under the limit.

Output: https://i.imgur.com/OUmrwvk.png

All 5 documents must be under the 5,120-character limit to get complete results.
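Because of that, it can help to check each document against the per-document limit before submitting a batch, so one oversized document doesn't fail the whole request. This is a minimal sketch; `oversizedDocuments` is a hypothetical helper (not part of the SDK), and it assumes `String.length()` is a close enough proxy for the service's "text element" count (exact for ASCII, approximate for text containing surrogate pairs or combining marks):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPreCheck {
    // Per-document limit from the error message above.
    static final int MAX_DOC_LEN = 5120;

    // Returns the indices of documents that exceed the per-document limit,
    // so the caller can split or drop them before calling the API.
    // Note: length() counts UTF-16 code units, which only approximates the
    // service's "text elements" for non-ASCII text.
    static List<Integer> oversizedDocuments(List<String> documents) {
        List<Integer> oversized = new ArrayList<>();
        for (int i = 0; i < documents.size(); i++) {
            if (documents.get(i).length() > MAX_DOC_LEN) {
                oversized.add(i);
            }
        }
        return oversized;
    }
}
```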

To process a larger document (more than 5,120 characters), you can break it into smaller chunks of text before sending them to the API. Below is a sample snippet that splits a single larger document into a list of chunks:

In Python,


 def split_string(string):
     # Split a string into chunks of at most 5120 characters each.
     strings = []
     for i in range(0, len(string), 5120):
         strings.append(string[i:i + 5120])
     return strings

In Java,

// Split the string into multiple strings, each of which is no more than 5120 characters long.
List<String> strings = new ArrayList<>();
for (int i = 0; i < originalString.length(); i += 5120) {
    strings.add(originalString.substring(i, Math.min(originalString.length(), i + 5120)));
}
RishabhM
  • Thank you @RishabhM. I am confused because if I use direct HTTP calls (following this link: https://learn.microsoft.com/en-us/rest/api/language/2022-05-01/text-analysis-runtime/submit-job?tabs=HTTP#piitaskparameters) to the REST endpoint https://ResourceName.cognitiveservices.azure.com/language/analyze-text/jobs/d18d1d1f-d3b9-493c-a214-f6bd5030edf8?api-version=2022-05-01 (followed by a GET call to get the job results), with the same document it extracts PII and gives me results. I am looking for a way to use the azure-ai-textanalytics library and not have HTTP calls directly in my implementation. – acsam Jun 08 '23 at 14:23
  • I had considered splitting the document as well, but the thought of the split happening exactly on PII data is scary. Even if I make sure the split happens only after a complete word, if it falls between 'firstname lastname', for example (or inside an address), I will lose meaningful data. I tested this, and the information extracted from split documents was not satisfactory (when an address was split, the model did not recognize it as an address at all). – acsam Jun 08 '23 at 14:36
  • I understand your concern, but based on your scenario, splitting is the way to go if you want to use the azure-ai-textanalytics library. You can try methods such as sentence boundary detection to maintain the context. – RishabhM Jun 09 '23 at 04:22
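To make the sentence-boundary suggestion in the last comment concrete, here is a sketch using the JDK's `java.text.BreakIterator`. It packs whole sentences into chunks of at most `maxLen` characters, so a split never lands inside a name or address; a single sentence longer than `maxLen` would still need a fallback split, which this sketch omits. The class and method names are illustrative, not part of any SDK:

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class SentenceChunker {
    // Split text into chunks of at most maxLen characters, cutting only at
    // sentence boundaries so entities like names and addresses stay intact.
    // A single sentence longer than maxLen is emitted as its own chunk.
    public static List<String> chunkBySentence(String text, int maxLen) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end);
            // Flush the current chunk if appending this sentence would overflow it.
            if (current.length() > 0 && current.length() + sentence.length() > maxLen) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            current.append(sentence);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

In practice you would call this with `maxLen = 5120` and pass the resulting list as the `documents` batch (at most 5 per request, per the limits above).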