I'm getting mixed results with the Azure KeyPhrase API: some requests succeed (by that I mean a 200 response) and others fail with 400 Bad Request. To test the service, I'm sending the contents of an Azure PDF about their NoSQL service.

The documentation says that each document may be up to 5k characters. I started off with 5k, but to rule the limit out I'm now capping each document at 1k characters.

How can I get more info on the cause of the failure? I've already checked the Portal, but there's not much detail there.

I am using this endpoint: https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases

Some sample failures:

  • {"documents":[{"language":"en","id":1,"text":"David Chappell Understanding NoSQL on Microsoft Azure Sponsored by Microsoft Corporation Copyright © 2014 Chappell & Associates"}]}

  • {"documents":[{"language":"en","id":1,"text":"3 Relational technology has been the dominant approach to working with data for decades. Typically accessed using Structured Query Language (SQL), relational databases are incredibly useful. And as their popularity suggests, they can be applied in many different situations. But relational technology isn’t always the best approach. Suppose you need to work with very large amounts of data, for example, too much to store on a single machine. Scaling relational technology to work effectively across many servers (physical or virtual) can be challenging. Or suppose your application works with data that’s not a natural fit for relational systems, such as JavaScript Object Notation (JSON) documents. Shoehorning the data into relational tables is possible, but a storage technology expressly designed to work with this kind of information might be simpler. NoSQL technologies have been created to address problems like these. As the name suggests, the label encompasses a variety of storage"}]}

Added my quick and dirty PoC code:

List<string> sendRequest(object data)
    {
        string url = "https://eastus.api.cognitive.microsoft.com/text/analytics/v2.0/keyPhrases";
        string key = "api-code-here";
        string hdr = "Ocp-Apim-Subscription-Key";
        var wc = new WebClient();
        wc.Headers.Add(hdr, key);
        wc.Headers.Add(HttpRequestHeader.ContentType, "application/json");

        TextAnalyticsResult results = null;

        string json = JsonConvert.SerializeObject(data);
        try
        {
            var bytes = Encoding.Default.GetBytes(json);
            var d2 = wc.UploadData(url, bytes);
            var dataString = Encoding.Default.GetString(d2);
            results = JsonConvert.DeserializeObject<TextAnalyticsResult>(dataString);                
        }
        catch (Exception ex)
        {
            var s = ex.Message;
        }
        System.Threading.Thread.Sleep(125);

        if (results != null && results.documents != null)
            return results.documents.SelectMany(x => x.keyPhrases).ToList();
        else
            return new List<string>();
    }

Called by:

foreach (var k in vals)
        {
            data.documents.Clear();
            int countSpaces = k.Count(Char.IsWhiteSpace);
            if (countSpaces > 3)
            {
                if (k.Length > maxLen)
                {
                    var v = k;
                    while (v.Length > maxLen)
                    {
                        var tmp = v.Substring(0, maxLen);
                        var idx = tmp.LastIndexOf(" ");
                        tmp = tmp.Substring(0, idx).Trim();
                        data.documents.Add(new
                        {
                            language = "en",
                            id = data.documents.Count() + 1,
                            text = tmp
                        });
                        v = v.Substring(idx + 1).Trim();

                        phrases.AddRange(sendRequest(data));
                        data.documents.Clear();
                    }

                    data.documents.Add(new
                    {
                        language = "en",
                        id = data.documents.Count() + 1,
                        text = v
                    });
                    phrases.AddRange(sendRequest(data));
                    data.documents.Clear();
                }
                else
                {
                    data.documents.Add(new
                    {
                        language = "en",
                        id = 1,
                        text = k
                    });

                    phrases.AddRange(sendRequest(data));
                    data.documents.Clear();
                };
            }             
        }
Maria Ines Parnisari

2 Answers

3

I manually created some requests using the document samples that you indicated had errors and they were processed by the service correctly and returned key phrases. So an encoding issue looks likely.

In the future, you can also look at the inner error returned by the service. Usually you'll see some more details like in the response sample below.

{
  "code": "BadRequest",
  "message": "Invalid request",
  "innerError": {
    "code": "InvalidRequestContent",
    "message": "Request contains duplicated Ids. Make sure each document has a unique Id."
  }
}
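With `WebClient`, a non-2xx status surfaces as a `WebException`, and an error document like the one above travels in the body of the exception's `Response` rather than in `Message` or `StatusDescription`. A minimal sketch of reading it, assuming the body is UTF-8 text; `ErrorBodyReader` and `ReadErrorBody` are hypothetical helper names, not part of any SDK:

```csharp
using System;
using System.IO;
using System.Net;
using System.Text;

static class ErrorBodyReader
{
    // Returns the response body carried by a WebException, or null when the
    // failure happened before any HTTP response arrived (DNS error, timeout).
    public static string ReadErrorBody(WebException ex)
    {
        if (ex.Response == null)
            return null;

        using (Stream stream = ex.Response.GetResponseStream())
        {
            if (stream == null)
                return null;

            // Assumption: the service writes its error JSON as UTF-8.
            using (var reader = new StreamReader(stream, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }
    }
}
```

Calling `ErrorBodyReader.ReadErrorBody(ex)` from a `catch (WebException ex)` block around `wc.UploadData(url, bytes)` should expose the `innerError` text for a 400, provided the service sent one.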

Also, there is a .NET SDK for Text Analytics that can help simplify calling the service. https://github.com/Azure/azure-rest-api-specs/tree/current/specification/cognitiveservices/data-plane/TextAnalytics

  • I don't see that information in the WebException object; StatusDescription is just "Bad Request". I would love to get the error description so that I can fix it on my side. As mentioned in the other post - if I submit individually, no problem - only when it's a batch of multiple documents. I verified that I have unique ids for each document. – Carol AndorMarten Liebster Jan 03 '18 at 17:35
  • FYI - The exception is thrown on: var d2 = wc.UploadData(url, bytes); and is caught in my catch. – Carol AndorMarten Liebster Jan 03 '18 at 17:38
  • I'd suggest you confirm that your multi-document json request payload is valid using the CognitiveServices console here: https://eastus.dev.cognitive.microsoft.com/docs/services/TextAnalytics.V2.0/operations/56f30ceeeda5650db055a3c6/console – Brian Smith - MSFT Jan 04 '18 at 00:34
1

Try changing this line

var bytes = Encoding.Default.GetBytes(json);

to

var bytes = Encoding.UTF8.GetBytes(json);
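
Both failing samples in the question contain non-ASCII characters (© in the first, ’ in the second). On .NET Framework, `Encoding.Default` is the system's ANSI code page, which writes such characters as single bytes that are not valid UTF-8; on .NET Core and later it is already UTF-8. A small sketch of the difference, using ISO-8859-1 as a stand-in for a single-byte ANSI code page (an assumption, since the actual code page depends on the machine); `EncodingDemo` and `ToHex` are hypothetical names:

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hex dump of the bytes that the given encoding produces for the text.
    public static string ToHex(Encoding encoding, string text)
    {
        return BitConverter.ToString(encoding.GetBytes(text));
    }
}
```

`EncodingDemo.ToHex(Encoding.GetEncoding("iso-8859-1"), "\u00A9")` gives `A9`, while `EncodingDemo.ToHex(Encoding.UTF8, "\u00A9")` gives `C2-A9`. A lone `A9` byte is not a valid UTF-8 sequence, which would explain why only the texts with such characters are rejected.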
Maria Ines Parnisari
  • That change helped. The PDF I am processing has 15 paragraphs - if I send them all in 1 request, as 15 documents, each with less than 5k characters, it returns a 400 again. However, if I send them as 15 separate requests (1 document/request) each request is processed ok. – Carol AndorMarten Liebster Jan 03 '18 at 15:04
  • Apparently there was an issue with my id value: was adding it this way: **id = data.documents.Count() + 1**. When changed to **id = rnd.Next(1, 1000) + data.documents.Count() + 1** the batch submission worked ok. I am simply looping through a list to send them, no concurrent processing. – Carol AndorMarten Liebster Jan 03 '18 at 17:45
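
A random offset can still produce the same id twice within one batch, so a deterministic numbering may be safer. The following is only a sketch: `BatchIds`, `BuildDocuments` and `HasDuplicateIds` are hypothetical names, and emitting ids as strings is an assumption (the code in the question sent integers).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BatchIds
{
    // Builds one document per text and numbers them 1..n, so ids cannot
    // repeat within the batch.
    public static List<Dictionary<string, string>> BuildDocuments(
        IEnumerable<string> texts, string language = "en")
    {
        return texts
            .Select((text, index) => new Dictionary<string, string>
            {
                ["language"] = language,
                ["id"] = (index + 1).ToString(), // assumption: string ids
                ["text"] = text
            })
            .ToList();
    }

    // True when at least two documents in the batch share an id.
    public static bool HasDuplicateIds(
        IEnumerable<Dictionary<string, string>> documents)
    {
        var seen = new HashSet<string>();
        return documents.Any(d => !seen.Add(d["id"]));
    }
}
```

Wrapping the result as `new { documents = BatchIds.BuildDocuments(paragraphs) }` and checking `HasDuplicateIds` before serializing gives a local guard against the "duplicated Ids" error; `paragraphs` here stands for whatever list of texts is being sent.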