I am trying to use the Microsoft Azure Form Recognizer API to upload invoice PDFs and extract the table information from them.

I was able to make a successful POST request.

However, training the model fails with the error 'No valid blobs found in the specified Azure blob container. Please conform to the document format/size/page/dimensions requirements.'

I have more than 5 files in the blob storage container.

I have also provided the shared key for the blob container. The code I wrote and the resulting error are below.

"""
Created on Thu Feb 20 16:22:41 2020

@author: welcome
"""

########## Python Form Recognizer Labeled Async Train #############
import json
import time
from requests import get, post

# Endpoint URL
endpoint = r"https://sctesting.cognitiveservices.azure.com"
post_url = endpoint + r"/formrecognizer/v2.0-preview/custom/models"
print(post_url)
source = '<source url from blob storage>'
prefix = "name of the folder"
includeSubFolders = False
useLabelFile = False

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': '<key>',
}

body =  {
    "source": source,
    "sourceFilter": {
        "prefix": prefix,
        "includeSubFolders": includeSubFolders
    },
    "useLabelFile": useLabelFile
}

try:
    resp = post(url = post_url, json = body, headers = headers)
    if resp.status_code != 201:
        print("POST model failed (%s):\n%s" % (resp.status_code, json.dumps(resp.json())))
        quit()
    print("POST model succeeded:\n%s" % resp.headers)
    get_url = resp.headers["location"]
except Exception as e:
    print("POST model failed:\n%s" % str(e))
    quit() 


n_tries = 15
n_try = 0
wait_sec = 3
max_wait_sec = 60
while n_try < n_tries:
    try:
        resp = get(url = get_url, headers = headers)
        resp_json = resp.json()
        if resp.status_code != 200:
            print("GET model failed (%s):\n%s" % (resp.status_code, json.dumps(resp_json)))
            quit()
        model_status = resp_json["modelInfo"]["status"]
        if model_status == "ready":
            print("Training succeeded:\n%s" % json.dumps(resp_json))
            quit()
        if model_status == "invalid":
            print("Training failed. Model is invalid:\n%s" % json.dumps(resp_json))
            quit()
        # Training still running. Wait and retry.
        time.sleep(wait_sec)
        n_try += 1
        wait_sec = min(2*wait_sec, max_wait_sec)     
    except Exception as e:
        msg = "GET model failed:\n%s" % str(e)
        print(msg)
        quit()
print("Train operation did not complete within the allocated time.")

Output from running the above code in the Anaconda prompt:

 POST model succeeded:
{'Content-Length': '0', 'Location': 'https://sctesting.cognitiveservices.azure.com/formrecognizer/v2.0-preview/custom/models/30b7d99b-fc57-466d-a59b-c0d9738c03ac', 'x-envoy-upstream-service-time': '379', 'apim-request-id': '18cbec13-8129-45de-8685-83554e8b35d4', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'x-content-type-options': 'nosniff', 'Date': 'Thu, 20 Feb 2020 19:35:47 GMT'}
Training failed. Model is invalid:
{"modelInfo": {"modelId": "30b7d99b-fc57-466d-a59b-c0d9738c03ac", "status": "invalid", "createdDateTime": "2020-02-20T19:35:48Z", "lastUpdatedDateTime": "2020-02-20T19:35:50Z"}, "trainResult": {"trainingDocuments": [], "errors": [{"code": "2014", "message": "No valid blobs found in the specified Azure blob container. Please conform to the document format/size/page/dimensions requirements."}]}}      
Adith
  • Have you specified the `source` variable? I am assuming you removed it from your code snippet for security reasons, right? Otherwise, @sebastian's answer seems to be valid. – Ahmadreza May 26 '20 at 01:40

4 Answers

If you use the Form Recognizer labeling tool to do the same thing, does that work? Have you put the files in the root directory of the Azure blob container, or in a sub-directory?

Xin Zou
  • In a sub-directory, @Xin Zou. – Adith Feb 21 '20 at 05:57
  • If your files are in a sub-directory but your code has the line `includeSubFolders = False`, you need to set it to True and make sure the prefix is correct. – Xin Zou Feb 21 '20 at 22:21
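
A minimal sketch of the change suggested in the comment above, reusing the field names from the question's request body. `build_train_body` is a hypothetical helper, not part of any SDK, and the source URL and folder name are the question's placeholders rather than real values.

```python
def build_train_body(source, prefix="", include_subfolders=True, use_label_file=False):
    """Build the JSON body for the train request shown in the question.

    `source` is the container SAS URL and `prefix` is the folder path inside
    the container. `include_subfolders` defaults to True here so documents
    below the prefix are also picked up.
    """
    return {
        "source": source,
        "sourceFilter": {
            "prefix": prefix,
            "includeSubFolders": include_subfolders,
        },
        "useLabelFile": use_label_file,
    }


# Placeholder values copied from the question; replace them with your own.
body = build_train_body("<source url from blob storage>", prefix="name of the folder")
```

The resulting `body` can be passed to `post(url=post_url, json=body, headers=headers)` exactly as in the question's script.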

Make sure that the files in your blob storage container fit the requirements here: https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/overview#custom-model

If your files look fine, also check what kind of SAS token you are using. This error can occur if you are using a policy-defined SAS token; in that case, try switching to a SAS token with explicit permissions, as detailed here: https://stackoverflow.com/a/60235222/12822344
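
One way to check which kind of SAS token you have is to look at its query string. The sketch below assumes the standard SAS parameters: `sp` carries explicit permissions and `si` is present when the token refers to a stored access policy. `describe_sas` is a hypothetical helper, the example URL is made up, and the assumption that training wants Read (`r`) and List (`l`) should be verified against the current documentation.

```python
from urllib.parse import urlsplit, parse_qs


def describe_sas(sas_url):
    """Summarise the permission-related query parameters of a SAS URL."""
    query = parse_qs(urlsplit(sas_url).query)
    # "sp" holds the explicit permissions, for example "rl" for read + list.
    permissions = query.get("sp", [""])[0]
    return {
        # "si" is only present when the token is tied to a stored access policy.
        "uses_stored_policy": "si" in query,
        "has_read": "r" in permissions,
        "has_list": "l" in permissions,
    }


# Made-up URL for illustration; the signature is a placeholder.
example = "https://account.blob.core.windows.net/train?sv=2019-02-02&sr=c&sp=rl&sig=placeholder"
info = describe_sas(example)
print(info)
```

If `uses_stored_policy` is true, or either permission flag is false, regenerating the token with explicit permissions is the first thing to try.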

Lynsey

You didn't specify a source. You need to generate a Shared Access Signature (SAS) from the menu of the selected storage account. If your files are in a container in that storage account, you'll need to include the container name in the URL. For example, if you have a container named "train": "www....windows.net/?sv=....." becomes "www....windows.net/train?sv=......". Otherwise you can try using the "prefix" string, but I found it buggy.

Also, you have not included your subscription key.

https://learn.microsoft.com/en-us/azure/cognitive-services/form-recognizer/quickstarts/python-train-extract
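
The URL rewrite described above can be sketched with the standard library. `with_container` is a hypothetical helper, and the account name, container name, and signature below are placeholders, not values from the question.

```python
from urllib.parse import urlsplit, urlunsplit


def with_container(account_sas_url, container):
    """Return the SAS URL with the container name set as its path."""
    parts = urlsplit(account_sas_url)
    path = "/" + container.strip("/")
    # Keep the scheme, host, and SAS query string unchanged; only the path changes.
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


# Placeholder account-level SAS URL and container name.
source = with_container(
    "https://account.blob.core.windows.net/?sv=2019-02-02&sig=placeholder", "train"
)
print(source)
```

The result is what would go into the `source` field of the train request.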

Try removing the Source Filter from the body. It should work.
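
As a rough sketch of that suggestion, the helper below drops the `sourceFilter` key from a body shaped like the one in the question. `without_source_filter` is a hypothetical name, and whether the service accepts a body without the filter is this answer's claim rather than something verified here.

```python
def without_source_filter(body):
    """Return a copy of the train request body with "sourceFilter" removed."""
    return {key: value for key, value in body.items() if key != "sourceFilter"}


# Body shaped like the one in the question, with its placeholder values.
body = {
    "source": "<source url from blob storage>",
    "sourceFilter": {"prefix": "name of the folder", "includeSubFolders": False},
    "useLabelFile": False,
}
trimmed = without_source_filter(body)
print(trimmed)
```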

Stevy