Blob triggered function cannot read some excel files

Question

I have some azure function app that is supposed to be triggered by the blob. The idea is that every time something lands on the blob (those should only be excel files), the function runs and does some processing.

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes"
                 f"Returns:{myblob.read}")

    #read new data and replace nan with none values
    data = pd.read_excel(myblob)
    data = data.where(pd.notnull(data), None)

#processing

This code worked for me during testing. However, I have just received an edited file from someone else and got Exception: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbfName,'

In the end, this is meant to be used by more people who will upload these files so I have to make sure it works every time. However, I do not see any pattern here. It works for one spreadsheet and fails for another.

score 3 · Accepted Answer · answered Jun 11 '19 at 10:53

According to the pandas.read_excel offical document, as below, you can not use myblob: func.InputStream as its parameter io, because of the struct of the InputStream class of myblob and io should be a blob url with sas token or xlrd book.

So my solution is to read the content of myblob by its read method and convert it to a xlrd Book via xlrd.open_workbook method with file_contents parameter.

Here is my sample code.

import logging

import azure.functions as func

import pandas as pd
import xlrd

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")
    book = xlrd.open_workbook(file_contents=myblob.read())
    df = pd.read_excel(book)
    logging.info(f"{df}")

My sample xlsx file is as below.

And my local.settings.json, function.json & requirements.txt contents as below.

local.settings.json

{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsStorage": "DefaultEndpointsProtocol=https;AccountName=<your account name>;AccountKey=<your account key>;EndpointSuffix=core.windows.net"
  }
}

function.json

{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "myblob",
      "type": "blobTrigger",
      "direction": "in",
      "path": "<your container name>/{name}",
      "connection": "AzureWebJobsStorage"
    }
  ]
}

requirements.txt: just show my additional packages.

pandas==0.24.2
xlrd >= 1.0.0

It works. The results like:

Just curious, after doing some processing on the data frame, how do you then write the data frame into an excel onto an OutputStream? — Max, Jul 06 '20 at 10:03

score 0 · Answer 2 · answered Jun 10 '19 at 05:28

Could be 3 reasons:

As the error message says, that is definitely not Excel .xls format. Open it with a text editor (e.g. Notepad) that won't take any notice of the (incorrect) .xls extension and see for yourself.

The error message relates to the BOF (Beginning of File) record of an XLS file.

The case when the file is already open by Excel. It generates the same error.
read_excel , When you used read_excel to read a csv file.

Hope ti helps.

Blob triggered function cannot read some excel files

2 Answers2