0

I have some azure function app that is supposed to be triggered by the blob. The idea is that every time something lands on the blob (those should only be excel files), the function runs and does some processing.

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes"
                 f"Returns:{myblob.read}")

    #read new data and replace nan with none values
    data = pd.read_excel(myblob)
    data = data.where(pd.notnull(data), None)

#processing

This code worked for me during testing. However, I have just received an edited file from someone else and got Exception: XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\xef\xbb\xbfName,'

In the end, this is meant to be used by more people who will upload these files so I have to make sure it works every time. However, I do not see any pattern here. It works for one spreadsheet and fails for another.

TrebledJ
  • 8,713
  • 7
  • 26
  • 48
Grevioos
  • 355
  • 5
  • 30

2 Answers2

3

According to the pandas.read_excel offical document, as below, you can not use myblob: func.InputStream as its parameter io, because of the struct of the InputStream class of myblob and io should be a blob url with sas token or xlrd book.

enter image description here

enter image description here

So my solution is to read the content of myblob by its read method and convert it to a xlrd Book via xlrd.open_workbook method with file_contents parameter.

Here is my sample code.

import logging

import azure.functions as func

import pandas as pd
import xlrd

def main(myblob: func.InputStream):
    logging.info(f"Python blob trigger function processed blob \n"
                 f"Name: {myblob.name}\n"
                 f"Blob Size: {myblob.length} bytes")
    book = xlrd.open_workbook(file_contents=myblob.read())
    df = pd.read_excel(book)
    logging.info(f"{df}")

My sample xlsx file is as below.

enter image description here

And my local.settings.json, function.json & requirements.txt contents as below.

local.settings.json

{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "AzureWebJobsStorage": "DefaultEndpointsProtocol=https;AccountName=<your account name>;AccountKey=<your account key>;EndpointSuffix=core.windows.net"
  }
}

function.json

{
  "scriptFile": "__init__.py",
  "bindings": [
    {
      "name": "myblob",
      "type": "blobTrigger",
      "direction": "in",
      "path": "<your container name>/{name}",
      "connection": "AzureWebJobsStorage"
    }
  ]
}

requirements.txt: just show my additional packages.

pandas==0.24.2
xlrd >= 1.0.0

It works. The results like:

enter image description here

Peter Pan
  • 23,476
  • 4
  • 25
  • 43
  • 1
    Just curious, after doing some processing on the data frame, how do you then write the data frame into an excel onto an OutputStream? – Max Jul 06 '20 at 10:03
0

Could be 3 reasons:

  • As the error message says, that is definitely not Excel .xls format. Open it with a text editor (e.g. Notepad) that won't take any notice of the (incorrect) .xls extension and see for yourself.

The error message relates to the BOF (Beginning of File) record of an XLS file.

  • The case when the file is already open by Excel. It generates the same error.

  • read_excel , When you used read_excel to read a csv file.

Hope ti helps.

Mohit Verma
  • 5,140
  • 2
  • 12
  • 27