2

I am using Azure Synapse to query a large number of CSV files with the OPENROWSET command see here. The files are located on a Data Lake gen 2 connected to the Azure Synapse via a managed identity.

This is working fine when I am only querying a few files at a time, however when I increase the number of files which I am trying to query simultaneousally I am getting the following error:

Azure Synapse: Cannot bulk load because the file <file> could not be opened. Operating system error code 12(The access code is invalid.)

Here <file> is a different file each time I run the query. If I navigate to the file in the linked data view I can download and view the file. Also if I specify to run the query on the file mentioned previousally in an error it will work fine.

The code I am using to query the data lake is below:

SELECT
            Parsed.*
            FROM OPENROWSET
            (
                bulk '2021/*/**.log',
                maxerrors = 2147483647,
                data_source = 'analytics',
                format = 'csv',
                fieldterminator ='0x0b',
                fieldquote = '0x0b'
            ) WITH (doc nvarchar(max)) AS Rows
            CROSS APPLY OPENJSON(Rows.doc)
            WITH
                (
                    col1 NVARCHAR(100),
                    col2 NVARCHAR(100),
                    ...,
                    coln NVARCHAR(MAX)
                ) AS Parsed

Here the data source, analytics is a data source specified as follows:

    CREATE EXTERNAL DATA SOURCE analytics
        WITH
        (
            location = 'https://<url>.dfs.core.windows.net/analytics'
        )

I have tried specifying a large number for the MAXERRORS parameter for BULK in OPENROWSET as I don't mind if only a few files are missed in executing this query, however this only seems to work at the row level for errors and these errors are at the file level.

The query is running here on the built-in serverless pool.

Any ideas on how to get around this issue would be appreciated.

William Moore
  • 3,844
  • 3
  • 23
  • 41

2 Answers2

2

Your code is doing pass through authentication to storage for whatever AAD user is connected to Synapse serverless (which will fail if you are using a SQL login). To use the MSI to connect to storage you will need a database scoped credential and will need to reference it in the external data source as in this example.


-- Optional: Create MASTER KEY if not exists in database:
-- CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<Very Strong Password>
CREATE DATABASE SCOPED CREDENTIAL SynapseIdentity
WITH IDENTITY = 'Managed Identity';
GO
CREATE EXTERNAL DATA SOURCE mysample
WITH (    LOCATION   = 'https://<storage_account>.dfs.core.windows.net/<container>/<path>',
          CREDENTIAL = SynapseIdentity
)

Also see the sections in that article about the firewall on your storage account if you have that locked down.

GregGalloway
  • 11,355
  • 3
  • 16
  • 47
2

Just adding a quick answer to this. After making the changes which Greg suggested, I was able to query a little more data - but still getting hit with the error code 12.

I spoke with Azure support and they advised me that the error message was actually 412 (but it was not visable to me); so this implied file in use/being modified. Adding the following allowed Azure Synapse to ignore this and query the file regardless:

ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'

or for external tables:

TABLE_OPTIONS = N'{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'

This made my final query:

SELECT
        Parsed.*
        FROM OPENROWSET
        (
            bulk '2021/*/**.log',
            maxerrors = 2147483647,
            data_source = 'analytics_master_key',
            format = 'csv',
            fieldterminator ='0x0b',
            fieldquote = '0x0b',
            ROWSET_OPTIONS = '{"READ_OPTIONS":["ALLOW_INCONSISTENT_READS"]}'
        ) WITH (doc nvarchar(max)) AS Rows
        CROSS APPLY OPENJSON(Rows.doc)
        WITH
            (
                col1 NVARCHAR(100),
                col2 NVARCHAR(100),
                ...,
                coln NVARCHAR(MAX)
            ) AS Parsed
William Moore
  • 3,844
  • 3
  • 23
  • 41