
Trying to load the following parquet file results in an "Unexpected end of stream" error. Perhaps the parquet file is corrupted?

intended query

SELECT * FROM OPENROWSET(
        BULK N'/path/to/my.parquet',
        DATA_SOURCE='my_blob_datasource',
        FORMAT='PARQUET'
) as somehow_necessary_alias

error statement

Statement ID: {1F956C7F-FA37-4D34-AC0D-D4BDF8336160}
Query hash: 0x729ED47717032037
Distributed request ID: {5815A501-2179-4968-BE71-3B7AACFE717C}.
Total size of data scanned is 27 megabytes,
total size of data moved is 290 megabytes,
total size of data written is 0 megabytes.

Msg 15813, Level 16, State 1, Line 1 Error handling external file: 
'Unexpected end of stream'. File/External table name:
https://mystorageaccount.blob.core.windows.net/mycontainer/path/to/my.parquet
Anders Swanson

3 Answers


So we found the answer, though I couldn't put together a reproducible example that I could share; I'll give as much explanation as I can. The parquet file was created by pandas.to_parquet() using the pyarrow engine. There's currently an Arrow bug where an as-yet-undiagnosed pandas index quirk translates into an extra column named "__index_level_0__" being appended at the end of the file. I imagine this is what was surfacing the "Unexpected end of stream" error.
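To see whether a file picked up that extra column, you can read just the parquet schema with pyarrow. A minimal sketch, assuming pyarrow is installed (the file name my.parquet is only a placeholder):

import pyarrow.parquet as pq

# Read only the schema, not the data.
schema = pq.read_schema("my.parquet")
print(schema.names)            # e.g. ['a', 'b', '__index_level_0__']

# Pandas-specific metadata (if present) records which entries are index columns.
print(schema.pandas_metadata)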

That bug was first flagged as a pandas issue, but was then moved to an Arrow Jira ticket, which has a reproducible example.

However, our scenario was different in that we weren't using pandas.to_parquet()'s partition_cols param, so something else kooky was going on. Fortunately, the pandas docs touch on this by suggesting the following:

the code below creates a parquet file with three columns if you use pyarrow for serialization: a, b, and __index_level_0__. If you’re using fastparquet, the index may or may not be written to the file.

df.to_parquet("test.parquet", engine="pyarrow")

This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that column doesn’t exist in the target table.

If you want to omit a dataframe’s indexes when writing, pass index=False to to_parquet().
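As a sketch of that fix (the DataFrame here is just illustrative), writing with index=False and re-reading the schema shows the index column is gone:

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df.to_parquet("test.parquet", engine="pyarrow", index=False)

print(pq.read_schema("test.parquet").names)  # ['a', 'b'] -- no __index_level_0__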

P.S. If you're using pandas to create parquet files for upload into Synapse, be sure to pass use_deprecated_int96_timestamps=True to to_parquet() so Synapse will properly parse your datetime columns. See this question for related info.
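For what it's worth, to_parquet() forwards extra keyword arguments to pyarrow's write_table(), so the flag can be passed straight through. A minimal sketch (the column name is made up):

import pandas as pd

df = pd.DataFrame({"event_time": pd.to_datetime(["2021-01-01", "2021-06-01"])})
df.to_parquet(
    "events.parquet",
    engine="pyarrow",
    index=False,
    use_deprecated_int96_timestamps=True,  # store timestamps as INT96 for Synapse
)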

Anders Swanson

I repro’d the scenario and was able to run the query successfully in the Azure Synapse Serverless SQL pool.

[screenshot: the query running successfully in the Synapse Serverless SQL pool]

Note: Make sure you have access to the file and that you are using a UTF-8 database collation.

You can go through this document for more details.

NiharikaMoola-MT

My issue was that the version of parquet.net I was using incorrectly wrote Int16 data (see issue here: https://github.com/aloneguid/parquet-dotnet/issues/16). I changed the property to just be an Int32 and the error went away.

David Faivre