
Trying to load the following parquet file results in an "Unexpected end of stream" error. Perhaps the parquet file is corrupted?

intended query

SELECT * FROM OPENROWSET(
        BULK N'/path/to/my.parquet',
        DATA_SOURCE='my_blob_datasource',
        FORMAT='PARQUET'
) as somehow_necessary_alias

error statement

Statement ID: {1F956C7F-FA37-4D34-AC0D-D4BDF8336160}
Query hash: 0x729ED47717032037
Distributed request ID: {5815A501-2179-4968-BE71-3B7AACFE717C}.
Total size of data scanned is 27 megabytes,
total size of data moved is 290 megabytes,
total size of data written is 0 megabytes.

Msg 15813, Level 16, State 1, Line 1 Error handling external file: 
'Unexpected end of stream'. File/External table name:
https://mystorageaccount.blob.core.windows.net/mycontainer/path/to/my.parquet
Anders Swanson

3 Answers


So we found the answer, though I couldn't put together a reproducible example that I could share; I'll give as much explanation as I can. The parquet file was created by pandas.to_parquet() using the pyarrow engine. There's currently an Arrow bug where an as-yet-undiagnosed pandas index quirk translates into an extra column named "__index_level_0__" being appended at the end of the file. I imagine this is what was surfacing the "Unexpected end of stream" error.
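To see whether a file picked up that extra column, you can read just the parquet schema with pyarrow. A minimal sketch, assuming pyarrow is installed (the file name my.parquet is only a placeholder):

import pyarrow.parquet as pq

# Read only the schema, not the data.
schema = pq.read_schema("my.parquet")
print(schema.names)            # e.g. ['a', 'b', '__index_level_0__']

# Pandas-specific metadata (if present) records which entries are index columns.
print(schema.pandas_metadata)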

That bug was first flagged as a pandas issue, but was then moved to an Arrow Jira ticket, which has a reproducible example.

However, our scenario was different in that we weren't using pandas.to_parquet()'s partition_cols param, so something else kooky was going on. Fortunately, the pandas docs touch on this by suggesting the following:

the code below creates a parquet file with three columns if you use pyarrow for serialization: a, b, and __index_level_0__. If you’re using fastparquet, the index may or may not be written to the file.

df.to_parquet("test.parquet", engine="pyarrow")

This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that column doesn’t exist in the target table.

If you want to omit a dataframe’s indexes when writing, pass index=False to to_parquet().
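As a sketch of that fix (the DataFrame here is just illustrative), writing with index=False and re-reading the schema shows the index column is gone:

import pandas as pd
import pyarrow.parquet as pq

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
df.to_parquet("test.parquet", engine="pyarrow", index=False)

print(pq.read_schema("test.parquet").names)  # ['a', 'b'] -- no __index_level_0__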

P.S. If you're using pandas to create parquet files for upload into Synapse, be sure to pass use_deprecated_int96_timestamps=True to to_parquet() so Synapse will properly parse your datetime columns. See this question for related info.
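For what it's worth, to_parquet() forwards extra keyword arguments to pyarrow's write_table(), so the flag can be passed straight through. A minimal sketch (the column name is made up):

import pandas as pd

df = pd.DataFrame({"event_time": pd.to_datetime(["2021-01-01", "2021-06-01"])})
df.to_parquet(
    "events.parquet",
    engine="pyarrow",
    index=False,
    use_deprecated_int96_timestamps=True,  # store timestamps as INT96 for Synapse
)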

Anders Swanson

I repro’d the scenario and was able to run the query successfully in the Azure Synapse Serverless SQL pool.

[screenshot: the query running successfully in the Synapse Serverless SQL pool]

Note: Make sure you have access to the file and that you are using a UTF-8 database collation.

You can go through this document for more details.

NiharikaMoola-MT

My issue was that the version of parquet.net I was using incorrectly wrote Int16 data (see issue here: https://github.com/aloneguid/parquet-dotnet/issues/16). I changed the property to just be an Int32 and the error went away.

David Faivre