On EMR I created a dataset in Parquet using Spark and stored it on S3. I can create an external table over it and query it with Hive, but when I try to run the same query with Presto I get an error (the part file it refers to changes on every run).
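
For context, the dataset is written roughly like this - a sketch only, since the problem is on the read path; the input source and app name are placeholders, not my actual job:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class WriteParquet {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder()
                    .appName("write-parquet")   // placeholder name
                    .getOrCreate();
            // Placeholder input; the real job builds the dataset from my own data.
            Dataset<Row> df = spark.read().json("s3://my_bucket/input/");
            // Snappy-compressed Parquet, matching the part file names in the error below.
            df.write().option("compression", "snappy").parquet("s3://my_bucket/my_table/");
            spark.stop();
        }
    }

The full error follows: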

2016-11-13T13:11:15.165Z        ERROR   remote-task-callback-36 com.facebook.presto.execution.StageStateMachine Stage 20161113_131114_00004_yp8y5.1 failed
com.facebook.presto.spi.PrestoException: Error opening Hive split s3://my_bucket/my_table/part-r-00013-b17b4495-f407-49e0-9d15-41bb0b68c605.snappy.parquet (offset=1100508800, length=68781800): null
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:475)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.<init>(ParquetHiveRecordCursor.java:247)
    at com.facebook.presto.hive.parquet.ParquetRecordCursorProvider.createHiveRecordCursor(ParquetRecordCursorProvider.java:96)
    at com.facebook.presto.hive.HivePageSourceProvider.getHiveRecordCursor(HivePageSourceProvider.java:129)
    at com.facebook.presto.hive.HivePageSourceProvider.createPageSource(HivePageSourceProvider.java:107)
    at com.facebook.presto.spi.connector.classloader.ClassLoaderSafeConnectorPageSourceProvider.createPageSource(ClassLoaderSafeConnectorPageSourceProvider.java:44)
    at com.facebook.presto.split.PageSourceManager.createPageSource(PageSourceManager.java:48)
    at com.facebook.presto.operator.TableScanOperator.createSourceIfNecessary(TableScanOperator.java:268)
    at com.facebook.presto.operator.TableScanOperator.isFinished(TableScanOperator.java:210)
    at com.facebook.presto.operator.Driver.processInternal(Driver.java:375)
    at com.facebook.presto.operator.Driver.processFor(Driver.java:301)
    at com.facebook.presto.execution.SqlTaskExecution$DriverSplitRunner.processFor(SqlTaskExecution.java:622)
    at com.facebook.presto.execution.TaskExecutor$PrioritizedSplitRunner.process(TaskExecutor.java:529)
    at com.facebook.presto.execution.TaskExecutor$Runner.run(TaskExecutor.java:665)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.EOFException
    at java.io.DataInputStream.readFully(DataInputStream.java:197)
    at java.io.DataInputStream.readFully(DataInputStream.java:169)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:420)
    at parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:385)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.lambda$createParquetRecordReader$0(ParquetHiveRecordCursor.java:416)
    at com.facebook.presto.hive.authentication.NoHdfsAuthentication.doAs(NoHdfsAuthentication.java:23)
    at com.facebook.presto.hive.HdfsEnvironment.doAs(HdfsEnvironment.java:76)
    at com.facebook.presto.hive.parquet.ParquetHiveRecordCursor.createParquetRecordReader(ParquetHiveRecordCursor.java:416)
    ... 16 more

The Parquet dataset consists of 128 part files; the data is stored on S3 and encrypted using client-side encryption with KMS. Presto uses a custom encryption-materials provider (specified via presto.s3.encryption-materials-provider) that simply returns a KMSEncryptionMaterials object initialized with my master key. I am using EMR 5.1.0 (Hive 2.1.0, Spark 2.0.1, Presto 0.152.3).
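
Sketched against the AWS Java SDK interface, the provider looks roughly like this (the key ARN is a placeholder, and this is a minimal sketch rather than my exact class):

    import com.amazonaws.services.s3.model.EncryptionMaterials;
    import com.amazonaws.services.s3.model.EncryptionMaterialsProvider;
    import com.amazonaws.services.s3.model.KMSEncryptionMaterials;

    import java.util.Map;

    public class KmsMaterialsProvider implements EncryptionMaterialsProvider {
        // Placeholder; the real master key id is not shown here.
        private static final String MASTER_KEY_ID = "arn:aws:kms:...:key/...";

        private final EncryptionMaterials materials =
                new KMSEncryptionMaterials(MASTER_KEY_ID);

        @Override
        public EncryptionMaterials getEncryptionMaterials(Map<String, String> materialsDescription) {
            return materials;
        }

        @Override
        public EncryptionMaterials getEncryptionMaterials() {
            return materials;
        }

        @Override
        public void refresh() {
            // Nothing to refresh; the materials are static.
        }
    }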

Sebastiano Merlino

1 Answer

Does this surface when encryption is turned off?

There was a bug report filed against the ASF s3a client (not the EMR one) where things were breaking when the length returned by a filesystem listing didn't match the actual file length. That is: because of the encryption, the file length in a listing was greater than the length available in a read.

We couldn't reproduce this in our tests, and our conclusion anyway was "filesystems must not do that" (indeed, it's a fundamental requirement of the Hadoop FS spec: listed length must equal actual length). If the EMR code is getting this wrong, then it's something in their driver which the downstream code cannot be expected to handle.
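
To show why an over-reported length becomes the EOFException in the trace above: Parquet locates its footer relative to the length the filesystem reports, so a listing that over-states the size makes the reader read past the real end of the data. A minimal sketch, modelled on parquet's readFooter rather than copied from it; the length is taken from the failing split above (offset + length = 1169290600):

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.RandomAccessFile;

    public class FooterReadDemo {
        // A Parquet file ends with: [footer][4-byte footer length][4-byte magic "PAR1"],
        // so readers seek to (fileLength - 8) to find the footer - using the listed length.
        public static void main(String[] args) throws IOException {
            long listedLength = 1_169_290_600L; // offset + length of the failing split
            try (RandomAccessFile in = new RandomAccessFile(args[0], "r")) {
                in.seek(listedLength - 8);  // seeking past EOF does not fail by itself
                byte[] trailer = new byte[8];
                in.readFully(trailer);      // EOFException when the real file is shorter
            } catch (EOFException e) {
                System.out.println("listed length exceeds real length: " + e);
            }
        }
    }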

stevel
  • It works with unencrypted objects - you might be on the right track - since this is a comment I find in Presto code: // NOTE: for encrypted objects, S3ObjectSummary.size() used below is NOT correct, // however, to get the correct size we'd need to make an additional request to get // user metadata, and in this case it doesn't matter. – Sebastiano Merlino Nov 16 '16 at 00:52
  • How do I turn off encryption? – Rodrigo Ney Nov 17 '16 at 21:20
  • Files were missing the "x-amz-unencrypted-content-length" metadata value. Presto needs this to be set to work properly with files encrypted using CSE. – Sebastiano Merlino Nov 21 '16 at 02:58
  • So the real size is listed in the header but not anything returned in the listing operation? That's trouble, deep trouble, as in "things shouldn't work like that". – stevel Nov 22 '16 at 20:43
  • Yes, though in a bit of a tricky way - once the files are written to S3 in Hadoop style by EMRFS, my script downloads the parts (implicitly decrypting them) and re-uploads them, setting the header "x-amz-unencrypted-content-length" (in reality I issue an S3 copy over the same location but set the header, to save the upload time). – Sebastiano Merlino Jan 27 '17 at 08:03
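
The in-place copy described in the last comment can be sketched like this with the AWS Java SDK - an illustrative sketch, assuming the plaintext length has already been measured separately (e.g. during the decrypted download); the bucket and key are taken from the error above:

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.CopyObjectRequest;
    import com.amazonaws.services.s3.model.ObjectMetadata;

    public class FixUnencryptedLength {
        public static void main(String[] args) {
            String bucket = "my_bucket"; // from the question
            String key = "my_table/part-r-00013-b17b4495-f407-49e0-9d15-41bb0b68c605.snappy.parquet";
            long plaintextLength = Long.parseLong(args[0]); // measured after decrypting

            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            // Clone the existing metadata so content type etc. survive the replacement.
            ObjectMetadata meta = s3.getObjectMetadata(bucket, key).clone();
            meta.addUserMetadata("x-amz-unencrypted-content-length",
                    Long.toString(plaintextLength));

            // Copying the object onto itself with new metadata rewrites the
            // metadata server-side, avoiding a full re-upload of the data.
            s3.copyObject(new CopyObjectRequest(bucket, key, bucket, key)
                    .withNewObjectMetadata(meta));
        }
    }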