What's the data format of the .csv.metadata files written by Amazon Athena?

Alongside the output file of every query there is a metadata file. It looks like it describes the schema of the result. I assume this is what Athena uses to create the ResultSet.ResultSetMetadata part of the response of GetQueryResults requests, and that it is somehow created by Hive or Presto.

2019-04-23 14:51:29         27 e7629796-9b91-476a-bfb7-2fe6c9595bce.csv
2019-04-23 14:51:29         56 e7629796-9b91-476a-bfb7-2fe6c9595bce.csv.metadata
2019-04-27 14:23:53    1591958 ebe432ac-db7b-4ea1-b5de-529350d1a02a.csv
2019-04-27 14:23:53        712 ebe432ac-db7b-4ea1-b5de-529350d1a02a.csv.metadata
2019-04-25 16:31:23      10152 eeb6f4ab-9ac3-4a7e-81c4-0cc155187acb.csv
2019-04-25 16:31:23        494 eeb6f4ab-9ac3-4a7e-81c4-0cc155187acb.csv.metadata
2019-04-25 22:30:56   22384376 f0160ff7-e5b3-466d-926a-a660a5208c5f.csv
2019-04-25 22:30:56        494 f0160ff7-e5b3-466d-926a-a660a5208c5f.csv.metadata

Here's a hexdump of e7629796-9b91-476a-bfb7-2fe6c9595bce.csv.metadata from the listing above:

00000000  0a 1b 32 30 31 39 30 34  32 33 5f 31 32 35 31 32  |..20190423_12512|
00000010  38 5f 30 30 30 30 31 5f  65 68 74 75 72 22 19 0a  |8_00001_ehtur"..|
00000020  04 68 69 76 65 22 03 61  72 79 2a 03 61 72 79 32  |.hive".ary*.ary2|
00000030  05 61 72 72 61 79 48 03                           |.arrayH.|

The ResultSet.ResultSetMetadata for the same query looks like this:

"ResultSetMetadata": {
  "ColumnInfo": [
    {
      "CatalogName": "hive",
      "SchemaName": "",
      "TableName": "",
      "Name": "ary",
      "Label": "ary",
      "Type": "array",
      "Precision": 0,
      "Scale": 0,
      "Nullable": "UNKNOWN",
      "CaseSensitive": false
    }
  ]
}

I realise that these files are internal to Athena, but I'm curious.

Theo
  • I have reverse engineered the format; here's a parser: https://gist.github.com/iconara/4969c247e8abb69600cdbe6f4b20f50b – however, I would still like to know if there is a real answer to this question, and if the assumptions I've made in my parser are correct. – Theo May 13 '19 at 07:57
  • Some more research indicates that the format is in fact Protocol Buffers. – Theo May 13 '19 at 14:38
  • Seems to be a protobuf-encoded version of presto-jdbc/src/main/java/com/facebook/presto/jdbc/PrestoResultSetMetaData.java and presto-jdbc/src/main/java/com/facebook/presto/jdbc/ColumnInfo.java – nijave Apr 09 '21 at 19:47
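
Following up on the Protocol Buffers observation in the comments, here is a minimal sketch that reads the 56-byte hexdump from the question as generic protobuf wire format. It is not Athena's actual schema, which is not published here; the helper names `read_varint` and `parse_fields` are hypothetical, only the varint and length-delimited wire types are handled, and treating field 4 as a nested column message is an assumption based on this one sample.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear means this was the last byte
            return result, pos
        shift += 7


def parse_fields(buf):
    """Split a protobuf message into (field_number, wire_type, value) tuples."""
    fields = []
    pos = 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        field_no, wire_type = key >> 3, key & 0x07
        if wire_type == 0:  # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:  # length-delimited: string, bytes, or nested message
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} is not handled in this sketch")
        fields.append((field_no, wire_type, value))
    return fields


# The bytes of e7629796-9b91-476a-bfb7-2fe6c9595bce.csv.metadata, copied from the hexdump.
DUMP = bytes.fromhex(
    "0a 1b 32 30 31 39 30 34 32 33 5f 31 32 35 31 32"
    "38 5f 30 30 30 30 31 5f 65 68 74 75 72 22 19 0a"
    "04 68 69 76 65 22 03 61 72 79 2a 03 61 72 79 32"
    "05 61 72 72 61 79 48 03"
)

top = parse_fields(DUMP)
for field_no, wire_type, value in top:
    print("top-level field", field_no, value)

# Assumption: top-level field 4 is itself a message describing one column.
column = parse_fields(top[1][2])
for field_no, wire_type, value in column:
    print("  column field", field_no, value)
```

Read this way, top-level field 1 holds the string 20190423_125128_00001_ehtur, and the nested message in field 4 holds hive, ary, ary and array in fields 1, 4, 5 and 6, which line up with CatalogName, Name, Label and Type in the ResultSetMetadata above. Field 9 holds the varint 3, whose meaning cannot be determined from this sample alone.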

1 Answer

The metadata files are stored in a binary format that is not human readable, and they are intended for use by Athena itself.

From AWS documentation:

DML and DDL query metadata files are saved in binary format and are not human readable. The file extension corresponds to the related query results file. Athena uses the metadata when reading query results using the GetQueryResults action. Although these files can be deleted, we do not recommend it because important information about the query is lost.

For more details, see the "Identifying query output files" section at https://docs.aws.amazon.com/athena/latest/ug/querying.html

Ash
  • > I realise that these files are internal to Athena, but I'm curious. – Theo Apr 29 '22 at 09:41
  • Curious about what? Why it's in binary form, or why Athena uses an internal file to manage the file schema? – Ash Apr 30 '22 at 10:34