I'm experimenting with AWS Athena and am attempting to create a table from an S3 bucket whose files are structured like so:

my-bucket/
my-bucket/group1/
my-bucket/group1/entry1/
my-bucket/group1/entry1/data.bin
my-bucket/group1/entry1/metadata
my-bucket/group1/entry2/
my-bucket/group1/entry2/data.bin
my-bucket/group1/entry2/metadata
...
my-bucket/group2/
...

Only the metadata files are JSON files. Each one looks like this:

{
    "key1": "value1",
    "key2": "value2",
    "key3": n
}

So I tried to create a table:

CREATE EXTERNAL TABLE example (
  key1 string,
  key2 string,
  key3 int
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://my-bucket/'

The create query succeeded, but when I attempt to query:

SELECT * FROM example LIMIT 10;

I get an error:

Query 93aa62d6-8a52-4a5d-a2fb-08a6e00181d3 failed with error code HIVE_CURSOR_ERROR: org.codehaus.jackson.JsonParseException: Unexpected end-of-input: expected close marker for OBJECT (from [Source: java.io.ByteArrayInputStream@2da7f4ef; line: 1, column: 0]) at [Source: java.io.ByteArrayInputStream@2da7f4ef; line: 1, column: 3]

Does AWS Athena require all files in the bucket to be JSON in this case? I'm not sure whether the .bin files are causing the cursor error or something else is going on. Has anyone else encountered this, or can anyone clue me in as to what is going on?

John Rotenstein
thinkski

2 Answers

Yes, Athena (like Presto and Hive) requires that all files stored under the table's LOCATION have a consistent format. I believe you need to move the files so that each underlying data schema has its own location, and then create a separate table for each one.
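
To illustrate the split, here is a minimal Python sketch, assuming the key layout from the question. The helper name plan_metadata_moves and the metadata-only/ prefix are hypothetical, not something Athena prescribes. It only works out which keys would move; the actual copy with an S3 client is left out.

```python
def plan_metadata_moves(keys, dest_prefix="metadata-only/"):
    """Map each JSON metadata key to a key under a separate prefix.

    Folder placeholders (keys ending in '/') and anything not named
    'metadata' (such as data.bin) are skipped.
    dest_prefix is an assumed name chosen for this example.
    """
    moves = {}
    for key in keys:
        if key.endswith("/"):
            continue  # folder placeholder, nothing to move
        if key.rsplit("/", 1)[-1] == "metadata":
            moves[key] = dest_prefix + key
    return moves


# Keys as they might appear in a listing of my-bucket
keys = [
    "group1/",
    "group1/entry1/",
    "group1/entry1/data.bin",
    "group1/entry1/metadata",
    "group1/entry2/",
    "group1/entry2/data.bin",
    "group1/entry2/metadata",
]

moves = plan_metadata_moves(keys)
for src, dst in moves.items():
    print(src, "->", dst)
```

Under these assumptions, a table whose LOCATION is 's3://my-bucket/metadata-only/' would contain only the JSON files.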

James
  • Thanks James. Is this in the documentation anywhere? – thinkski Dec 05 '16 at 18:01
  • Not that I have found. I don't believe the concept is supported by Hive table definitions (see [11269203 discussion](http://stackoverflow.com/q/11269203)), and I did not find docs for a Presto feature that would exclude files from a select. – James Dec 05 '16 at 18:44

Recently I discovered that if you prefix a file name with an underscore (_), Hive will ignore the file. So in your example, you could rename each data.bin to _data.bin and those files would be ignored.
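
A rough Python sketch of the renaming, assuming the key layout from the question. The helper name hidden_key is hypothetical. S3 has no in-place rename, so each pair would have to be applied as a copy followed by a delete, which is not shown here.

```python
import posixpath


def hidden_key(key):
    """Return the key with an underscore prefixed to its file name.

    Folder placeholders and names that already start with '_' are
    returned unchanged.
    """
    head, name = posixpath.split(key)
    if not name or name.startswith("_"):
        return key
    return posixpath.join(head, "_" + name)


keys = [
    "group1/entry1/data.bin",
    "group1/entry1/metadata",
    "group1/entry2/data.bin",
    "group1/entry2/metadata",
]

# Only the binary files need hiding; the JSON metadata keeps its name.
renames = {k: hidden_key(k) for k in keys if k.endswith(".bin")}
for src, dst in renames.items():
    print(src, "->", dst)
```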

Eugene