2

I have multiple file in s3 bucket directory (All files are in same directory). All files have different structure like if one file has 4 columns then second file has 6 different columns. How can I use these files to create Athena table?

Prakash
  • 45
  • 4
  • 11
  • It depends on your file format, whether the SerDe can be adapted into reading it. Can you explain your file format, e.g. JSON, CSV, ORC? – jens walter Apr 25 '17 at 09:58
  • I am using json file – Prakash Apr 26 '17 at 11:22
  • Please share a sample (let say 3 rows) from each file – David דודו Markovitz Apr 26 '17 at 19:03
  • first.json: {"MPR-UT":"47867739","MPR-Prd":"EEE","MPR-Process":"LOAD","MPR-TimeStamp":"2017-04-11 16:13:23"} {"MPR-UT":"78676765","MPR-Prd":"ABC","MPR-Process":"LOAD","MPR-TimeStamp":"2017-04-11 16:13:24"} {"MPR-UT":"87420087","MPR-Prd":"XYZI","MPR-Process":"LOAD","MPR-TimeStamp":"2017-04-11 16:03:25"} second.json: {"CPR-BatchID":"PT5_20170407","CPR-Prd":"X","CPR-UT":"47867739","CPR-CoverageStatus":"N"} {"CPR-BatchID":"PT5_20170406","CPR-Prdt":"X","CPR-UT":"78676765","CPR-CoverageStatus":"N"} {"CPR-BatchID":"PT5_20170405","CPR-Prd":"X","CPR-UT":"87420087","CPR-CoverageStatus":"N"} – Prakash Apr 27 '17 at 07:19

4 Answers4

1

I believe this is currently not possible with AWS Athena / Presto (please tell me otherwise!).

The external location for an Athena table (always EXTERNAL TABLE) must be a directory.

s3://mybucket/folder/table/file_a and s3://mybucket/folder/table/file_b with different stuctures will lead to SCHEMA_MISMATCH.

Athena expects tables in the format s3://mybucket/folder/table1/ and s3://mybucket/folder/table2/.

Update:

From a conversation with AWS support:

For the first path requirements you have to move the files per table wise in separate folders else Athena will always consider all the files are for a single table.

kadrach
  • 408
  • 6
  • 11
1

You can create tables in Athena that point to specific files.

What you do is you create a table with a special input format:

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'

and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.

The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.

You can also find a full example in this Stackoverflow response: https://stackoverflow.com/a/55069330/1109

Theo
  • 131,503
  • 21
  • 160
  • 205
0

If your files contain different data, you'll need to define them as different external tables (each with their own CREATE EXTERNAL TABLE statement).

You could then run queries across multiple tables via JOINs.

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
-1

With the following scenario.

data1.json:

{"a":"data1","b":"data2"}

data2.json

{"c":"data3","d":"data4"}

you can create the following tables:

create external table data1 (
  a string,
  b string
)
ROW FORMAT  serde 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/x';
create external table data2 (
  c string,
  d string
)
ROW FORMAT  serde 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket/x';

Now you can query the separate files, defined through separate tables. The only thing you have to take care of is, that the different file types will generate null entries in your result.

select * from data1 where a is not null;
select * from data2 where c is not null;
jens walter
  • 13,269
  • 2
  • 56
  • 54
  • what if both data1.json and data2.json both are in same directory(say s3://bucket/dir/)? How would I mention location in create table query. Is it a valid scenerio or both file must be present in different directory? – Prakash Apr 27 '17 at 06:06
  • 1
    Techincally, Athena can go through all files in the directory, but it will then create null-rows for not matching file entries. It is possible to filter the null-rows in the query. On the other hand, it would be much cleaner to have one directory per filetype, but this is not always doable. So just creating 2 tables on 1 directory is the next best thing. You can create as much tables for one directory as you want with statements like the ones above with a "LOCATION 's3://bucket/dir'". – jens walter Apr 27 '17 at 07:57