9

While creating a table in Athena, I am not able to create tables using specific files. Is there any way to select all the files starting with "year_2019" from a given bucket? For e.g. s3://bucketname/prefix/year_2019*.csv


The documentation is very clear about it and it is not allowed.

From: https://docs.aws.amazon.com/athena/latest/ug/tables-location-format.html

Athena reads all files in an Amazon S3 location you specify in the CREATE TABLE statement, and cannot ignore any files included in the prefix. When you create tables, include in the Amazon S3 path only the files you want Athena to read. Use AWS Lambda functions to scan files in the source location, remove any empty files, and move unneeded files to another location.

I will like to know if the community has found some work-around :)

shantanuo
  • 31,689
  • 78
  • 245
  • 403
  • 1
    You can try create a table using all the files and then using condition on the hidden synthetic `"$path"` column to filter what you want. Did you try that? – Piotr Findeisen Mar 13 '19 at 09:12

1 Answers1

15

Unfortunately the filesystem abstraction that Athena uses for S3 doesn't support this. It requires table locations to look like directories, and Athena will add a slash to the end of the location when listing files.

There is a way to create tables that contain only a selection of files, but as far as I know it does not support wildcards, only explicit lists of files.

What you do is you create a table with

STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'

and then instead of pointing the LOCATION of the table to the actual files, you point it to a prefix with a single symlink.txt file (or point each partition to a prefix with a single symlink.txt). In the symlink.txt file you add the S3 URIs of the files to include in the table, one per line.

The only documentation that I know of for this feature is the S3 Inventory documentation for integrating with Athena.

You can also find a full example in this Stackoverflow response: https://stackoverflow.com/a/55069330/1109

Theo
  • 131,503
  • 21
  • 160
  • 205
  • Is there any chance that symbolic link file support will be discontinued by Athena in the future? – shantanuo Apr 06 '19 at 06:18
  • 1
    Very unlikely, it's a Hive feature, and that's not going to go away. It's been around since a very long time. – Theo Apr 06 '19 at 10:51
  • How many times did you encounter timeout error (due to 30 minutes limit)? Will that be a big issue if my data grows beyond a few hundred GB? – shantanuo Apr 09 '19 at 06:47
  • I've only encountered timeouts on multi-terabyte queries, or queries with way too many joins. You should be fine unless you do self joins on those hundreds of GB. – Theo Apr 09 '19 at 10:07