1

I'm having trouble creating a table in Athena - that points at files with the following format: string, string, string, array.

when I wrote the file - I delimited the array items with '|'. I delimited each line with '\n' and each column with ','.

so for example, a row in my CSV would look like that:

Garfield, 15, orange, fish|milk|lasagna

in hive (according to the documentation i read)- when creating a table with a row delimited format - while stating the delimiters you can state a 'collection items' delimiter - that states the delimiter between elements in array columns. I could not find an equivalent for Presto in the documentation,

Is anyone aware if it's possible, if so - what is the format, or where can I find it?

i tried "guessing" many forms, including 'collection items', none seem to work.

CREATE EXTERNAL TABLE `cats`(
  `name` string, 
  `age` string, 
  `color` string, 
  `foods` array<string>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
  COLLECTION ITEMS TERMINATED BY '|'
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'some-location'

Would really appreciate any insight, Thanks! :)

liormayn
  • 203
  • 3
  • 12

3 Answers3

2

(Note: this answer is applicable to Presto in general, but not to Athena)

Currently you cannot set collection delimiter in Presto. Please create a feature request @ https://github.com/prestosql/presto/issues/

Note, we plan to provide generic support for table properties to address cases like this holistically -- https://github.com/prestosql/presto/issues/954. You can track the issue and associated pull request for updates.

Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82
  • Hey , i changed the order of the delimiters and it worked, so that's strange. if i use this order: ` ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' LINES TERMINATED BY '\n' ` it created the table and recognized the array – liormayn Nov 06 '19 at 12:47
  • 1
    @liormayn i overlooked you're using Athena. While Athena is hosted Presto, it actually uses Hive syntax for DDLs. My answer was about Presto only. – Piotr Findeisen Nov 06 '19 at 20:16
2

According to AWS Athena docs on using SerDe, your guess was 100% correct.

In general, Athena uses the LazySimpleSerDe if you do not specify a ROW FORMAT, or if you specify ROW FORMAT DELIMITED

ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
ESCAPED BY '\\'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'

Now, when I simply tried your DDL statement, I would get

line 1:8: no viable alternative at input 'create external'

However by deleting LINES TERMINATED BY '\n', I was able to create table schema in meta catalog

CREATE EXTERNAL TABLE `cats`(
  `name` string, 
  `age` string, 
  `color` string, 
  `foods` array<string>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  COLLECTION ITEMS TERMINATED BY '|'
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'some-location'

Sample file with lines as shown in your file would get parsed correctly and I was able to do UNNEST on foods column:

SELECT * 
FROM "cats" 
CROSS JOIN UNNEST(foods) as t(food)

which resulted in sample_result

Moreover, it was also enough to simply swap lines LINES TERMINATED BY '\n' and COLLECTION ITEMS TERMINATED BY '|' for query to work (although I don't have an explanation for it)

CREATE EXTERNAL TABLE `cats`(
  `name` string, 
  `age` string, 
  `color` string, 
  `foods` array<string>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ','   
  COLLECTION ITEMS TERMINATED BY '|'
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'some-location'
Ilya Kisil
  • 2,490
  • 2
  • 17
  • 31
  • Yes you are right, i got to the same conclusion but in a different way - the problem is the order, if you set the collection items before the lines delimieter it works. – liormayn Nov 06 '19 at 12:49
0

I'm use presto engine creating a hive table , set collection delimiter in Presto, for example:

CREATE TABLE IF NOT EXISTS test (        
       id bigint COMMENT 'ID',                   
       type varchar COMMENT 'TYPE',                   
       content varchar COMMENT 'CONTENT',          
       create_time timestamp(3) COMMENT 'CREATE TIME',          
       pt varchar                                    
)                                                 
COMMENT 'create time 2021/11/04 11:27:53'` 
WITH (                                             
       format = 'TEXTFILE',                            
       partitioned_by = ARRAY['pt'],                   
       textfile_field_separator = U&'\0001'            
)
Peter Csala
  • 17,736
  • 16
  • 35
  • 75