presto syntax for csv external table with array in one of the fields

Question

I'm having trouble creating a table in Athena - that points at files with the following format: string, string, string, array.

when I wrote the file - I delimited the array items with '|'. I delimited each line with '\n' and each column with ','.

so for example, a row in my CSV would look like that:

Garfield, 15, orange, fish|milk|lasagna

in hive (according to the documentation i read)- when creating a table with a row delimited format - while stating the delimiters you can state a 'collection items' delimiter - that states the delimiter between elements in array columns. I could not find an equivalent for Presto in the documentation,

Is anyone aware if it's possible, if so - what is the format, or where can I find it?

i tried "guessing" many forms, including 'collection items', none seem to work.

CREATE EXTERNAL TABLE `cats`(
  `name` string, 
  `age` string, 
  `color` string, 
  `foods` array<string>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  LINES TERMINATED BY '\n' 
  COLLECTION ITEMS TERMINATED BY '|'
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'some-location'

Would really appreciate any insight, Thanks! :)

Piotr Findeisen · Answer 1 · 2019-11-07T08:37:23.873

2

(Note: this answer is applicable to Presto in general, but not to Athena)

Currently you cannot set collection delimiter in Presto. Please create a feature request @ https://github.com/prestosql/presto/issues/

Note, we plan to provide generic support for table properties to address cases like this holistically -- https://github.com/prestosql/presto/issues/954. You can track the issue and associated pull request for updates.

edited Nov 07 '19 at 08:37

answered Nov 06 '19 at 11:15

Piotr Findeisen

19,480
2
52
82

Hey , i changed the order of the delimiters and it worked, so that's strange. if i use this order: ` ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' COLLECTION ITEMS TERMINATED BY '|' LINES TERMINATED BY '\n' ` it created the table and recognized the array – liormayn Nov 06 '19 at 12:47
1

@liormayn i overlooked you're using Athena. While Athena is hosted Presto, it actually uses Hive syntax for DDLs. My answer was about Presto only. – Piotr Findeisen Nov 06 '19 at 20:16

Ilya Kisil · Accepted Answer · 2019-11-06T12:58:00.573

According to AWS Athena docs on using SerDe, your guess was 100% correct.

In general, Athena uses the LazySimpleSerDe if you do not specify a ROW FORMAT, or if you specify ROW FORMAT DELIMITED
ROW FORMAT
DELIMITED FIELDS TERMINATED BY ','
ESCAPED BY '\\'
COLLECTION ITEMS TERMINATED BY '|'
MAP KEYS TERMINATED BY ':'

Now, when I simply tried your DDL statement, I would get

line 1:8: no viable alternative at input 'create external'

However by deleting LINES TERMINATED BY '\n', I was able to create table schema in meta catalog

CREATE EXTERNAL TABLE `cats`(
  `name` string, 
  `age` string, 
  `color` string, 
  `foods` array<string>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ',' 
  COLLECTION ITEMS TERMINATED BY '|'
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'some-location'

Sample file with lines as shown in your file would get parsed correctly and I was able to do UNNEST on foods column:

SELECT * 
FROM "cats" 
CROSS JOIN UNNEST(foods) as t(food)

which resulted in

Moreover, it was also enough to simply swap lines LINES TERMINATED BY '\n' and COLLECTION ITEMS TERMINATED BY '|' for query to work (although I don't have an explanation for it)

CREATE EXTERNAL TABLE `cats`(
  `name` string, 
  `age` string, 
  `color` string, 
  `foods` array<string>)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ','   
  COLLECTION ITEMS TERMINATED BY '|'
  LINES TERMINATED BY '\n' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  'some-location'

Yes you are right, i got to the same conclusion but in a different way - the problem is the order, if you set the collection items before the lines delimieter it works. — liormayn, Nov 06 '19 at 12:49

score 0 · Answer 3 · edited Nov 04 '21 at 07:56

I'm use presto engine creating a hive table , set collection delimiter in Presto, for example:

CREATE TABLE IF NOT EXISTS test (        
       id bigint COMMENT 'ID',                   
       type varchar COMMENT 'TYPE',                   
       content varchar COMMENT 'CONTENT',          
       create_time timestamp(3) COMMENT 'CREATE TIME',          
       pt varchar                                    
)                                                 
COMMENT 'create time 2021/11/04 11:27:53'` 
WITH (                                             
       format = 'TEXTFILE',                            
       partitioned_by = ARRAY['pt'],                   
       textfile_field_separator = U&'\0001'            
)

presto syntax for csv external table with array in one of the fields

3 Answers3