
I have some data in an S3 location in JSON format. It has 4 columns: val, time__stamp, name and type. I would like to create an external Athena table from this data with the transformations given below:

  1. timestamp: the time__stamp should be converted from Unix epoch to UTC; I tried to do this by using the timestamp data type.
  2. name: name should be filtered with the following SQL logic:
    name not in ('abc','cdf','fgh') and name not like '%operator%'
  3. type: rows where type is 'counter' should be removed
  4. I would like to add two partition columns, date and hour, derived from the time__stamp column

I started with the following:

CREATE EXTERNAL TABLE `airflow_cluster_data`(
  `val` string COMMENT 'from deserializer', 
  `time__stamp` timestamp COMMENT 'from deserializer', 
  `name` string COMMENT 'from deserializer', 
  `type` string COMMENT 'from deserializer')
PARTITIONED BY ( 
  date,
  hour)
ROW FORMAT SERDE 
  'org.openx.data.jsonserde.JsonSerDe' 
WITH SERDEPROPERTIES ( 
  'mapping.time_stamp'='@timestamp') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://bucket1/raw/airflow_data'

I tried various things but couldn't figure out the syntax. Using Spark would have been easier, but I don't want to run an Amazon EMR cluster every hour for a small data set. I would prefer to do it in Athena if possible.

Please have a look at some sample data:

1533,1636674330000,abc,counter
1533,1636674330000,xyz,timer
1,1636674330000,cde,counter
41,1636674330000,cde,timer
1,1636674330000,fgh,counter
231,1636674330000,xyz,timer
1,1636674330000,abc,counter
2431,1636674330000,cde,counter
42,1636674330000,efg,timer

John Rotenstein
ndev
  • you can use hive view – smart lph Apr 07 '22 at 06:29
  • Sorry, but what is your actual situation? Are you saying that you can successfully select data from the `airflow_cluster_data` table, and you just want to perform those changes (1-4) that you listed? What do you mean by "should not have values labeled as counter" -- do you want to remove those rows? What have you tried so far (eg creating a view)? – John Rotenstein Apr 07 '22 at 07:07
  • for the `time_stamp` field, first load the data with bigint as it is epoch (not a timestamp) and then shoot a query to transform the bigint to timestamp. see [this](https://stackoverflow.com/questions/44420926/casting-unix-time-to-date-in-presto) for the functions to use. P.S. - Athena is based on/off Presto. – samkart Apr 07 '22 at 10:21
  • @JohnRotenstein Hi John, I created a table from the s3 data without changing anything but I would like to implement the necessary changes into that table or create a view on top of the original table. and yes, I would like to remove the rows that have counter. I started to create a view but keep getting syntax errors, couldn't figure out how to implement those changes. – ndev Apr 07 '22 at 16:05

2 Answers


You can create your own UDF for the transformation and use it in Athena: https://docs.aws.amazon.com/athena/latest/ug/querying-udf.html
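If the UDF route does become available, it is invoked inline in the query. A sketch of the invocation syntax from the Athena docs; the function name `normalize_name` and the Lambda name `my-udf-lambda` are hypothetical, and this assumes a Lambda implementing the Athena UDF interface has already been deployed:

```sql
-- Sketch only: requires Athena engine v2+ and a deployed UDF Lambda.
-- 'my-udf-lambda' is a hypothetical Lambda function name.
USING EXTERNAL FUNCTION normalize_name(name VARCHAR)
    RETURNS VARCHAR
    LAMBDA 'my-udf-lambda'
SELECT normalize_name(name), val, type
FROM airflow_cluster_data;
```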

Ashutosh gupta
  • Thanks Ashutosh, but our work environment doesn't support creating those UDF's at the moment. Is there some alternate way I can explore. – ndev Apr 07 '22 at 15:53

Probably the simplest method is to create a View:

CREATE VIEW foo AS
SELECT
  val,
  cast(from_unixtime(time__stamp / 1000) as timestamp) as timestamp,
  cast(from_unixtime(time__stamp / 1000) as date) as date,
  hour(cast(from_unixtime(time__stamp / 1000) as timestamp)) as hour,
  name,
  type
FROM airflow_cluster_data
WHERE name not in ('abc','cdf','fgh')
  AND name not like '%operator%'
  AND type != 'counter'
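As noted in the comments, this assumes `time__stamp` is declared as `bigint` (epoch milliseconds) rather than `timestamp` in the table DDL, so that the division by 1000 works. Since Hive-style partitions come from the S3 directory layout and cannot be derived from column values in an external table, the view simply computes `date` and `hour` as ordinary columns. A sketch of a matching DDL and a query against the view (the filter values are illustrative):

```sql
-- Sketch: declare the raw epoch value as bigint so it can be divided
CREATE EXTERNAL TABLE `airflow_cluster_data`(
  `val` string,
  `time__stamp` bigint,
  `name` string,
  `type` string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://bucket1/raw/airflow_data';

-- Query the view like any table:
SELECT * FROM foo
WHERE date = DATE '2021-11-11' AND hour = 23;
```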
John Rotenstein