12

I have been playing around with AWS Glue for some quick analytics by following the tutorial here

While I have been able to successfully create crawlers and discover data in Athena, I've had issues with the data types created by the crawler. The date and timestamp data types get read as string data types.

I followed this up by creating an ETL job in GLUE using the data source created by the crawler as the input and a target table in Amazon S3

As part of the mapping transformation, I converted the data types of the date and timestamp as string to timestamp but unfortunately the ETL converted these column types to NULLS. I have contemplated using classifiers with GROK expressions but then decided transform them as part of ETL in GLUE.

The timestamp format is as 1/08/2010 6:15:00 PM

Balajee Addanki
  • 690
  • 2
  • 9
  • 23

2 Answers2

6

Following are the built-in classifiers for the timestamp data. As you can see the data you have does not match any of them. You will need to create a custom classifier and then add that to a new Crawler (updating existing crawler does not work)

 TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?

 TZ (?:[PMCE][SD]T|UTC)

 DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}

 DATESTAMP_RFC2822 %{DAY}, %{MONTHDAY} %{MONTH} %{YEAR} %{TIME} %{ISO8601_TIMEZONE}

 DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}

 DATESTAMP_EVENTLOG %{YEAR}%{MONTHNUM2}%{MONTHDAY}%{HOUR}%{MINUTE}%{SECOND}

 CISCOTIMESTAMP %{MONTH} %{MONTHDAY} %{TIME}
Faiz
  • 5,331
  • 10
  • 45
  • 57
Tanveer Uddin
  • 1,520
  • 9
  • 15
  • 1
    The complete list of built-in patterns: https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#classifier-builtin-patterns – Daniel Nov 11 '20 at 10:15
3

I don't know what the format of timestamp you use therefore I assumed that the format is the following:

2018-05-30T12:22:07.000Z

In this case you can set the serde parameter in the table definition as below:

'timestamp.formats'='yyyy-MM-dd\'T\'HH:mm:ss.SSSZ'
jbgorski
  • 1,824
  • 9
  • 16
  • 1
    Sorry, just updated my post. I am not 100% sure if there is anyway to include this in a GLUE crawler though. – Balajee Addanki Aug 30 '18 at 00:29
  • I mean in the table definition in Athena, not in Glue Crawler. For your format you should use 'timestamp.formats'='d/MM/yyyy h:mm:ss a' – jbgorski Aug 30 '18 at 13:49
  • CREATE EXTERNAL TABLE `example`( `col0` string, `col1` string, `col2` timestamp) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' LOCATION 's3://example' TBLPROPERTIES ( 'classification'='csv', 'columnsOrdered'='true', 'compressionType'='none', 'delimiter'='\;', 'skip.header.line.count'='1', 'timestamp.formats'='d/MM/yyyy h:mm:ss a', 'transient_lastDdlTime'='1535632231', 'typeOfData'='file') – jbgorski Aug 30 '18 at 13:51
  • 3
    Thanks, I was able to create an external table in Athena directly using the right timestamp format. But I guess the question is how to have the crawler identify appropriate `data types` and make the results available for analytics in Athena. Whether using grok expression or have the ETL job for data transformation. – Balajee Addanki Aug 31 '18 at 00:54
  • 1
    @BalajeeAddanki ,In the Tables: I set the :'timestamp.formats'='d/MM/yyyy h:mm:ss a' with my data looks exact the same as :1/08/2010 6:15:00 PM, however, when I set this column as timestamp, and query it in Aethan, it showed the error like : IVE_BAD_DATA: Error parsing field value '1/08/2010 6:15:00 PM' for field 0: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff] . so are you settting this column as string or timestamp in the table ? – Emma Y Feb 18 '19 at 20:25