
I am using AWS DMS to migrate data from MySQL as the source endpoint to S3 as the target endpoint.
I want to track updates from the source, so during configuration I enabled the TimestampColumnName property (column name: event_timestamp).
In the result (shown below), I am getting the timestamp of the records/events, but NOT with microsecond precision.

I want microsecond precision to build sequence logic on top of that.
I have investigated the properties of the source endpoint as well as the target, but I am not getting the desired result. Here is the sample output:

[screenshot of the sample output: event_timestamp values have no sub-second precision]

Can somebody take a look and suggest if I am missing any property?
The output format for my files in S3 is Parquet.

BlackJack
  • Can someone from AWS DMS throw some light on the question? – BlackJack Jan 09 '20 at 22:16
  • What tool are you using to query the data out of S3? – Chris Pollard Jan 11 '20 at 05:17
  • Let's see the schema, the code, etc., that stores/loads/transmits the timestamps. – Rick James Jan 11 '20 at 16:38
  • @ChrisPollard: I am outputting a parquet file in S3 and reading the data using spark on my local machine. The screenshot in the question is the output I am getting. – BlackJack Jan 12 '20 at 02:16
  • Do you have the `ParquetTimestampInMillisecond` parameter unset or false? – Oleg Russkin Jan 14 '20 at 14:39
  • @OlegRusskin: Since you reminded me about the `ParquetTimestampInMillisecond` property, I tried again and set it to `True`, but even after that I am getting 000000 in the millisecond part of the timestamp. – BlackJack Jan 14 '20 at 20:09
  • If False (which is the default if unset), `ParquetTimestampInMillisecond` will not change the precision to milliseconds, meaning it might be microseconds. So you at least do not need it to be True. – Oleg Russkin Jan 14 '20 at 21:16
  • One alternative is to leverage the ordering of operations in the CDC file. Try adding a row-number column to the CDC files before actually processing them. – kjsr7 May 18 '20 at 05:21

2 Answers


Unfortunately, the DATETIME column added by the AWS DMS S3 `TimestampColumnName` setting for a change data capture (CDC) load from a MySQL source will only have second precision.

This is because the transaction timestamp in the MySQL binary log only has seconds.


The simplest solution is to add a new column to the MySQL table: a timestamp with microsecond precision whose value is set automatically on insert and/or update, and use this column as event_timestamp.

ts TIMESTAMP(6) DEFAULT CURRENT_TIMESTAMP(6) ON UPDATE CURRENT_TIMESTAMP(6)

Also, check that the AWS DMS S3 target setting `ParquetTimestampInMillisecond` is false (or not present / unset; false is the default).
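If you manage the endpoint programmatically, one way to inspect or set this is `boto3`'s `modify_endpoint` call with the S3 target's `S3Settings`. This is only a minimal sketch: the endpoint ARN, role ARN, and bucket name below are placeholders, and `event_timestamp` / Parquet simply mirror the setup described in the question.

```python
import boto3

dms = boto3.client("dms")

# Placeholder ARNs and bucket name; replace with the real endpoint values.
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:EXAMPLE",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access",
        "BucketName": "my-dms-target-bucket",
        "DataFormat": "parquet",
        "TimestampColumnName": "event_timestamp",
        # False is the default; leaving it unset has the same effect.
        "ParquetTimestampInMillisecond": False,
    },
)
```

The same settings can also be checked in the DMS console on the S3 endpoint; the key point is simply that `ParquetTimestampInMillisecond` should stay false or unset.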


The AWS DMS S3 `TimestampColumnName` setting adds a timestamp column to the output.

In a 'static' read (full load), it will generate the current timestamp:

For a full load, each row of this timestamp column contains a timestamp for when the data was transferred from the source to the target by DMS.

For CDC, it will read the transaction time from the database transaction log:

For a change data capture (CDC) load, each row of the timestamp column contains the timestamp for the commit of that row in the source database.

And its precision will be that of the timestamp in the database transaction log:

...the rounding of the precision depends on the commit timestamp supported by DMS for the source database.

CDC mode is essentially replication. The source database must be configured appropriately to write such a transaction log; the database writes transaction information to this log along with the transaction/commit timestamp.

In the case of MySQL this is the binary log, and the MySQL binlog timestamp is only 32 bits wide - just seconds.


Also, this transaction timestamp may not always be in line with the actual order of transactions or the order in which the changes were actually committed (link 1, link 2).

Oleg Russkin
  • This is what concerns me: `Because transaction timestamp in MySQL binary log has only seconds.` During a full load, timestamps down to the microsecond are supported, and the database itself supports microsecond timestamping (like the TIMESTAMP(6) you suggested), so why during CDC are the commit logs not written with that higher precision? – BlackJack Jan 20 '20 at 00:05
  • timestamp(6) is a timestamp field in the MySQL table with microsecond precision. So this field is already present in the origin db and already contains stored microsecond values. – Oleg Russkin Jan 20 '20 at 05:57
  • `TimestampColumnName` adds a new column to the exported/synced file (or other target). This column and its value were not present in the origin table. If this is a full load (not CDC), the value is generated on the fly and is the `transfer time`, i.e. when the row was copied to the target. That value is generated by AWS DMS, not the origin database, and has microsecond precision. – Oleg Russkin Jan 20 '20 at 05:59
  • 1
    With CDC `TimestampColumnName` takes time value from origin database transaction logs, not generates it at the time of transfer / sync. In CDC mode not table is read and rows copied, but database transaction log (mysql binlog) is read and synced. How database writes its transaction log - depends on the db engine. In case of MySQL - binlog has timestamp for each transaction, and it has only seconds. – Oleg Russkin Jan 20 '20 at 06:03
  • In short, this is a MySQL specific: the transaction timestamp added to the binary log has only second precision. – Oleg Russkin Jan 20 '20 at 06:23

This question is over a year old but I faced the same/similar issue and thought I'd explain how I solved it in case it could help others.

I have tables in RDS and am using DMS to migrate them from RDS to S3. In the DMS task settings, I enabled the timestamp column and the Parquet file format. I want to use the CDC files that get stored in S3 to upsert into my data lake, so I needed to deduplicate the rows by finding the latest action applied to a specific record in the RDS table. Just like the problem you faced, I noticed that the timestamp column did not have high precision, so selecting rows with the max timestamp did not work: it would return multiple rows. Instead, I added a new row_number column, ordered by the timestamp column and partitioned by id, and selected MAX(row_number). This gave me the latest action from the CDC rows that was applied to my table.

table.withColumn("row_number", row_number().over(Window.partitionBy("table_id").orderBy("dms_timestamp")))

The above is PySpark code, as that's the framework I'm using to process my parquet files, but you can do the same in SQL. I noticed that when the records are ordered by the timestamp column, they maintain their original order even if the timestamps are the same.
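Putting the pieces together, here is a minimal self-contained sketch of that deduplication step. The S3 path is a placeholder, and `table_id` / `dms_timestamp` are just the column names from the snippet above; this is a sketch rather than my exact pipeline code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("dms-cdc-dedup").getOrCreate()

# Read the CDC parquet files that DMS wrote to S3 (placeholder path).
cdc = spark.read.parquet("s3://my-bucket/dms/cdc/my_table/")

# Number the changes for each record in timestamp order; rows with equal
# timestamps keep their file order, so the highest row_number per id is
# the most recent change applied to that record.
order_window = Window.partitionBy("table_id").orderBy("dms_timestamp")
id_window = Window.partitionBy("table_id")

latest = (
    cdc
    .withColumn("row_number", F.row_number().over(order_window))
    .withColumn("max_row_number", F.max("row_number").over(id_window))
    .filter(F.col("row_number") == F.col("max_row_number"))
    .drop("row_number", "max_row_number")
)

latest.show()
```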

Hope this could help you with the sequence logic you were trying to implement.

ashah27