I am saving the Spark 1.6 DataFrame below into a Phoenix table. The problem I am facing is that withColumn("create_ts", current_timestamp()) inserts the same timestamp for the entire DataFrame; please see the example below.
I want a unique timestamp, in milliseconds, for each row of each job. Because of this issue a lot of data has been overwritten, since rows within the same job end up with the same composite key.
Sample data:
+-----------------------------------------+-------------------------+
| JOB_NAME                                | CREATE_TS               |
+-----------------------------------------+-------------------------+
| ETL_JOB_application_1500036106103_27268 | 2017-08-03 06:18:31.593 |
| ETL_JOB_application_1500036106103_27268 | 2017-08-03 06:18:31.593 |
| ETL_JOB_application_1500036106103_27268 | 2017-08-03 06:18:31.593 |
| ETL_JOB_application_1500036106103_27266 | 2017-08-03 06:16:39.243 |
| ETL_JOB_application_1500036106103_27266 | 2017-08-03 06:16:39.243 |
| ETL_JOB_application_1500036106103_27266 | 2017-08-03 06:16:39.243 |
| ETL_JOB_application_1500036106103_27266 | 2017-08-03 06:16:39.243 |
| ETL_JOB_application_1500036106103_27266 | 2017-08-03 06:16:39.243 |
| ETL_JOB_application_1500036106103_27266 | 2017-08-03 06:16:39.243 |
| ETL_JOB_application_1500036106103_27266 | 2017-08-03 06:16:39.243 |
+-----------------------------------------+-------------------------+
Code:
stagedDataFrame
  .select($"RemoteID", $"TagName", $"TagValueTs", $"Value", $"TagTypeName")
  .withColumn("job_name", lit(s"${etlStatistics2.sqlContext.sparkContext.appName}_${etlStatistics2.sqlContext.sparkContext.applicationId}"))
  .withColumn("create_ts", current_timestamp()) // evaluated once per query, so every row gets the same value
  .withColumn("record_count", lit(etlStatistics2.head().getLong(3)))
  .select($"job_name", $"create_ts", $"record_count", $"RemoteID" as "remoteid", $"TagName" as "tagname", $"TagValueTs" as "tagvalue_ts", $"Value" as "value", $"TagTypeName" as "tagtypename")