0

I have requirement to convert a input string 08-DEC-2011 00.00.00 to timestamp 20111208000000 - The data in the file.

"CLIENTCONTEXTID","SRVR","CLNT","USERNAME","UPDATEDTM"
1202,"jbosswabcd6","100.126.164.172","SUSER",08-DEC-2011 00.00.00
1653,"jbossbabcd4","17.26.164.176","SUSER",08-DEC-2011 00.00.00
1654,"jbossbabcd4","12.26.164.221","SUSER",08-JAN-2011 00.00.00
1655,"jbossbabcd4","17.26.164.223","SuperUser",08-DEC-2011 00.00.00
1656,"jbossbabcd4","17.26.164.221","SUSER",08-DEC-2011 00.00.00
1657,"jbossbabcd4","17.26.164.226","SUSER",08-DEC-2011 00.00.00
1658,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
1659,"jbossbabcd4","100.26.164.221","SUSER",08-DEC-2011 00.00.00
2802,"jbosswabcd1","172.20.19.130","SuperUser",08-DEC-2011 00.00.00

As first step i am trying to convert this to timestamp

from pyspark.sql import functions as F
from pyspark.sql.functions import unix_timestamp
df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df.createOrReplaceTempView("tablesss")    
spark.sql('select UPDATEDTM,cast(unix_timestamp(UPDATEDTM,"dd-MMM-YYYY HH.mm.ss") as timestamp) columnn12 from tablesss').show()

I tried another way as well

df = spark.read.csv('/hdfs/context.csv',header=True)
df.printSchema()
df=df.withColumn("datetype_timestamp",F.to_timestamp(F.col("UPDATEDTM"),"dd-MMM-YYYY HH.mm.ss"))
df.select(F.col("UPDATEDTM"),F.col("datetype_timestamp")).show()

Both are resulting in the below

+--------------------+-------------------+
|           UPDATEDTM| datetype_timestamp|
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|
|08-DEC-2011 00.00.00|2010-12-26 00:00:00|

However if i do it using the scala

val df = spark.read.option("header",true).csv("/hdfs/context.csv")
df.withColumn("datetype_timestamp", to_timestamp(col("UPDATEDTM"),"dd-MMM-yyyy HH.mm.ss")).show(false)

The results yielded is

+--------------------+-------------------+
|UPDATEDTM           |datetype_timestamp |
+--------------------+-------------------+
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
|08-DEC-2011 00.00.00|2011-12-08 00:00:00

Not Sure whats the difference between Scala and Pyspark. Same is reacting weird.

After this timestamp i thought of applyting Python strftime to convert to required format.

Please help in identifying the issue and optimal way for this conversion

Rafa
  • 487
  • 7
  • 22

1 Answers1

0

Use yyyy(ordinary calendar year) instead of YYYY(week based calendar).

In scala code you are using yyyy and in pyspark YYYY.

Example:

df.show()
#+--------------------+
#|           UPDATEDTM|
#+--------------------+
#|08-DEC-2011 00.00.00|
#+--------------------+

df.withColumn("dd",to_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss')).show()

#using unix_timestamp function
df.withColumn("dd",unix_timestamp(col("UPDATEDTM"),'dd-MMM-yyyy HH.mm.ss').cast("timestamp")).show()
#+--------------------+-------------------+
#|           UPDATEDTM|                 dd|
#+--------------------+-------------------+
#|08-DEC-2011 00.00.00|2011-12-08 00:00:00|
#+--------------------+-------------------+
notNull
  • 30,258
  • 4
  • 35
  • 50