I would like to perform a simple query on a CSV file using Spark SQL. My code is this:
def format_date(date):
    return date

movies = spark.read.format('csv') \
    .options(header='false', inferSchema='true') \
    .load("hdfs://master:9000/movie_data_csv/movies.csv")
movies.registerTempTable("movies")
spark.udf.register("date_formatter", format_date)

sqlString = \
    "select date_formatter(_c3) as Year " + \
    "from movies " + \
    "limit 1"
res = spark.sql(sqlString)
res.show(res.count(), False)
My movies CSV file contains a datetime object in column 3, as seen below:
Row(... _c3=datetime.datetime(1995, 10, 30, 0, 0),...)
What I would like to do is use format_date to return and show (as Year) only the year from this datetime value. If I just use _c3 directly in my query, it works fine and shows the whole datetime value. Passing it through the function, however, produces this:
|java.util.GregorianCalendar[time=?,areFieldsSet=false,areAllFieldsSet=false,lenient=true,zone=sun.util.calendar.ZoneInfo[id="Europe/Athens",offset=7200000,dstSavings=3600000,useDaylight=true,transitions=138,lastRule=java.util.SimpleTimeZone[id=Europe/Athens,offset=7200000,dstSavings=3600000,useDaylight=true,startYear=0,startMode=2,startMonth=2,startDay=-1,startDayOfWeek=1,startTime=3600000,startTimeMode=2,endMode=2,endMonth=9,endDay=-1,endDayOfWeek=1,endTime=3600000,endTimeMode=2]],firstDayOfWeek=1,minimalDaysInFirstWeek=1,ERA=?,YEAR=1972,MONTH=3,WEEK_OF_YEAR=?,WEEK_OF_MONTH=?,DAY_OF_MONTH=28,DAY_OF_YEAR=?,DAY_OF_WEEK=?,DAY_OF_WEEK_IN_MONTH=?,AM_PM=0,HOUR=0,HOUR_OF_DAY=0,MINUTE=0,SECOND=0,MILLISECOND=0,ZONE_OFFSET=?,DST_OFFSET=?]|Tout Va Bien|0
Could someone explain to me what has happened here, and how I could indeed return just the year using the format_date function?
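For clarity, this is the plain-Python behaviour I am trying to reproduce through the UDF (a standalone sketch using the same sample value, outside of Spark):

```python
import datetime

# The value Spark hands to the UDF should be a Python datetime like this one:
d = datetime.datetime(1995, 10, 30, 0, 0)

# What I ultimately want the "Year" column to contain:
print(d.year)       # 1995
print(str(d.year))  # "1995", if the UDF has to return a string
```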