I have a dataset of crimes committed from 2001 up to the present. I want to calculate the number of crimes committed per year. The code I have tried is:
val inp = SparkConfig.spark.sparkContext.textFile("file:\\C:\\Users\\M1047320\\Desktop\\Crimes_-_2001_to_present.csv")
val header = inp.first()
val data = inp.filter(line => line(0) != header(0)) // intended to skip the header row (compares first characters)
val splitRDD = data.map { line =>
  val temp = line.split(",(?![^\\(\\[]*[\\]\\)])") // split on commas outside parentheses/brackets
  (temp(0), temp(1), temp(2), temp(3), temp(4), temp(5),
   temp(6), temp(7), temp(8), temp(9), temp(10), temp(11),
   temp(12), temp(13), temp(14), temp(15), temp(16), temp(17))
}
val crimesPerYear = splitRDD.map(line => (line._18, 1)).reduceByKey(_ + _) // line._18 represents the year column
crimesPerYear.take(20).foreach(println)
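
To debug, a quick sanity check (my assumption being that the Year column should land at index 17 of the split result) would be to print the header fields next to the first parsed row and see which field actually ends up in each position:

// Sanity check: pair each header field with the corresponding value
// from the first data row, so misaligned columns become visible.
val headerFields = header.split(",(?![^\\(\\[]*[\\]\\)])")
val firstRow = data.first().split(",(?![^\\(\\[]*[\\]\\)])")
headerFields.zip(firstRow).zipWithIndex.foreach {
  case ((name, value), i) => println(s"$i: $name = $value")
}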
The expected result is something like:
(2001,54)
(2002,100)
(2003,24)
and so on.
But I am getting results like this:
(1175860,1)
(1176964,4)
(1178665,123)
(1171273,3)
(1938926,1)
(1141621,8)
(1136278,2)
I am totally confused about what I am doing wrong. Why are the years summing up? Please help me.