
I have a dataset of crimes committed from 2001 up to the present, and I want to calculate the number of crimes (no_of_crimes) per year. The code I have tried is:

val inp = SparkConfig.spark.sparkContext.textFile("file:\\C:\\Users\\M1047320\\Desktop\\Crimes_-_2001_to_present.csv")
val header = inp.first()
// keep only lines that do not start with the header's first character
val data   = inp.filter( line => line(0) != header(0))

val splitRDD = data.map( line => {
  // split on commas that are not inside parentheses or brackets
  val temp = line.split(",(?![^\\(\\[]*[\\]\\)])")
  (temp(0),temp(1),temp(2),temp(3),temp(4),temp(5),
   temp(6),temp(7),temp(8),temp(9),temp(10),temp(11),
   temp(12),temp(13),temp(14),temp(15),temp(16),temp(17))
})

val crimesPerYear = splitRDD.map( line => (line._18,1)).reduceByKey(_ + _) // line._18 (i.e. temp(17)) represents the year column
crimesPerYear.take(20).foreach(println)

The expected result is something like:

(2001,54)
(2002,100)
(2003,24)
and so on.

But the result I am getting is:

(1175860,1)
(1176964,4)
(1178665,123)
(1171273,3)
(1938926,1)
(1141621,8)
(1136278,2)

I am totally confused about what I am doing wrong. Why are the years summing up? Please help me.

Niketa
  • Are you sure `line._18` will give you the correct values, i.e. years? Also, have you considered using the newer DataFrame API? It's easier to use and makes for clearer code (see the sketches after these comments). – Shaido Sep 18 '18 at 06:14
  • `line._18` gives the year only. I know the DataFrame APIs provide easier ways to execute queries, but I have to try it with RDDs. – Niketa Sep 18 '18 at 06:19
  • Can you share some sample data? – Balaji Reddy Sep 18 '18 at 06:26
  • @Niketa: I see. The line with `reduceByKey` looks correct, though; maybe you can check once more that the result of `splitRDD.map( line => (line._18,1))` looks as expected. – Shaido Sep 18 '18 at 06:28
  • `splitRDD.map( line => (line._18,1))` is giving the correct output: (2004,1), (2004,1), (2002,1), (2001,1), and so on. – Niketa Sep 18 '18 at 08:02
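
A quick way to check Shaido's first point is to split the header with the same regex as the data and print every field with its index; that shows which position actually holds the year. This is only a debugging sketch, reusing the question's own path, regex, and SparkConfig object:

// Split the header with the question's regex and print each field with
// its index, to confirm whether temp(17) (i.e. line._18) is the year.
val inp = SparkConfig.spark.sparkContext.textFile("file:\\C:\\Users\\M1047320\\Desktop\\Crimes_-_2001_to_present.csv")
val header = inp.first()
header.split(",(?![^\\(\\[]*[\\]\\)])").zipWithIndex.foreach {
  case (field, idx) => println(s"$idx -> $field")
}

// Also check that every data row splits into the same number of fields;
// rows with a different count would shift the year's position.
inp.filter(line => line(0) != header(0))
   .map(_.split(",(?![^\\(\\[]*[\\]\\)])").length)
   .countByValue()
   .foreach(println)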
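
And a minimal sketch of the DataFrame route Shaido suggests, assuming the file has a header row and that the year column is literally named "Year" (adjust the name to whatever the check above reports). Spark's CSV reader handles quoted fields itself, so no hand-rolled regex is needed:

// Read the CSV with the built-in reader, using the first line as the
// column names; the column name "Year" is an assumption about the header.
val spark = SparkConfig.spark
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("file:\\C:\\Users\\M1047320\\Desktop\\Crimes_-_2001_to_present.csv")

// One row per year with its crime count, e.g. (2001, 54).
df.groupBy("Year").count().orderBy("Year").show(20)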

0 Answers