0

I want to compare data in two RDDs. How can I iterate over them and compare the field data in one RDD with the field data in the other? Example files are below:

File1 
 f1  f2       f3    f4    f5      f6  f7
 1 Nancyxyz 23456 12:30 NEWYORK 9000 xyz 
 2 ranboxys 12345 12:30 NEWYORK 9000 xyz

 File2
 f1  f2       f3    f4    f5      f6  f7
 2 ranboxys 12345 12:30 NEWYORK 9000 xyz
 1 markalan 23456 12:30 LONDON  7000 xyz 
 3 Loyleeie 45678 12:40 London  9001 abc

In the files above, the first two records are the same, but they appear in a different order. I want to compare both RDDs and print only the record that differs, i.e.:

 File2
 3 Loyleeie 45678 12:40 London  9001 abc

I don't want the first two records from either RDD, because they are the same and only the order differs. Can you please explain how to do this using RDDs in Scala?

I tried many options, such as subtract and a while loop, but had no luck.

I have just changed the 2nd record in File2. Now I want to print the 2nd and 3rd records in File2, along with the modified fields. I don't know in advance which fields changed: each record should be compared against File1, and if it does not match, the differing record should be printed, followed on another line by the fields that changed.

Nathon
  • Have you tried converting the RDDs to DataFrames and then use the `except` method? – LiMuBei Nov 17 '16 at 13:13
  • @maasg Thanks a lot for sharing your thoughts, I understand that. However, I am not getting only the 3rd, differing record; I am getting 2 records from File2: `1 Nancyxyz 23456 12:30 NEWYORK 9000 xyz` and `3 Loyleeie 45678 12:40 London 9001 abc`. I don't understand what is wrong with the subtract function. Is there any other way? – Nathon Nov 17 '16 at 18:19

1 Answer

3

Assuming that File1 and File2 are of type RDD[String], the following operation returns all elements that are in File2 but not in File1:

scala> val File1 = spark.sparkContext.textFile("File1.txt")

scala> val File2 = spark.sparkContext.textFile("File2.txt")

scala> File2.subtract(File1).collect
res0: Array[String] = Array(" 3 Loyleeie 45678 12:40 London  9001 abc")
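Note that `subtract` compares whole strings, so two lines that differ only in spacing (for example a trailing blank, as on some lines of the sample files) count as different records. That is one possible reason for getting more records back than expected. A minimal sketch of normalizing lines before comparing follows; the helper name `normalize` and the sample values are illustrative, and plain Scala collections stand in for the RDDs so the snippet runs without Spark.

```scala
// Hypothetical helper: trim the line and collapse every run of whitespace
// into a single space, so lines that differ only in spacing compare equal.
def normalize(line: String): String = line.trim.split("\\s+").mkString(" ")

// Sample lines copied from the question (note the trailing blanks on some lines).
val file1Sample = Seq(
  " 1 Nancyxyz 23456 12:30 NEWYORK 9000 xyz ",
  " 2 ranboxys 12345 12:30 NEWYORK 9000 xyz"
)
val file2Sample = Seq(
  " 2 ranboxys 12345 12:30 NEWYORK 9000 xyz",
  " 1 markalan 23456 12:30 LONDON  7000 xyz ",
  " 3 Loyleeie 45678 12:40 London  9001 abc"
)

// Keep the normalized File2 lines that do not occur among the normalized File1 lines.
val seenInFile1 = file1Sample.map(normalize).toSet
val onlyInFile2 = file2Sample.map(normalize).filterNot(seenInFile1)
```

With the RDDs from the answer, the same idea would be `File2.map(normalize).subtract(File1.map(normalize))`; the order of the resulting records is not guaranteed.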

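Here the name is the 2nd field in each line. Because every line starts with a space, split(" ") produces an empty first element, so the name ends up at index 2 (trim the line first if you want it at index 1):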

scala> File2.subtract(File1).map { x => x.split(" ")(2) }.collect
res1: Array[String] = Array(Loyleeie)

If tab is your separator, replace the space in split accordingly.
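For the second part of the question (printing which fields changed), one possible approach is to key each record by its first field and compare the remaining fields position by position. The sketch below assumes that f1 is a unique record id, which the question does not state. The names `parse`, `changedFields` and `report` are hypothetical, and plain Scala collections again stand in for the RDDs.

```scala
// Field names taken from the header row in the question.
val fieldNames = Vector("f1", "f2", "f3", "f4", "f5", "f6", "f7")

// Split a line into fields on whitespace, ignoring leading and trailing blanks.
def parse(line: String): Vector[String] = line.trim.split("\\s+").toVector

// Names of the fields whose values differ between two records.
def changedFields(a: Vector[String], b: Vector[String]): Vector[String] =
  fieldNames.zip(a.zip(b)).collect { case (name, (x, y)) if x != y => name }

// For every File2 record that is new or modified relative to File1,
// produce the record followed by a note on what changed.
// Assumption: the first field (f1) uniquely identifies a record.
def report(file1: Seq[String], file2: Seq[String]): Seq[String] = {
  val file1ById = file1.map(parse).map(r => r.head -> r).toMap
  file2.map(parse).flatMap { rec =>
    val line = rec.mkString(" ")
    file1ById.get(rec.head) match {
      case None => Some(line + " (new record)")
      case Some(old) =>
        val names = changedFields(old, rec).mkString(", ")
        if (names.isEmpty) None else Some(line + " (changed: " + names + ")")
    }
  }
}

// Sample lines copied from the question.
val oldLines = Seq(
  " 1 Nancyxyz 23456 12:30 NEWYORK 9000 xyz ",
  " 2 ranboxys 12345 12:30 NEWYORK 9000 xyz"
)
val newLines = Seq(
  " 2 ranboxys 12345 12:30 NEWYORK 9000 xyz",
  " 1 markalan 23456 12:30 LONDON  7000 xyz ",
  " 3 Loyleeie 45678 12:40 London  9001 abc"
)
val differences = report(oldLines, newLines)
```

With RDDs, the lookup map could be replaced by keying both RDDs with `map(r => (r.head, r))` and calling `leftOuterJoin` on the keyed File2 RDD, which pairs each File2 record with an `Option` holding the matching File1 record; `changedFields` can then be applied inside a `flatMap` in the same way.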

Bhargav Rao
vdep
  • @Nathon, you should probably describe the approach you have tried so far and point out where you are struggling; then we can expand on that. – vdep Nov 22 '16 at 12:48