When outputting to a text file, Spark just uses the toString representation of each element in the RDD. If you want control over the format, you can do one last transform of the data to a String before the call to saveAsTextFile.
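To see what that default looks like, here is a minimal sketch in plain Scala (no Spark needed). The element below is a hypothetical value shaped like one row of a leftOuterJoin result, not data from your job:

```scala
object DefaultToString {
  // Hypothetical element shaped like a leftOuterJoin row:
  // (key, (leftValue, Option[rightValue]))
  val element: (Int, ((String, String), Option[String])) =
    (1, (("a", "b"), Some("c")))

  // This is the text that would end up in the output file
  // if the element were written without any formatting step.
  val asWritten: String = element.toString

  def main(args: Array[String]): Unit =
    println(asWritten) // prints (1,((a,b),Some(c)))
}
```

The nested parentheses and the Some(...) wrapper are what you usually want to get rid of before saving.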
Luckily, the tuples that arise from using the Spark API can be pulled apart using destructuring. In your example I'd do:
val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val formatted = final_res.map { tuple =>
  // Destructure the nested tuple; f10 is an Option because of leftOuterJoin
  val (f1,((f2,f3,f4,f5,f6,f7,f8,f9),f10)) = tuple
  // Join the fields in the desired output order
  Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}
formatted.saveAsTextFile("C:/out")
The first val line takes the tuple that is passed into the map function and assigns its components to the values on the left. The second line creates a temporary Seq with the fields in the order you want displayed and then invokes mkString(",") to join the fields using a comma.
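One detail worth keeping in mind: the right-hand side of a leftOuterJoin is an Option, so joining it with mkString as-is prints Some(...) or None. The sketch below shows one way to handle that, using a plain Scala Seq in place of an RDD so it runs without a cluster. The row shape, the sample values, and the formatRow helper are all hypothetical, and writing an empty field for a missing match is just one possible choice:

```scala
object FormatJoinedRows {
  // Hypothetical row shape: (key, ((a, b), Option[right]))
  type Row = (Int, ((String, String), Option[String]))

  def formatRow(row: Row): String = {
    // Same destructuring idea as above, with fewer fields
    val (key, ((a, b), right)) = row
    // getOrElse("") writes an empty field instead of "None"
    Seq(key.toString, a, b, right.getOrElse("")).mkString(",")
  }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Row] = Seq(
      (1, (("x", "y"), Some("z"))),
      (2, (("p", "q"), None))
    )
    // Prints:
    // 1,x,y,z
    // 2,p,q,
    rows.map(formatRow).foreach(println)
  }
}
```

With an actual RDD, the same function could be passed to map before the call to saveAsTextFile.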
When there are fewer fields, or you're just hacking away at a problem in the REPL, a slight alternative to the above is to use pattern matching in the partial function passed to map.
simpleJoinedRdd.map { case (key,(left,right)) => s"$key,$left,$right" }
While that lets you write it as a single-line expression, it can throw an exception (a scala.MatchError) at runtime if the data in the RDD doesn't match the pattern provided, as opposed to the earlier example, where the compiler will complain if the tuple parameter cannot be destructured into the expected form.
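To illustrate the runtime failure, here is a small sketch on a plain Scala Seq. The rows and the patterns are hypothetical. The partial version only matches rows whose right side is Some(...), so the row with None triggers a scala.MatchError, which is wrapped in Try here to keep the example running:

```scala
import scala.util.Try

object PatternMismatch {
  // Hypothetical rows: (key, (left, Option[right]))
  val rows: Seq[(Int, (String, Option[String]))] = Seq(
    (1, ("a", Some("b"))),
    (2, ("c", None))
  )

  // Covers both Option cases, so every row matches.
  def formatAll: Seq[String] = rows.map {
    case (key, (left, Some(right))) => s"$key,$left,$right"
    case (key, (left, None))        => s"$key,$left,"
  }

  // Only matches Some(...); the compiler may warn about this,
  // and the row with None throws scala.MatchError at runtime.
  def formatPartial: Try[Seq[String]] = Try {
    rows.map { case (key, (left, Some(right))) => s"$key,$left,$right" }
  }

  def main(args: Array[String]): Unit = {
    println(formatAll)               // List(1,a,b, 2,c,)
    println(formatPartial.isFailure) // true
  }
}
```

On an RDD, map is lazy, so I'd expect such an error to surface only once an action like saveAsTextFile actually runs the job.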