
I have two key-value pair RDDs. I join the two RDDs and then call saveAsTextFile. Here is the code:

val enKeyValuePair1 = rows_filter6.map(line => (line(8) -> (line(0),line(4),line(10),line(5),line(6),line(14),line(1),line(9),line(12),line(13),line(3),line(15),line(7),line(16),line(2),line(14))))

val enKeyValuePair = DATA.map(line => (line(0) -> (line(2),line(3))))

val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)

val output = final_res.saveAsTextFile("C:/out")

my output is as follows:
(534309,((17999,5161,45005,00000,XYZ,,29.95,0.00),None))

How can I get rid of all the parentheses? I want my output as follows:

534309,17999,5161,45005,00000,XYZ,,29.95,0.00,None

3 Answers

1

When outputting to a text file, Spark will just use the toString representation of each element in the RDD. If you want control over the format, you can do one last transform of the data to a String before the call to saveAsTextFile.

Luckily the tuples that arise from using the Spark API can be pulled apart using destructuring. In your example I'd do:

val final_res = enKeyValuePair1.leftOuterJoin(enKeyValuePair)
val formatted = final_res.map { tuple =>
  val (f1,((f2,f3,f4,f5,f6,f7,f8,f9),f10)) = tuple
  Seq(f1, f2, f3, f4, f5, f6, f7, f8, f9, f10).mkString(",")
}
formatted.saveAsTextFile("C:/out")

The first val line takes the tuple passed into the map function and assigns its components to the values on the left. The second line creates a temporary Seq with the fields in the order you want them displayed, then invokes mkString(",") to join the fields with commas.
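If you'd also rather not see the literal None for unmatched rows, the Option produced by leftOuterJoin can be unwrapped with getOrElse inside the same map. A minimal sketch, using a plain Seq in place of the RDD so it runs without Spark (the same map body works on final_res); the field names are hypothetical:

val rows = Seq((534309, ((17999, 5161, 45005, 0, "XYZ", "", 29.95, 0.00), Option.empty[(Int, Int)])))
val lines = rows.map { case (key, ((a, b, c, d, e, f, g, h), opt)) =>
  // Unwrap the Option from leftOuterJoin; substitute any default you like
  // in place of "None" when the right side had no match.
  val right = opt.map { case (x, y) => s"$x,$y" }.getOrElse("None")
  Seq(key, a, b, c, d, e, f, g, h, right).mkString(",")
}
// lines.head == "534309,17999,5161,45005,0,XYZ,,29.95,0.0,None"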

In cases with fewer fields, or when you're just hacking away at a problem in the REPL, a slightly shorter alternative is to pattern match in the partial function passed to map:

simpleJoinedRdd.map { case (key, (left, right)) => s"$key,$left,$right" }

While that makes it a single-line expression, it can throw an exception if the data in the RDD don't match the pattern provided, whereas in the earlier example the compiler will complain if the tuple parameter cannot be destructured into the expected form.
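If you'd rather drop non-matching elements than risk a MatchError, collect accepts the same partial function and keeps only the elements it matches. A small sketch on a plain Seq rather than an RDD (Spark's RDD also has a collect overload that takes a partial function):

// collect with a partial function: elements that don't match the pattern
// are silently dropped instead of throwing a MatchError at runtime.
val mixed: Seq[Any] = Seq((1, ("a", "b")), "not a tuple")
val formatted = mixed.collect { case (key, (left, right)) => s"$key,$left,$right" }
// formatted == Seq("1,a,b")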

hayden.sikh
0

You can do something like this:

import scala.collection.JavaConversions._  // implicit Scala-to-Java iterator conversion for Joiner
val output = sc.parallelize(List((534309,((17999,5161,45005,1,"XYZ","",29.95,0.00),None))))
// Build a Buffer of (key, fields..., option), then join its elements with Guava
val result = output.map(p => p._1 +=: p._2._1.productIterator.toBuffer += p._2._2)
  .map(p => com.google.common.base.Joiner.on(", ").join(p.iterator))

I used Guava to format the string, but there is probably a Scala way of doing this.
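For the "Scala way" hinted at above: the sequence built from productIterator can be joined with plain mkString, no Guava dependency needed. A sketch on a single local value (the same expression works inside the RDD's map):

// Pure-Scala variant of the Joiner step: prepend the key, append the
// Option, then join everything with mkString(", ").
val p = (534309, ((17999, 5161, 45005, 1, "XYZ", "", 29.95, 0.00), None))
val joined = ((p._1 +: p._2._1.productIterator.toSeq) :+ p._2._2).mkString(", ")
// joined == "534309, 17999, 5161, 45005, 1, XYZ, , 29.95, 0.0, None"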

abalcerek
-1

Do a flatMap before saving, or write a simple format function and use it in map. Adding a bit of code just to show how it can be done; the function formatOnDemand can be anything.

def formatOnDemand(t):
    out = []
    out.append(t[0])
    for tok in t[1][0]:
        out.append(tok)
    out.append(t[1][1])
    return out

test = sc.parallelize([(534309,((17999,5161,45005,00000,"XYZ","",29.95,0.00),None))])
print test.collect()
print test.map(formatOnDemand).collect()

>>> 
[(534309, ((17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0), None))]
[[534309, 17999, 5161, 45005, 0, 'XYZ', '', 29.95, 0.0, None]]
ayan guha