
Is there any way I can convert a pair RDD back to a regular RDD?

Suppose I have a local CSV file, which I first load as a regular RDD:

rdd = sc.textFile("$path/$csv")

Then I create a pair RDD (i.e. the key is the string before the "," and the value is the string after it):

pairRDD = rdd.map(lambda x : (x.split(",")[0], x.split(",")[1]))
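As a side note, the per-line function can be written without splitting twice; the sketch below is plain Python (no SparkContext needed), and uses `maxsplit=1` so any additional commas stay in the value instead of being dropped — a slight change from the indexing approach above:

```python
# Hypothetical helper equivalent to the lambda passed to rdd.map().
# split(",", 1) splits only on the first comma, so "a,b,c" keeps "b,c"
# together as the value rather than silently discarding "c".
def to_pair(line):
    key, value = line.split(",", 1)
    return (key, value)

# to_pair("alice,30")   -> ("alice", "30")
# to_pair("alice,30,x") -> ("alice", "30,x")
```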

I store the pair RDD using saveAsTextFile():

pairRDD.saveAsTextFile("$savePath")

However, as I found out, the stored file contains some unnecessary characters, such as "u'", "(" and ")" (PySpark simply calls str() on each record when storing the key-value pairs). I was wondering if I can convert back to a regular RDD, so that the saved file won't contain "u'", "(" or ")". Or is there any other storage method I can use to get rid of the unnecessary characters?

malana

1 Answer


Those characters are the Python string representation of your data (tuples and Unicode strings). Since you use saveAsTextFile, you should convert your data to text yourself (i.e. a single string per record). You can use map to turn the key/value tuple back into a single string, e.g.:

pairRDD.map(lambda kv: "Value %s for key %s" % (kv[1], kv[0])).saveAsTextFile(savePath)
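The formatting logic itself is plain Python and can be checked without Spark. A minimal sketch (note that tuple-parameter lambdas like `lambda (k,v): ...` were removed in Python 3, so the pair is unpacked explicitly inside the function):

```python
# Hypothetical formatter passed to map() before saveAsTextFile().
# It turns a (key, value) tuple into a single text line, so the saved
# file contains plain strings with no "u'", "(" or ")".
def format_record(kv):
    key, value = kv
    return "%s,%s" % (key, value)

# format_record(("alice", "30")) -> "alice,30"
```

Using `"%s,%s"` gives CSV-style output; any other template (such as the "Value %s for key %s" form above) works the same way.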
user2303197
  • Thank you very much for your help. Really understand the structure well from your explanation. I have tried another way like: pairRDD.map(lambda (x,y): (x+","+y)).saveAsTextFile($savePath). This stores a pair rdd as a csv file (sort of converting it back to a regular rdd). – user3569633 Oct 06 '15 at 20:52
  • hello how to do that with java ? – A.HADDAD Jul 11 '18 at 10:35