
Let's say I have two RDDs. The first is composed of strings, each an HTTP request log line:

rdd1:

serverIP:80 clientIP1 - - [10/Jun/2016:10:47:37 +0200] "GET /path/to/page1 [...]"
serverIP:80 clientIP2 - - [11/Jun/2016:11:25:12 +0200] "GET /path/to/page2 [...]"
...

The second RDD is simply composed of doubles:

rdd2:

0.025
0.56
...

I would like to concatenate them line by line in order to obtain a third RDD like this:

rdd3:

serverIP:80 clientIP1 - - [10/Jun/2016:10:47:37 +0200] "GET /path/to/page1 [...]" 0.025
serverIP:80 clientIP2 - - [11/Jun/2016:11:25:12 +0200] "GET /path/to/page2 [...]" 0.56
...

By the way, this is a streaming job. That is, I don't want to permanently store the data in some kind of SQL table or anything else.

Any idea on how to tackle this?

Thanks in advance !

EDIT: For people trying to join DStreams and not RDDs, have a look at this: How to Combine two Dstreams using Pyspark (similar to .zip on normal RDD)
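For reference, a minimal sketch of the DStream case using `transformWith`, which applies a function to the underlying RDDs of each micro-batch. The stream names `logLines` and `durations` are illustrative, and this assumes the two streams stay aligned batch by batch:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical input DStreams: logLines (strings) and durations (doubles).
// transformWith zips the RDD behind each micro-batch, which requires that
// both batches have the same partitioning and element counts.
val joined = logLines.transformWith(durations,
  (logs: RDD[String], nums: RDD[Double]) =>
    logs.zip(nums).map { case (line, d) => line + " " + d })
```

If the two streams can drift out of step, this positional pairing breaks silently, so a keyed join on a shared identifier is safer when one is available.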

Robin Dupont

1 Answer


If you can rely on the order of the two RDDs matching, you can use zip:

val rdd1 = sc.parallelize(List("a", "b", "c"))
val rdd2 = sc.parallelize(List(1.1, 1.2, 1.3))

// zip pairs elements positionally; map joins each pair into one string
val rdd3 = rdd1.zip(rdd2).map { case (s, d) => s + " " + d }

rdd3.collect().foreach(println)

// a 1.1
// b 1.2
// c 1.3
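Note that `zip` requires both RDDs to have the same number of partitions and the same number of elements per partition. When that doesn't hold, one common workaround (a sketch, not part of the original answer) is to index both sides and join on the index:

```scala
// zipWithIndex assigns each element its position; joining on that
// index pairs lines regardless of partition layout.
val indexed1 = rdd1.zipWithIndex().map { case (s, i) => (i, s) }
val indexed2 = rdd2.zipWithIndex().map { case (d, i) => (i, d) }

val rdd3b = indexed1.join(indexed2)
  .sortByKey()
  .map { case (_, (s, d)) => s + " " + d }
```

This costs a shuffle that plain `zip` avoids, so prefer `zip` when the partitioning is known to line up.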
osiris42
  • Many thanks for your answer! No doubt that it should work, but unfortunately I discovered that I use DStreams and not RDDs. Do you have any hints on how to adapt your method? I am currently trying to extract the RDDs from the DStream! – Robin Dupont Jul 11 '16 at 10:16