1

First, here is the part of the code I want to execute in a .scala file on Spark.

This is my source file. It has structured data with four fields:

val inputFile = sc.textFile("hdfs://Hadoop1:9000/user/hduser/test.csv")

I have declared a case class to store the data from the file in a table with four columns:

case class Table1(srcIp: String, destIp: String, srcPrt: Int, destPrt: Int)

val inputValue = inputFile.map(_.split(",")).map(p => Table1(p(0),p(1),p(2).trim.toInt,p(3).trim.toInt)).toDF()
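
For toDF() to work, I assume the SQLContext implicits are in scope (e.g. imported in the spark-shell):

import sqlContext.implicits._  // enables the implicit RDD-to-DataFrame conversion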

inputValue.registerTempTable("inputValue")

Now, let's say I want to run the following two queries. How can I run them in parallel, given that they are mutually independent? I feel that running them in parallel could reduce the execution time. Right now, they are executed serially.

val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
primaryDestValues.registerTempTable("primaryDestValues")
val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
primarySrcValues.registerTempTable("primarySrcValues")

primaryDestValues.join(primarySrcValues, $"destIp" === $"srcIp").select($"destIp",$"srcIp").show()
amk

4 Answers

1

Maybe you can look in the direction of Futures/Promises. There is a method in SparkContext, submitJob, which returns a Future with the results. So perhaps you can fire two jobs this way and then collect the results from the Futures.

I have not tried this method yet. Just an assumption.
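
For example, a minimal sketch of that idea using plain Scala Futures rather than submitJob (it assumes the sqlContext and the registered inputValue table from the question):

import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global

// Each Future submits its own Spark job; jobs submitted from separate
// threads can run on the cluster at the same time.
val destFuture = Future {
  sqlContext.sql("SELECT distinct destIp FROM inputValue").collect()
}
val srcFuture = Future {
  sqlContext.sql("SELECT distinct srcIp FROM inputValue").collect()
}

// Combine the two Futures and wait for both results.
val combined = for {
  dests <- destFuture
  srcs  <- srcFuture
} yield (dests, srcs)

val (destRows, srcRows) = Await.result(combined, 10.minutes)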

Zeke Fast
  • I have tried, but couldn't get the return value from the Future. I will try to work on it more and, God willing, if successful, will respond. – amk Mar 18 '16 at 16:35
  • @Ahmad After firing two or more jobs and receiving Futures back, what you probably want to do is get an aggregated Future or map the results of those jobs. The most correct way to do it is to define handlers on the resulting chain of Futures for the success and failure cases, or you can just map the results or use a for-comprehension. Please check [this thread](http://stackoverflow.com/questions/16256279/how-to-wait-for-several-futures) for more details about joining Futures. – Zeke Fast Mar 19 '16 at 23:03
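
A small, self-contained sketch of the handler style mentioned in the comment above (again assuming the sqlContext and inputValue table from the question):

import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global
import scala.util.{Failure, Success}

// Hypothetical Future wrapping one of the two queries from the question.
val destFuture = Future {
  sqlContext.sql("SELECT distinct destIp FROM inputValue").collect()
}

// Success/failure handlers run asynchronously once the job finishes.
destFuture.onComplete {
  case Success(rows) => println(s"destIp query returned ${rows.length} rows")
  case Failure(e)    => println(s"destIp query failed: ${e.getMessage}")
}
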
1

No idea why you want to use sqlContext in the first place instead of keeping things simple.

val inputValue = inputFile.map(_.split(",")).map(p => (p(0),p(1),p(2).trim.toInt,p(3).trim.toInt))

Assuming p(0) = destIp, p(1)=srcIp

val joinedValue = inputValue.map{case(destIp, srcIp, x, y) => (destIp, (x, y))}
                  .join(inputValue.map{case(destIp, srcIp, x, y) => (srcIp, (x, y))})
                  .map{case(ip, ((destX, destY), (srcX, srcY))) => (ip, destX, destY, srcX, srcY)}

Now it will be parallelized, and you can even control the number of partitions you want using coalesce.
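
For instance, a hypothetical follow-up (the partition count 8 is only illustrative):

// Reduce the number of partitions of the joined RDD without a full shuffle.
val compacted = joinedValue.coalesce(8)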

Abhishek Anand
  • I am using the SQL context because I have many complex queries for calculating the median, average, etc., so I think it is easier to achieve using SQL. The two queries in the question are part of many queries. – amk Mar 18 '16 at 16:32
  • I wouldn't suggest doing that unless it's absolutely necessary. The idea is simple: when you use SQL, you add an extra layer for Spark to process, which is overhead. Anyway, coming to your question: Spark runs only one job per thread at a time, and that job is invoked along with all of its chained operations when you reach an action, in your case show. I think the core reason your queries are not running in parallel is that both are submitted from the same driver thread through sqlContext, one after the other. – Abhishek Anand Mar 19 '16 at 04:28
0

You can skip the two DISTINCT and do one at the end:

inputValue.select($"srcIp").join(
  inputValue.select($"destIp"), 
  $"srcIp" === $"destIp"
).distinct().show
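
For what it's worth, the same idea can also be written as a single SQL statement against the registered temp table (a sketch, assuming the inputValue table from the question):

sqlContext.sql("SELECT DISTINCT a.srcIp, b.destIp FROM inputValue a JOIN inputValue b ON a.srcIp = b.destIp").show()
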
David Griffin
  • Thanks for answering, but this was just an example. I need to know the method for executing queries in parallel. – amk Mar 17 '16 at 23:35
  • You can't really. You can join them, that's about it. That's your option. A join, one way or another. – David Griffin Mar 17 '16 at 23:39
  • I have read about the "Future" method, but I didn't understand it properly. I think it can be used to run queries in parallel. – amk Mar 17 '16 at 23:45
0

That's a nice question. This can be executed in parallel using par on an array. For this, you have to customize your code accordingly.

Declare an array with two items in it (you can name them as you wish), and write the code you need to execute in parallel inside each case statement.

Array("destIp","srcIp").par.foreach { i => 
{
    i match {
      case "destIp" => {
        val primaryDestValues = sqlContext.sql("SELECT distinct destIp FROM inputValue")
        primaryDestValues.registerTempTable("primaryDestValues")
      }
      case "srcIp" => {
        val primarySrcValues = sqlContext.sql("SELECT distinct srcIp FROM inputValue")
        primarySrcValues.registerTempTable("primarySrcValues")
      }}}
}

Once both case statements have completed, the code below will be executed. Because the two DataFrames were declared inside the closures, read them back from the registered temp tables:

sqlContext.table("primaryDestValues").join(sqlContext.table("primarySrcValues"), $"destIp" === $"srcIp").select($"destIp",$"srcIp").show()

Note: if you remove par from the code, it will run sequentially.

The other option is to create another SparkSession inside the code and execute the SQL using that SparkSession variable. But this is a little risky and has to be used very carefully.

Sarath Subramanian