
So here is the data. The table structure is:

CREATE TABLE products (
  product_id int(11) NOT NULL AUTO_INCREMENT,
  product_category_id int(11) NOT NULL,
  product_name varchar(45) NOT NULL,
  product_description varchar(255) NOT NULL,
  product_price float NOT NULL,
  product_image varchar(255) NOT NULL,
  PRIMARY KEY (product_id)
) ENGINE=InnoDB AUTO_INCREMENT=1346 DEFAULT CHARSET=utf8

Updated: my environment is Spark 1.6.2 and Scala 2.10.5

I want to get an RDD sorted by product_name asc, product_price desc.

I know how to sort the RDD by a single field:

val p = sc.textFile("products")
val fp = p.filter(r => r.split(",")(4) != "")
val mfp = fp.map(r => (r.split(",")(4).toFloat, r))
    .sortByKey(false)
    .map(r => (r._2.split(",")(4), r._2.split(",")(2)))

Now I have the two fields only: product_price and product_name.

I can do the sorting:

mfp.sortBy(r=>(r._1, r._2))

This gives me the result sorted by name and then by price, both ascending:

(10,159.99)
(10,159.99)
(10,169.99)
(10,1799.99)
(10,189.0)
(10,199.98)
(10,199.99)
(10,199.99)
(10,1999.99)
(10,269.99)

What I need is (product_category_id, product_name, product_price), sorted by product_category_id in asc, and then product_price desc.

And I only want the top 3 products per product_category_id.


2 Answers


You almost got it. RDD sortBy always sorts in ascending order. However, since your field is a number, you can get a reverse ordering just by multiplying it by -1. I used a case class to make your code a bit more readable.

case class Info(product_category_id: Int, product_name: String, product_price: Double)

// Split each line and keep only the rows that actually carry a price
val rdd = sc.textFile("products")
    .map(line => line.split(","))
    .filter(split => split.length > 4 && split(4).nonEmpty)

val infos = rdd.map { split =>
    Info(
        product_category_id = split(1).toInt,
        product_name = split(2),
        product_price = split(4).toDouble
    )
}

val sorted = infos.sortBy(info => (info.product_category_id, -info.product_price))

Bad news is, this way you're not grouping by product_category_id, so you can't get the top 3 products per category. Luckily, Spark MLlib provides a method that does exactly what you want, implemented using a bounded priority queue.

import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.rdd.MLPairRDDFunctions._
val keyByCategory = infos.keyBy(_.product_category_id)
// topByKey keeps the k largest values per key under the given Ordering,
// so ordering by price directly selects the 3 most expensive products
val topByKey: RDD[(Int, Array[Info])] = keyByCategory.topByKey(3)(Ordering.by[Info, Double](_.product_price))
val topWithKeysSorted = topByKey.sortBy(_._1)
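
To get flat rows back out, you can flatMap the arrays. Note that topByKey does not guarantee any order inside each array, so this sketch (using the topWithKeysSorted variable from above) also sorts each group by price descending:

// flatMap unrolls each (category, Array[Info]) pair into one row per
// product, sorting each group by price descending before emitting it
val flat = topWithKeysSorted.flatMap { case (category, top3) =>
    top3.sortBy(-_.product_price).map { info =>
        (category, info.product_name, info.product_price)
    }
}
flat.collect().foreach(println)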
  • Thank you very much Miguel, filterNot is not recognized here – Choix Mar 17 '18 at 23:55
  • I forgot it's not present in the RDD API. You can use a normal `filter` then. – Miguel Mar 18 '18 at 00:02
  • Yes I already did that, however topByKey throws error here: value topByKey is not a member of org.apache.spark.rdd.RDD. I updated the original question to indicate I am doing this in Spark 1.6.2 and Scala 2.10.5 – Choix Mar 18 '18 at 00:12
  • You need to import the mllib class, and probably to add the dependency – Miguel Mar 18 '18 at 00:34
  • I did import the org.apache.spark.rdd.RDD and yet received that error. – Choix Mar 18 '18 at 01:48
  • It's not `org.apache.spark.rdd.RDD`, it is `org.apache.spark.mllib.rdd.MLPairRDDFunctions`. The class adds implicit conversions for key-value RDDs (`RDD[(A, B)]`). – Miguel Mar 18 '18 at 09:21
  • Thank you Miguel, but I have followed your suggestion. Kindly check out here: https://github.com/mdivk/175Scala/blob/master/video%20code/topByKey_question.scala – Choix Mar 18 '18 at 13:06
  • My bad, you have to import all the members of `MLPairRDDFunctions`: `import org.apache.spark.mllib.rdd.MLPairRDDFunctions._`. Also, keep in mind that if you call `topByKey` without the second group of parameters, the compiler will try to find an `Ordering` for your type. If you're working with tuples, that means it will sort all the tuple fields in ascending order. You'll need the ordering to sort by price in descending order. – Miguel Mar 18 '18 at 16:36
  • Thank you very much for pointing that out, what I get now is: (1,[Lscala.Tuple4;@1788be37), first: (Int, Array[(Int, Int, String, Float)]) = (1,Array((1,2,Quest Q64 10 FT. x 10 FT. Slant Leg Instant U,59.98))), can you help to get all the info extracted in the final result so as to be inspected? Thank you. – Choix Mar 18 '18 at 20:45
  • `topByKey` is returning you an `RDD[(Int, Array[(Int, Int, String, Float)])]`. That is, a Tuple2 containing the product category id and an array of the top 3 products by price. You can flatMap it to get the structure you want: `val flat = topByKey.flatMap { case (category, top3) => top3.map(top => (category, top._1,...)) }`. Finally, you can write it back to disk with `flat.saveAsTextFile("...")`. Keep in mind that this will write multiple files. If the number of categories is not too big, you can also get the data into the driver with `flat.collect()` and check it out. – Miguel Mar 18 '18 at 21:36
  • Thank you Miguel, there might be something missing in the whole procedure, the result is not correct. I have updated my finding and posted in here: https://github.com/mdivk/175Scala/blob/master/video%20code/topByKey_question.scala – Choix Mar 18 '18 at 21:45
  • The result variable is wrong. Your pattern matching with `Array` is only extracting the first element of the array. Keep in mind that a `map` takes a row of the RDD and returns another row (always). To unroll the array we need a `flatMap` (which returns 0 or more rows from the given one): `topWithKeysSorted.flatMap { case (k, top3) => top3.map { case (v1, v2, v3, v4) => (k, v1, v2, v3, v4) } }`. The flatMap function returns 3 rows per category, which come from the `top3.map(...)`. – Miguel Mar 18 '18 at 22:05
  • By the way, `topByKey` gives you the top N elements, but they're not guaranteed to be sorted, so you can sort them in this `flatMap`. – Miguel Mar 18 '18 at 22:10
  • I trust your suggestion will lead to the solution I want, but the pure RDD approach is really a bit confusing to me; thank you anyway. – Choix Mar 20 '18 at 01:17

I have found an easier (to me) solution with Spark SQL; here is the full script:

import sqlContext.implicits._
val productsRDD = sc.textFile("products").filter(x => x.split(",")(4) != "")
val productsRDDmap = productsRDD.map(a => (a.split(",")(0).toInt, a.split(",")(1).toInt, a.split(",")(2), a.split(",")(4).toFloat))
val productsRDDmapDF = productsRDDmap.toDF("product_id", "product_category_id", "product_name", "product_price")
productsRDDmapDF.registerTempTable("products")
// window functions in Spark 1.6 need a HiveContext-backed sqlContext
val query = """
    select product_id, product_category_id, product_name, product_price,
           row_number() over (partition by product_category_id order by product_price desc) as p_order
    from products"""
val result = sqlContext.sql(query)
result.where("p_order <= 3").show(200)
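
If you want just the three requested columns in the final output, you can project them after the window filter; a small sketch using the result DataFrame from above:

// keep only category, name, and price for the top-3 rows per category
val top3 = result.where("p_order <= 3")
    .select("product_category_id", "product_name", "product_price")
top3.show(200)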