Here is the data. The table structure is:
CREATE TABLE products (
  product_id int(11) NOT NULL AUTO_INCREMENT,
  product_category_id int(11) NOT NULL,
  product_name varchar(45) NOT NULL,
  product_description varchar(255) NOT NULL,
  product_price float NOT NULL,
  product_image varchar(255) NOT NULL,
  PRIMARY KEY (product_id)
) ENGINE=InnoDB AUTO_INCREMENT=1346 DEFAULT CHARSET=utf8
Updated: my environment is Spark 1.6.2 and Scala 2.10.5
I want to get an RDD sorted by product_name asc, product_price desc.
I know how to sort the RDD with both fields ascending:
val p = sc.textFile("products")
val fp = p.filter(r=>r.split(",")(4) !="")
val mfp = fp.map(r=>(r.split(",")(4).toFloat, r)).sortByKey(false).map(r=> (r._2.split(",")(4), r._2.split(",")(2)))
Now I have only the two fields: product_price and product_name.
I can do the sorting:
mfp.sortBy(r=>(r._1, r._2))
This gives me the result sorted by name and then by price, both ascending:
(10,159.99)
(10,159.99)
(10,169.99)
(10,1799.99)
(10,189.0)
(10,199.98)
(10,199.99)
(10,199.99)
(10,1999.99)
(10,269.99)
What I need is (product_category_id, product_name, product_price), sorted by product_category_id ascending and then by product_price descending.
And I only want the top 3 products per product_category_id.
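To make the requirement concrete, here is a minimal sketch of the intended logic on plain Scala collections. It assumes the file has one comma-separated line per row in the column order of the CREATE TABLE statement above, and that no field contains a comma. The names TopProducts, Product, parse, sortKey and topPerCategory are hypothetical helpers for illustration, not part of Spark.

```scala
// Sketch only: all names below are hypothetical helpers, not a Spark API.
object TopProducts {
  // The three fields the question asks for.
  case class Product(categoryId: Int, name: String, price: Float)

  // Parse one line, assuming the column order
  // id, category_id, name, description, price, image.
  // Lines with too few columns or non-numeric values are skipped.
  def parse(line: String): Option[Product] = {
    val f = line.split(",", -1)
    if (f.length < 5 || f(4).trim.isEmpty) None
    else scala.util.Try(Product(f(1).trim.toInt, f(2), f(4).trim.toFloat)).toOption
  }

  // Category ascending, then price descending:
  // negating the numeric price reverses its order inside the tuple key.
  def sortKey(p: Product): (Int, Float) = (p.categoryId, -p.price)

  // Keep the n most expensive products of each category,
  // then order the result by the composite key.
  def topPerCategory(products: Seq[Product], n: Int): Seq[Product] =
    products
      .groupBy(_.categoryId)
      .values
      .flatMap(_.sortBy(p => -p.price).take(n))
      .toSeq
      .sortBy(sortKey)
}
```

The price is compared as a Float here, not as a String, so 1799.99 ranks above 199.99. On an RDD the same idea could probably be written as p.flatMap(l => TopProducts.parse(l)).groupBy(_.categoryId).flatMap { case (_, ps) => ps.toSeq.sortBy(x => -x.price).take(3) }.sortBy(TopProducts.sortKey), but I have not verified that against Spark 1.6.2.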