I am trying to JOIN both tables
ON scores.updated_at_yyyy_mm = distributions.range_yyyy_mm
which of course works, but also LIMIT the number of rows returned from the scores table according to 'count' given in the distributions table, which unfortunately I cannot get to work. I tried subqueries, both - SELECT * FROM (SELECT) - and - JOIN(SELECT ...). Ideally, I could also get a random sample of scores - TABLESAMPLE ( ROWS) to the rescue?
I have seen a lot of solutions for MS SQL and Oracle, but could not find anything for Hive (maybe Hive 0.13 can do that - but I have not yet figured out how to upgrade the Hive which is underlying Shark).
Is such a thing possible in Hive (and Shark) at all, or am I going about this the completely wrong way? Many thanks for your help!!
btw - I am on Apache Shark 0.9.1 and followed to instructions to run Shark locally (ie not in a Hadoop/Hive cluster). And my tables are as follows
scores: distributions:
user_id | updated_at_yyyy_mm | score | range_yyyy_mm | count |
--------------------------------------- -----------------------
000001 | 2014-01 | 100 | 2014-01 | 2 |
000001 | 2014-02 | 103 | 2014-02 | 1 |
000001 | 2014-03 | 106 | 2014-03 | 3 |
000001 | 2014-04 | 102 | 2014-04 | 1 |
... | ... | ... |
000002 | 2014-01 | 107 |
000002 | 2014-02 | 104 |
000002 | 2014-03 | 105 |
000002 | 2014-04 | 105 |
... | ... | ... |
000003 | 2014-01 | 109 |
000003 | 2014-02 | 111 |
000003 | 2014-03 | 113 |
000003 | 2014-04 | 102 |
... | ... | ... |
and this is what I would like to get:
profiles:
user_id | updated_at_yyyy_mm | score |
--------------------------------------
000002 | 2014-01 | 107 |
000003 | 2014-01 | 109 |
000002 | 2014-02 | 104 |
000001 | 2014-03 | 106 |
000002 | 2014-03 | 105 |
000003 | 2014-03 | 113 |
000001 | 2014-04 | 102 |