0

I am trying to JOIN both tables

ON scores.updated_at_yyyy_mm = distributions.range_yyyy_mm

which of course works, but also LIMIT the number of rows returned from the scores table according to 'count' given in the distributions table, which unfortunately I cannot get to work. I tried subqueries, both - SELECT * FROM (SELECT) - and - JOIN(SELECT ...). Ideally, I could also get a random sample of scores - TABLESAMPLE ( ROWS) to the rescue?

I have seen a lot of solutions for MS SQL and Oracle, but could not find anything for Hive (maybe Hive 0.13 can do that - but I have not yet figured out how to upgrade the Hive which is underlying Shark).

Is such a thing possible in Hive (and Shark) at all, or am I going about this the completely wrong way? Many thanks for your help!!

btw - I am on Apache Shark 0.9.1 and followed to instructions to run Shark locally (ie not in a Hadoop/Hive cluster). And my tables are as follows

scores:                                       distributions:
user_id | updated_at_yyyy_mm | score |        range_yyyy_mm | count |
---------------------------------------       -----------------------
000001  | 2014-01            | 100   |          2014-01     |   2   |
000001  | 2014-02            | 103   |          2014-02     |   1   |
000001  | 2014-03            | 106   |          2014-03     |   3   |
000001  | 2014-04            | 102   |          2014-04     |   1   |
 ...    |   ...              | ...   |
000002  | 2014-01            | 107   |
000002  | 2014-02            | 104   |
000002  | 2014-03            | 105   |
000002  | 2014-04            | 105   |
 ...    |   ...              | ...   |
000003  | 2014-01            | 109   |
000003  | 2014-02            | 111   |
000003  | 2014-03            | 113   |
000003  | 2014-04            | 102   |
 ...    |   ...              | ...   |

and this is what I would like to get:

profiles:
user_id | updated_at_yyyy_mm | score |
--------------------------------------
000002  | 2014-01            | 107   |
000003  | 2014-01            | 109   |

000002  | 2014-02            | 104   |

000001  | 2014-03            | 106   |
000002  | 2014-03            | 105   |
000003  | 2014-03            | 113   |

000001  | 2014-04            | 102   |
juwalter
  • 11,472
  • 5
  • 19
  • 18

1 Answers1

0

Wrap distributions table in a subquery with count or row_number calculated, join with scores and calculate row_number as rn (distributed by updated_at_yyyy_mm) and then in the upper subquery filter result rows: where rn<= scores.cnt.

leftjoin
  • 36,950
  • 8
  • 57
  • 116