1

I have a table that contains a column with categorial data (let's call it 'category') and another column that contains arbitrary strings (let's call it 'text').

Now i would like to see few examples of text for every category, in other words: For each category show first 5 values of text. The text values don't need to be sorted by any criteria (i don't want the top 5 longest strings or anything like that), just the first encountered (random) ones.

The result should look something like this:

+----------+------------------+
| category | text             |
+----------+------------------+
| cat A    | random string 1  |
| cat A    | random string 2  |
| cat A    | random string 3  |
| cat A    | random string 4  |
| cat A    | random string 5  |
| cat B    | random string 6  |
| cat B    | random string 7  |
| cat B    | random string 8  |
| cat B    | random string 9  |
| cat B    | random string 10 |
| cat C    | random string 11 |
| ...      | ...              |

I cannot afford quadratic complexity, since the table is really huge, tens of milions of rows.

I also would like to avoid unioning subqueries for each category manually

select ... where category = catA
union 
select ... where category = catB
union
select ... where category = catC
union
...

because there is like 50 different categories.

And to the top of it, i cannot use any database-specific featues, because my database isn't any of the traditional relational databases, but an Apache Hive, that only supports basic SQL syntax.

A lot of strict requirements, i know. But if someone knew how to solve this in plain SQL, that would help me a lot. Thanks.

Youda008
  • 1,788
  • 1
  • 17
  • 35
  • @TimBiegeleisen This is NOT a duplicate. All those other questions ask for TOP 10 entries, that means ordered by certain criteria, while i require arbitrary 10 entries. Their solution requires sorting the results for each category, which increases complexity. I explain sufficiently in the question how it is different from the previous questions, please remove the duplicate mark. – Youda008 Sep 18 '19 at 11:25
  • @Youda008 Try Gordon's answer but use `ORDER BY RAND()` inside the call to `ROW_NUMBER`. I didn't see that you wanted random entries, nor did I see that you are using Hive, sorry. – Tim Biegeleisen Sep 18 '19 at 12:58

2 Answers2

3

You can use row_number() in most databases include Hive. For 10 examples per category, for instance:

select t.*
from (select t.*,
             row_number() over (partition by category order by category) as seqnum
      from t
     ) t
where seqnum <= 10;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • The OP seems to want `ORDER BY RAND()`. – Tim Biegeleisen Sep 18 '19 at 12:58
  • 1
    @TimBiegeleisen . . . "The text values don't need to be sorted by any criteria (i don't want the top 5 longest strings or anything like that), just the first encountered (random) ones." The OP appears to want arbitrary values, but `rand()` would definitely be acceptable. – Gordon Linoff Sep 18 '19 at 13:09
  • This one worked and it is sufficiently fast, thanks. – Youda008 Sep 19 '19 at 13:42
1

The general answer is: it cannot be done without support for some sort of procedural language and optimised GROUP BY operations.

If a database system supports fast SELECT category FROM thetable GROUP BY category, and a procedural language to loop over the values, then you can loop over the result of the GROUP BY and add the result of SELECT * FROM thetable WHERE category = categoryvalue LIMIT 5 in each iteration.

SQL support in Apache Hive includes LIMIT in unsorted SELECT statements.

There is also HPL/SQL http://www.hplsql.org/udf-sproc for SQL CREATE PROCEDURE.

fredt
  • 24,044
  • 3
  • 40
  • 61