Get top x% rows for every unique value in column by other column value

Question

Table "tags":

Source  Target      Weight
#003    blitzkrank  0.83
#003    deutsch     0.7
#003    brammen     0.57
#003    butzfrauen  0.55
#003    solaaaa     0.5
#003    moments     0.3
college scandal     1.15
college prosecutors 0.82
college students    0.41
college usc         0.33
college full house  0.17
college friends     0.08
college house       0.5
college friend      0.01

The table has 5.600.000 rows and ~91.000 unique entries in column "Source".

For every unique value in "Source" and "Target" I need the top x% rows (e.g. top 20%, top 30%, needs to be variable) by Weight (table is sorted by "Source" (ascending) and "Weight" (descending).

If rows have the same "Weight" take rows in alphabetical order.
If x% == 0 take at least one row.

Since there will be duplicates (e.g. "Source = "college" will yield at least one duplicate row as "Target" = "scandal") duplicated entries should be removed if possible. Otherwise not a big deal.

Calculation for "Source":

6 rows where Source = "#003", 6 * 0.2 = 1.2 = take 1 row
8 rows where Source = "college", 8 * 0.2 = 1.6 = take 2 rows

Needed result table for "Source":

Source  Target      Weight
#003    blitzkrank  0.83
college scandal     1.15
college prosecutors 0.82

How can I do that in SQL in an SQLite database?

"For every unique value in "Source" and "Target" I need the top x% rows " I don't understand what you mean. Do you mean for the "pair" of values? Can a row be repeated, once for source and once for target? — Gordon Linoff, Nov 27 '19 at 12:31
A row can be repeated once for source and once for target, yes. — Vega, Nov 27 '19 at 13:40

score 1 · Accepted Answer · answered Nov 27 '19 at 12:32

1

If you want a sample by source:

select t.*
from (select t.*,
             row_number() over (partition by source order by weight desc, target) as seqnum,
             count(*) over (partition by source) as cnt
      from t
     ) t
where seqnum = 1 or  -- always at least one row
      seqnum <= round(cnt * 0.2);

Based on your example, I think this is what you want. You can construct a similar query for target.

answered Nov 27 '19 at 12:32

Gordon Linoff

1,242,037
58
646
786

This gives the needed result, thank you very much. Is there a way to to the same for Column "Target" in the same SQL query so that it gets the rows for both in one run? – Vega Nov 27 '19 at 15:00
1

@Vega you can use a `UNION` of two queries (which also removes duplicates). – Shawn Nov 27 '19 at 15:11
Good to know, UNION takes 5,71 minutes, great speed up (took 1-2 days) compared to my solution in Python/Pandas. Thanks a lot again! – Vega Nov 27 '19 at 15:34
@Vega Adding indexes on the source and target columns (If they don't already exist) should help speed that up if it's something you need to run frequently. Or see the [sqlite3 shell's .expert command](https://www.sqlite.org/cli.html#index_recommendations_sqlite_expert_) for detailed index recommendations. – Shawn Nov 27 '19 at 16:36

Get top x% rows for every unique value in column by other column value

1 Answers1