2

Suppose I have documents in this format:

product_name TEXT tags TAG score NUMERIC 

[product1, [tag1, tag2, tag3], 10]
[product2, [tag2, tag3, tag4], 100]
....

I want a query to return the tags in the order of the highest sum of product score and also the top 5 of the products for each tag:

[tag3, 110, [product2, product 1]]
[tag2, 110, [product2, product 1]]
[tag4, 100, [product2]]
[tag1, 10, [product 1]]

What I have so far is storing each product/tag key separately (repeated for each tag) so for each product we have one separate doc for each tag and the id is combination of product name and tag: product_name TEXT tag TAG score NUMERIC. Now I can run an aggregate query to get the list of the top tags:

FT.AGGREGATE product_tags * 
   GROUP BY 1 @TAG 
     REDUCE SUM 1 @score as total_score
   SORT BY 2 @total_score DESC

This will give me the top tags in order but if I want to get top 5 products for each tag I found there is only REDUCE TOLIST 1 @product_name which will return all the products not sorted and there is REDUCE FIRST_VALUE 4 @product_name BY @score DESC which will return only the first top product.

Is there any way to get let's say 5 top products for each tag in one query. If not is it possible to change the document storage format (or add additional one) in a way to make this kind of query possible or with as little queries as possible?

Shouldn't matter but I am using python Redisearch client.

1 Answers1

3

First:

  • Make sure to disable features you won't use (NOOFFSETS, NOHL, NOFREQS, STOPWORDS 0)
  • Use SORTABLE for your NUMERIC score.

Here is the schema I used to test:

FT.CREATE product_tags NOOFFSETS NOHL NOFREQS STOPWORDS 0
    SCHEMA product_name TEXT tags TAG score NUMERIC SORTABLE

You want to think of FT.AGGREGATE as a pipeline.

The first step will be to sort the products by @score, so that later, down in the pipeline, when we REDUCE TOLIST 1 @product_name, the list comes out sorted:

SORTBY 2 @score DESC

I think you are already doing LOAD/APPLY to deal with the tags, as TAG fields would otherwise be grouped by the full comma-separated string tags-list, per product. See Allow GROUPBY on tag fields issue. So our next step is in the pipeline is:

LOAD 1 @tags 
APPLY split(@tags) as TAG 

We then group by @TAG, and apply the two reductions. Our products list will come out sorted.

GROUPBY 1 @TAG
    REDUCE SUM 1 @score AS total_score
    REDUCE TOLIST 1 @product_name AS products

Finally, we sort by @total_score:

SORTBY 2 @total_score DESC

Here a final view of the command:

FT.AGGREGATE product_tags *
    SORTBY 2 @score DESC 
    LOAD 1 @tags 
    APPLY split(@tags) as TAG
    GROUPBY 1 @TAG
        REDUCE SUM 1 @score AS total_score 
        REDUCE TOLIST 1 @product_name AS products
    SORTBY 2 @total_score DESC

Here a full list of commands to illustrate the result. I used productXX with score XX to easily verify visually the sorting of products.

> FT.CREATE product_tags NOOFFSETS NOHL NOFREQS STOPWORDS 0 SCHEMA product_name TEXT tags TAG score NUMERIC SORTABLE
OK
> FT.ADD product_tags pt:product10 1 FIELDS product_name product10 tags tag2,tag3,tag4 score 10
OK
> FT.ADD product_tags pt:product1 1 FIELDS product_name product1  tags tag1,tag2,tag3 score 1
OK
> FT.ADD product_tags pt:product100 1 FIELDS product_name product100 tags tag2,tag3 score 100
OK
> FT.ADD product_tags pt:product5 1 FIELDS product_name product5 tags tag1,tag4 score 5
OK
> FT.SEARCH product_tags *
1) (integer) 4
2) "pt:product5"
3) 1) "product_name"
   2) "product5"
   3) "tags"
   4) "tag1,tag4"
   5) "score"
   6) "5"
4) "pt:product100"
5) 1) "product_name"
   2) "product100"
   3) "tags"
   4) "tag2,tag3"
   5) "score"
   6) "100"
6) "pt:product1"
7) 1) "product_name"
   2) "product1"
   3) "tags"
   4) "tag1,tag2,tag3"
   5) "score"
   6) "1"
8) "pt:product10"
9) 1) "product_name"
   2) "product10"
   3) "tags"
   4) "tag2,tag3,tag4"
   5) "score"
   6) "10"
> FT.AGGREGATE product_tags * SORTBY 2 @score DESC LOAD 1 @tags APPLY split(@tags) as TAG GROUPBY 1 @TAG REDUCE SUM 1 @score AS total_score REDUCE TOLIST 1 @product_name AS products SORTBY 2 @total_score DESC
1) (integer) 4
2) 1) "TAG"
   2) "tag2"
   3) "total_score"
   4) "111"
   5) "products"
   6) 1) "product100"
      2) "product10"
      3) "product1"
3) 1) "TAG"
   2) "tag3"
   3) "total_score"
   4) "111"
   5) "products"
   6) 1) "product100"
      2) "product10"
      3) "product1"
4) 1) "TAG"
   2) "tag4"
   3) "total_score"
   4) "15"
   5) "products"
   6) 1) "product10"
      2) "product5"
5) 1) "TAG"
   2) "tag1"
   3) "total_score"
   4) "6"
   5) "products"
   6) 1) "product5"
      2) "product1"

You are getting the full list of products sorted, not just the top 5. Complexity-wise it makes no difference, we paid the price. The impact is in buffering, network payload, and your client.

You can limit to top 5 using a Lua script:

eval "local arr = redis.call('FT.AGGREGATE', KEYS[1], '*', 'SORTBY', '2', '@score', 'DESC', 'LOAD', '1', '@tags', 'APPLY', 'split(@tags)', 'as', 'TAG', 'GROUPBY', '1', '@TAG', 'REDUCE', 'SUM', '1', '@score', 'AS', 'total_score', 'REDUCE', 'TOLIST', '1', '@product_name', 'AS', 'products', 'SORTBY', '2', '@total_score', 'DESC') \n for i=2,(arr[1]+1) do \n arr[i][6] = {unpack(arr[i][6], 1, ARGV[1])} \n end \n return arr" 1 product_tags 5

Here a friendly view of the Lua script above:

local arr = redis.call('FT.AGGREGATE', KEYS[1], ..., 'DESC')
for i=2,(arr[1]+1) do 
    arr[i][6] = {unpack(arr[i][6], 1, ARGV[1])}
end
return arr

We are passing one key (the index) and one argument (the limit for top products, 5 in your case): 1 product_tags 3.

With this, we limited the impact to buffering only, saved network payload and load on your client.

LeoMurillo
  • 6,048
  • 1
  • 19
  • 34
  • 1
    You are getting the full list of products sorted, not just the top 5. Complexity-wise it makes no difference, we paid the price. The impact is buffering and network payload I guess. – LeoMurillo Dec 31 '19 at 07:00
  • Thanks, this is great! Didn't know about the sorting and splitting of the tags. I wish there was an aggregate function for returning top N. I am concerned about returning the whole list though cause of memory/network consumption because some tags might be in most of the products and it is like returning all product names let's say if it's over 1000000 so it will not be practical. – user12177990 Dec 31 '19 at 23:31
  • @user12177990 added Lua script in the answer to limit to top 5 in the response. Still paying in memory but saving network payload and your client burden. Let's see if with this I get you to accept the answer :-) – LeoMurillo Jan 02 '20 at 07:45
  • Thanks for digging into this but I think my team would not accept a PR with this many hacks so need to find another solution or redesign this. I created an issue here: https://github.com/RediSearch/RediSearch/issues/1035 – user12177990 Jan 02 '20 at 16:40
  • @user12177990 - you're hard to please :-) Lua scripts are not a hack. Lua is to Redis as T-SQL is to SQL Server. So here my last two cents: You can keep a set with the tags, and a sorted set per tag in which every product is added with its score. You can get the score sum per tag as in https://stackoverflow.com/questions/4846243/redis-sum-of-scores-in-sorted-set, the Lua script could walk through each tag, get the sum and first five. Since it does it per tag, you don't overload server memory. – LeoMurillo Jan 02 '20 at 17:29