I am currently trying to prototype a product recommendation system using the Elasticsearch Significant Terms aggregation. So far, I haven't found a good example that deals with "flat" JSON structures of sales (here identified by the itemId) coming from a relational database, such as mine:

Document 1

{
    "lineItemId": 1,
    "lineNo": 1,
    "itemId": 1,
    "productId": 1234,
    "userId": 4711,
    "salesQuantity": 2,
    "productPrice": 0.99,
    "salesGross": 1.98,
    "salesTimestamp": 1234567890
}

Document 2

{
    "lineItemId": 1,
    "lineNo": 2,
    "itemId": 1,
    "productId": 1235,
    "userId": 4711,
    "salesQuantity": 1,
    "productPrice": 5.99,
    "salesGross": 5.99,
    "salesTimestamp": 1234567890
}

I have around 1.5 million of these documents in my Elasticsearch index. A lineItem is part of a sale (identified by itemId), which can consist of one or more lineItems. What I would like to receive is, say, the 5 most uncommonly common products which were bought in conjunction with the sale of one specific productId.

The MovieLens example (https://www.elastic.co/guide/en/elasticsearch/guide/current/_significant_terms_demo.html) deals with data in the structure of

{
    "movie": [122,185,231,292,
              316,329,355,356,362,364,370,377,420,
              466,480,520,539,586,588,589,594,616
    ],
    "user": 1
}

so it's unfortunately not directly applicable to my data. I'd be very glad for an example or a suggestion that works with my "flat" structures. Thanks a lot in advance.

Tobi
  • Is your `sale` an `object` or `nested`? In any case, have you already tried the obvious: `{ "query": { "filtered": { "filter": { "term": { "sale.productId": 1235 } } } }, "aggs": { "most_sig": { "significant_terms": { "field": "sale.productId", "size": 6 } } } }`? – Andrei Stefan Jun 08 '15 at 21:50
  • If `sale` is `object` (thus a flat array of values) it should work as is. If it's `nested` I think you would need a `"include_in_parent": true` and use the same query. – Andrei Stefan Jun 08 '15 at 22:17
  • @AndreiStefan Thanks a lot for your comments. Unfortunately, the `lineItems` are neither nested nor in a parent-child object structure. The documents are in one index as described above. I understand I'd need to aggregate based on the `itemId`, because I want the products which are uncommonly common among those bought together in one `sale` (`itemId`). – Tobi Jun 09 '15 at 07:46
  • Something is still not clear: same `itemId` means a bundle of products that have been bought together, right? Bundle of products means different `productId` for the same `itemId`. And given one `productId` you want to find the uncommonly common `productId`s different from the initial `productId` that were bought together. Do I understand this right? – Andrei Stefan Jun 09 '15 at 14:55
  • @AndreiStefan That's remarkably correct :-) Yes! – Tobi Jun 09 '15 at 15:14
  • Since I don't have the amount of data that you do, try this: **1.** get the list of `itemId`s for bundles that contain a certain `productId` that you want to find "stuff" for: `{ "query": { "filtered": { "filter": {"term": { "productId": 1234 }} } }, "fields": ["itemId"] }`. – Andrei Stefan Jun 09 '15 at 15:25
  • Then **2.** using this list create this query: `GET /sales/sales/_search?search_type=count { "query": { "filtered": { "filter": { "terms": { "itemId": [your_itemIDs_here_separated_by_commas] } } } }, "aggs": { "most_sig": { "significant_terms": { "field": "productId", "size": 0 } } } }` – Andrei Stefan Jun 09 '15 at 15:26
  • @AndreiStefan Thanks a lot for your suggestion. I think there is a slight problem, because the first query can potentially return (tens of) thousands of `itemId`s. I think ES has a default limit of 1024 terms, but it's configurable: http://stackoverflow.com/questions/26642369/max-limit-on-the-number-of-values-i-can-specify-in-the-ids-filter-or-generally-q – Tobi Jun 09 '15 at 15:35
  • If you can give it a try I'd be curious. Either way, you would need that list of IDs I think. – Andrei Stefan Jun 09 '15 at 16:36
  • Did you get the chance to test this? – Andrei Stefan Jun 10 '15 at 21:00
  • Not yet, sorry... I hope I will have time today. – Tobi Jun 11 '15 at 08:10

3 Answers

It sounds like you're trying to build an item-based recommender. Apache Mahout has tools to help with collaborative filtering (formerly the Taste project).

There is also a Taste plugin for Elasticsearch 1.5.x which I believe can work with data like yours to produce item-based recommendations.

(Note: This plugin uses Rivers which were deprecated in Elasticsearch 1.5, so I'd check with the authors about plans to support more recent versions of Elasticsearch before adopting this suggestion.)

Peter Dixon-Moses

Since I don't have the amount of data that you do, try this:

  1. get the list of itemIds for bundles that contain a certain productId that you want to find "stuff" for:
{
  "query": {
    "filtered": {
      "filter": {
        "term": {
          "productId": 1234
        }
      }
    }
  },
  "fields": [
    "itemId"
  ]
}

Then

  2. using this list, create this query:
GET /sales/sales/_search?search_type=count
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "itemId": [1,2,3,4,5,6,7,11]
        }
      }
    }
  },
  "aggs": {
    "most_sig": {
      "significant_terms": {
        "field": "productId",
        "size": 0
      }
    }
  }
}
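
The two steps above can be sketched in Python as plain request-body builders plus response parsing. This is only a sketch under assumptions: the function names are hypothetical, the step 1 response is assumed to have the Elasticsearch 1.x shape where each hit carries `fields.itemId` as an array, and the step 2 response is assumed to expose `aggregations.most_sig.buckets` with `key` and `score`. No client library is used, so sending the bodies to the cluster is left out.

```python
def build_item_id_query(product_id):
    """Step 1 body: line items that contain the seed product, returning only itemId."""
    return {
        "query": {"filtered": {"filter": {"term": {"productId": product_id}}}},
        "fields": ["itemId"],
    }


def extract_item_ids(step1_response):
    """Collect the distinct itemIds from a step 1 response (assumed ES 1.x shape)."""
    item_ids = set()
    for hit in step1_response["hits"]["hits"]:
        # With "fields", each requested field is assumed to come back as a list.
        item_ids.update(hit["fields"]["itemId"])
    return sorted(item_ids)


def build_significant_terms_query(item_ids):
    """Step 2 body: significant productIds among all line items of those sales."""
    return {
        "query": {"filtered": {"filter": {"terms": {"itemId": list(item_ids)}}}},
        "aggs": {
            "most_sig": {"significant_terms": {"field": "productId", "size": 0}}
        },
    }


def top_co_purchased(step2_response, seed_product_id, limit=5):
    """Return up to `limit` productIds by score, skipping the seed product itself."""
    buckets = step2_response["aggregations"]["most_sig"]["buckets"]
    others = [b for b in buckets if b["key"] != seed_product_id]
    others.sort(key=lambda b: b["score"], reverse=True)
    return [b["key"] for b in others[:limit]]
```

Note that a search returns only 10 hits by default, so step 1 would need a larger `size` or a scroll to see every matching line item. The seed product is filtered out on the client side because it appears in every matching sale and would otherwise tend to dominate the result. The `filtered` query and `"size": 0` inside the aggregation are kept as written in this answer; later Elasticsearch versions replaced `filtered` with a `bool` query and its `filter` clause.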
Andrei Stefan

If I understand correctly you have a doc per order line item. What you want is a single doc per order. The Order doc should have an array of productIds (or an array of line item objects that each include a productId field).

That way, when you query for orders containing product X, the sig_terms aggregation should find that product Y is uncommonly common in these orders.
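
A minimal sketch of that restructuring in Python, assuming the flat line items can be read into memory (or streamed per sale) before reindexing. The function names and the `productIds` field name are hypothetical, and the query reuses the `filtered` syntax from the other answer in this thread.

```python
from collections import defaultdict


def build_order_docs(line_items):
    """Group flat line-item docs into one doc per sale (itemId)."""
    products_by_sale = defaultdict(set)
    for item in line_items:
        products_by_sale[item["itemId"]].add(item["productId"])
    # One document per sale, holding the array of distinct productIds.
    return [
        {"itemId": item_id, "productIds": sorted(products)}
        for item_id, products in sorted(products_by_sale.items())
    ]


def build_order_query(product_id, size=6):
    """Orders containing the product, with significant co-purchased productIds."""
    return {
        "query": {"filtered": {"filter": {"term": {"productIds": product_id}}}},
        "aggs": {
            "most_sig": {
                "significant_terms": {"field": "productIds", "size": size}
            }
        },
    }


line_items = [
    {"lineNo": 1, "itemId": 1, "productId": 1234},
    {"lineNo": 2, "itemId": 1, "productId": 1235},
]
order_docs = build_order_docs(line_items)
```

With the two example documents from the question, `order_docs` contains a single order whose `productIds` array is `[1234, 1235]`. The default `size` of 6 is an assumption that the queried product itself may show up among the results, leaving 5 other products.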

MarkH