
I'm developing a sort of personalized search engine in Ruby on Rails, and I'm currently trying to find the best way to sort results based on the user's record, in real time.

Example: the items being searched can have tags (separate entities with ids); for example, an item has tags=[1, 5, 10, 23, 45].

The user, on the other hand, may have flagged some tags as being of special interest, so say the user has tags=[5, 23].

The score used for sorting the results should take into account the number of the item's tags that are "starred" by the user. For example, an item's score would be 50% based on the item's own attributes and 50% on a rank derived from the user's record (the number of its tags the user has starred).
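To make the blend concrete, here is a minimal sketch of that scoring rule; the method name and the 0.0-1.0 range of the item's own score are my assumptions, not anything fixed by the question:

```ruby
require "set"

# Hypothetical scoring sketch: blends a precomputed item score (assumed
# normalized to 0.0-1.0) with the fraction of the item's tags that the
# user has starred, weighted 50/50 as described above.
def personalized_score(item_tags, user_tags, item_score)
  starred  = Set.new(user_tags)
  overlap  = item_tags.count { |t| starred.include?(t) }
  tag_rank = item_tags.empty? ? 0.0 : overlap.to_f / item_tags.size
  0.5 * item_score + 0.5 * tag_rank
end
```

With the example above, an item tagged [1, 5, 10, 23, 45] and a user who starred [5, 23] gives a tag rank of 2/5 = 0.4, which is then averaged with the item's own score.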

One idea was to inject this into the sorting function of the information retrieval system. But in Sphinx, which I will probably use, this would be very awkward to implement (when the user's vector is large). I don't know about Lucene/Solr, but they don't seem to have the advanced non-text search capabilities I need anyway (distance, date, time, etc.).

The other option is to retrieve an intermediate result set from the IR system and then process it at the application level. However, I'm quite sure that processing 100-1000 records sequentially and then sorting them in Rails would be very slow.

On the other hand, this seems like a task that can easily be parallelized: divide the 1000 records into sets that are processed by separate threads and then sorted.
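The fan-out idea can be sketched with plain Ruby threads; the method name, record shape (`{id:, tags:}`), and thread count are all illustrative assumptions:

```ruby
require "set"

# Illustrative sketch: split the candidate records into slices, score
# each slice on its own thread (here the score is just the starred-tag
# overlap count), then merge the partial results and sort.
def parallel_rank(items, user_tags, threads: 4)
  return [] if items.empty?
  starred = Set.new(user_tags)
  slices  = items.each_slice((items.size / threads.to_f).ceil).to_a
  scored  = slices.map { |slice|
    Thread.new do
      slice.map { |item| [item[:id], item[:tags].count { |t| starred.include?(t) }] }
    end
  }.flat_map(&:value)  # Thread#value joins the thread and returns its result
  scored.sort_by { |_, score| -score }
end
```

Note that on MRI the GIL limits CPU-bound speedup, so for pure in-memory counting like this the threaded version may not actually beat the sequential loop.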

I've read about several map/reduce implementations, both general-purpose ones like Hadoop and Rails-specific ones like Skynet, but they seem to be best suited for large batch jobs, not real-time processing (unless I'm mistaken?).

Is there any in-memory, lightweight MR implementation that I could use for this? Or do you have some other ideas for how to handle it?

(Sidenote: I believe this setup is similar to how Google News works, from what I understand of the "Google News personalization: scalable online collaborative filtering" paper. It matches, in real time, a set of candidate stories against the set of clusters the user belongs to (pre-calculated earlier) in order to sort stories in a personalized way.)

Otigo

1 Answer


Map/Reduce is great for this kind of thing, but you can also deal with it in SQL using an intermediate table.

Presumably, you already have tables like this:

users (id, ...)
items (id, ...)
tags (id, ...)
users_tags (user_id, tag_id)
items_tags (item_id, tag_id)

So, why don't you also maintain a table like this:

users_items_tags (user_id, item_id, tag_id)

where each row means "this user and this item share this tag".

Then, your search query is something like this:

  select item_id, count(tag_id) as score
    from users_items_tags
   where user_id = <USER_ID>
group by item_id
order by score desc
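To see in plain Ruby what that query computes, here is an in-memory stand-in that groups the user's rows by item and counts shared tags; the row shape mirrors users_items_tags, and the method name is just illustrative:

```ruby
# In-memory illustration of the GROUP BY query above: keep only the
# given user's rows, group them by item, and use the row count per
# item (i.e. the number of shared tags) as the score.
def rank_for_user(rows, user_id)
  rows.select { |r| r[:user_id] == user_id }
      .group_by { |r| r[:item_id] }
      .map { |item_id, shared| [item_id, shared.size] }
      .sort_by { |_, score| -score }
end
```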

When a user adds a tag, users_items_tags is updated like so:

insert into users_items_tags (user_id, item_id, tag_id)
     select <USER_ID>, item_id, <TAG_ID>
       from items_tags
      where tag_id = <TAG_ID>

and likewise when a tag is added to an item. When a tag is removed, just delete the rows matching that tag and the user (or item).
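The "user stars a tag" maintenance step can be mirrored in memory like so; this is a stand-in for the INSERT ... SELECT above, with arrays playing the role of the tables and an illustrative method name:

```ruby
# Sketch of maintaining users_items_tags when a user stars a tag:
# copy every (item, tag) pairing for that tag into the join data,
# one row per item that carries the tag.
def user_stars_tag(users_items_tags, items_tags, user_id, tag_id)
  items_tags.select { |r| r[:tag_id] == tag_id }.each do |r|
    users_items_tags << { user_id: user_id, item_id: r[:item_id], tag_id: tag_id }
  end
  users_items_tags
end
```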

This solution has a few problem cases. If a particular tag is common amongst items, then a lot of writes will be performed when a user adds that tag, and vice versa. If a tag is common amongst both items and users, the table will grow very large. You'll have to weigh these cases against your particular dataset.

jedediah
  • Thx, that's one possibility; however, since a user may well aim to have 20-50% of items recommended to him, and the number of tags is intentionally limited, this would result in quite a massive amount of data. – Otigo Dec 09 '08 at 20:49