4

EDIT: I know it's been over a year, but I finally got something new to this problem. To see an update for this look at this question: Rails 3 user matching-algorithm to SQL Query (COMPLICATED)

I'm working on a site where users are matched based on answered questions.

The match percentage is calculated each time a user, for example, visits another users profile page. So the matching percentage is not stored in the database and is recalculated all the time.

Now I want to build in a search where users can search for their best match.

The question I have is, what is the most efficient way to do this?

What if I have 50k users and I have to list them ordered by match percentages. Do I have to calculate each matching percentage between one and the other 50k users and then create a list out of that? Sounds kind of inefficient to me. Wouldn't that slow down the app drastically?

I hope someone can help me with this, because this gives me kind of a headache.

EDIT: To clear things up a bit, here is my database model for user, questions, answers, user_answers and accepted_answers:

    Tables:
    Users(:id, :username, etc.)
    Questions(:id, :text)
    Answers(:id, :question_id, :text)
    UserAnswers(:id, :user_id, :question_id, :answer_id, :importance)
    AcceptedAnswers(:id, :user_answer_id, :answer_id)

    Questions <-> Answers: one-to-many
    Questions <-> UserAnswers: one-to-many
    Users <-> UserAnswers: one-to-many
    UserAnswers <-> AcceptableAnswers: one-to-many

So there is a list of Questions(with possible answers to this question) and Users give their "UserAnswers" to those questions, assign how important that question is to them and what answers they accept from other users.

Then if you take User1 and User2, you look for common answered questions, so UserAnswers where the question_id is the same. They have 10 questions in common. User1 gave the importance value 10 to the first five questions and the importance value 20 to the other five. User 2 gave acceptable answers to two 20 value and three 10 value questions. A total of 70 points. The highest reachable pointscore is of course 20x5 + 10x5... So User2 reached 70/150 * 100 = 46,66% ... The same thing is done the other way around for how much User1 reached of User2's assigned points to those questions. Those 2 percentages are then combined through the geometric mean: sqrt of percentage1 * percentage2 ... this gives the final match percentage

Community
  • 1
  • 1
Mexxer
  • 1,699
  • 5
  • 27
  • 40
  • This is a complicated Information Retrieval problem. Calculating the matching percentage every time will probably never be fast enough. Examining exactly how you are calculating the matching percentage would help a great deal. – Alex Aug 19 '11 at 18:55
  • it's calculated like this: a user answers a question and chooses what answers he accepts from other users, and how important that question is to him. The importances have different kind of values. Now there is User1 and User2. User1 answered 10 questions User2 also answered. But User1 only gave 5 answers that were acceptable for User2. The importance values User 2 gave those 5 questions are added and divided by the importance values of all 10 common questions multiplied by 100 ... so a simple percentage calculation. The same is also done the other way around. – Mexxer Aug 19 '11 at 19:08
  • That gives me 2 percentage numbers for User1 -> User2 and User1 <- User2 ... the geometric mean is applied to both which gives me one average match percentage for 2 users – Mexxer Aug 19 '11 at 19:10
  • Are you storing the answers to the questions in a database? – Alex Aug 19 '11 at 19:11
  • Yup everything is stored in the database ... except the calculation of course – Mexxer Aug 19 '11 at 19:12
  • 1
    If I were you I would create an index which will store the percentage relation of every user to every other user. Create a seperate applicaiton that runs every 5 minutes and recreates this index, then in your rails code just simply sort the list and return the results. I do not see a solution that will allow you to calculate all the percentages "on the fly" – Alex Aug 19 '11 at 19:14
  • that would result into 2,5 billion percentages that are updated every 5 minutes. Doesn't sound that prettier haha – Mexxer Aug 19 '11 at 19:20
  • 1
    Your rails app will be much faster. Your sorting 50,000 percentages, not calculating 50,000 percentages. You only need to recalculate percentages when things have been updated. So 2.5 billion for the first time, then very few later on. – Alex Aug 19 '11 at 19:26
  • That's true ... that could actually work. Thanks, I'm gonna try that out. – Mexxer Aug 19 '11 at 19:28
  • 1
    Very interesting problem you have. I was thinking about the solution and got further interested. As far as I can think, you can not have final percentages stored somewhere, as when ever a new question or answer comes in you would not be able to change the percentages directly(involves geometric mean). You would first need to change the sum of importance and recalculate the percentages from there on. I have some half baked sqls, but far from an acceptable solution. Keep us posted! – rubish Aug 19 '11 at 21:25
  • 2
    @Mexxer: gisted a probable solution mostly doing all calculations in db at https://gist.github.com/1158234 – rubish Aug 19 '11 at 22:56
  • @rubish: updated my post whith my model and calculation explanation. – Mexxer Aug 20 '11 at 14:31
  • How do you maintain which answer is from which user? probably there should be a user_id in UserAnswer – rubish Aug 21 '11 at 07:19
  • oh yea right, I forgot to put that in – Mexxer Aug 21 '11 at 12:14
  • @Mexxer updated the gist. Let me know what do you think. – rubish Aug 21 '11 at 13:54
  • Took me some time to get through your code haha. It's been a while since I've looked at more complicated SQL code. But this actually looks really good. Letting the database do the calculation is probably alot faster. But I wonder if it's still fast enough to execute a search for the best match if there are 50k users and upwards. – Mexxer Aug 21 '11 at 14:37
  • If I store the final percentages in the database I probably user more CPU than your solution, but the search is probably alot faster. – Mexxer Aug 21 '11 at 14:48
  • Or does this really slow the performance THAT much down, if I'm storing the final percentages and letting them calculate indirectly? Sadly this is my first big scale project and I'm not an expert on database performance issues. – Mexxer Aug 21 '11 at 14:58
  • @Mexxer: You would first need to try out the queries with real data, It might perform okay, or it might be slow. In case it is slow, first priority should be to optimize the query using indexes, if that doesn't help, one should look at changing the data structure for better performing queries and easily indexable data. If that fails, you have to find other alternatives. – rubish Aug 21 '11 at 15:13
  • @Mexxer let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/2714/discussion-between-rubish-and-mexxer) – rubish Aug 21 '11 at 15:17

2 Answers2

1

I suggest you keep the match percentage of all the users in your database. Create a table matches that has match percentage for a pair of users. You do not need to save match percentage for all the pairs of users in your database. A valid match percentage is calculated for two users only when any one of have them has accepted an answer from other user. Most of the users will not accept the answers of most of other users.

I will suggest you to calculate and save the match percentage not at the time when a user visits another users profile. But when a user accepts another users answers. This will make sure that you do not make any unnecessary calculation and match percentage for a pair of users is always fresh.

Waseem
  • 8,232
  • 9
  • 43
  • 54
  • Well not quite ... The system looks for common answered questions first ... then calculates the percentage of how much importance value one user reached from another user. Let's say User1 and User2 have 10 common questions and User2 only gave one acceptable answer to those 10 questions according to User1's settings which answers he accepts. And User1 gave the value 10 to every one of those questions. That means User2 reached 10% match to User1. The same thing is done the other way around. – Mexxer Aug 20 '11 at 13:48
  • The thing is... every user somehow probably gave an answer to a question other users also accept. This still results in a match percentage calculation for every single user pair – Mexxer Aug 20 '11 at 13:49
  • 1
    I think the first comments to my question would still work the best. Split it up in 2 apps ... have a match table like you said, let the second app to all the calculation ... if one user answers a new question then the system looks for all users who answered the same question and recalculates the match percentage for those user pairs. – Mexxer Aug 20 '11 at 14:54
  • I think this will work as match percentage of users changes when a new answer is posted by a user. – Waseem Aug 20 '11 at 17:05
  • Yup, I still need a way to handle that huge table though. Because it grows exponentially ... if I have 50k users (and 50k isn't THAT much) I already have 2,5 billion table rows. – Mexxer Aug 20 '11 at 17:18
  • That's a trade off which you need decide on. Large table, fast response. No table, slow response. It turns out to be that space is cheaper than cpu clock cycles. I will suggest you to analyze both the approaches and choose one that suits you. – Waseem Aug 20 '11 at 17:54
1

@Wassem's answer seems on spot to your problem. I would also suggest you take an approach where percentages are updated on new answers and new accepted answers.

I have created a db only solution(gist), which would work but has an additional complexity of an intermediate table.

Ideally you should create two more tables, one for importance and another for percentage matches. You should create/insert/delete rows in these tables when user assigns/updates importance to an answer or marks some answer as acceptable. You can also leverage delayed_job or rescue to update the tables in background on the particular actions.

You may need to run the sqls once in while to sync up the data in the two new tables as there can be inconsistencies arising due to concurrency and also due to ordering of update actions in certain cases.

Updates on a accepted answer should be straight forward as you only need to update one pair. But in case somebody assigns importance to a question, there can be a lot calculations and a lot of percentages might need updation. To avoid this you might chose to only maintain the table with sums of importance for each pair, update it when required and calculate actual percentages on the fly(in db off-course).

rubish
  • 10,887
  • 3
  • 43
  • 57
  • The thing is, new importances are always assigned. Users rarely just add accepted answers alone. They usually answer a question which creates the UserAnswer and with that many AcceptedAnswers are created. So most of the time there would be ALOT of calculations. Maybe it's really best to let it calculate on the fly? I guess I have to test both alternatives and see which performs best. – Mexxer Aug 21 '11 at 15:12