Data mining is a method that needs enormous amounts of storage space and enormous amounts of computing power.
Let me give you an example:
Imagine you are the boss of a big chain of supermarkets like Wal-Mart, and you want to find out how to place the products in your shops so that consumers spend lots of money when they enter them.
First of all, you need an idea. Your idea is to find products from different product groups that are often bought together. If you have such a pair of products, you should place them as far apart as possible. If a customer wants to buy both, he or she has to walk through your whole shop, and along that way you place other products that fit well with one of the pair but are not sold as often. Some of the customers will see these products and buy them, and the revenue from these additional products is the payoff of your data-mining process.
So you need lots of data. You have to store all the data you get from all the purchases of all your customers in all your shops. When a person buys a bottle of milk, a sausage and some bread, you need to store which goods were sold, in what quantity, and at what price. Every purchase needs its own ID if you want to be able to tell that the milk and the sausage were bought together.
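To make this concrete, here is a minimal sketch in Python of what one such purchase record could look like (the field names and IDs are purely illustrative, not any real retailer's schema):

```python
from dataclasses import dataclass
from datetime import datetime

# One line item of a purchase; the purchase_id ties all items of the same
# shopping basket together, so we can later see which products were bought together.
@dataclass
class LineItem:
    purchase_id: int      # same ID for everything bought in one checkout
    store_id: int         # which shop (later mapped to a region)
    timestamp: datetime   # when the purchase happened (later mapped to a time slice)
    product_id: int       # which product
    quantity: int         # how many units
    price: float          # price paid per unit

# The basket {milk, sausage, bread} becomes three records sharing one purchase_id:
basket = [
    LineItem(4711, 42, datetime(2024, 3, 16, 15, 30), product_id=101, quantity=1, price=1.29),  # milk
    LineItem(4711, 42, datetime(2024, 3, 16, 15, 30), product_id=205, quantity=1, price=2.49),  # sausage
    LineItem(4711, 42, datetime(2024, 3, 16, 15, 30), product_id=318, quantity=2, price=1.99),  # bread
]
```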
So you have a huge amount of purchase data. And you have a lot of different products. Let's say you sell 10,000 different products in your shops. Every product can be paired with every other, which makes 10,000 * 9,999 / 2 ≈ 50,000,000 (50 million) pairs. And for each of these possible pairs you have to find out whether it is contained in a purchase. But maybe you suspect that you have different customers on a Saturday afternoon than on a Wednesday late morning, so you have to store the time of the purchase too. Maybe you define 20 time slices across the week; this makes 50 million * 20 = 1 billion records. And because people in Memphis might buy different things than people in Beverly Hills, you also need the place in your data. Let's say you define 50 regions, so you end up with 50 billion records in your database.
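A quick back-of-the-envelope calculation with these assumed numbers shows where the 50 billion comes from:

```python
# Back-of-the-envelope estimate of the counter table size,
# using the numbers assumed above.
n_products = 10_000
n_pairs = n_products * (n_products - 1) // 2   # unordered pairs: 49,995,000 (~50 million)

n_time_slices = 20                             # e.g. a few slices per weekday
n_regions = 50

n_records = n_pairs * n_time_slices * n_regions
print(f"{n_pairs:,} pairs -> {n_records:,} counter records")
# 49,995,000 pairs -> 49,995,000,000 counter records (~50 billion)
```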
And then you process all your data. If a customer bought 20 products in one purchase, you have 20 * 19 / 2 = 190 pairs. For each of these pairs you increase the counter for the time and the place of that purchase in your database. But by how much should you increase the counter? Just by 1? Or by the quantity of the products bought? But you have a pair of two products: should you take the sum of both quantities, or the maximum? Better to keep more than one counter, so you can count it in every way you can think of.
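A minimal sketch of that counting step, keeping several counters per pair as suggested above (the data layout and names are my own illustration):

```python
from itertools import combinations
from collections import defaultdict

# One counter record per (pair, time slice, region); each record keeps
# several counts so the data can later be analyzed in different ways.
counters = defaultdict(lambda: {"baskets": 0, "qty_sum": 0, "qty_max": 0})

def count_basket(items, time_slice, region):
    """items: list of (product_id, quantity) for one purchase."""
    for (p1, q1), (p2, q2) in combinations(sorted(items), 2):
        c = counters[((p1, p2), time_slice, region)]
        c["baskets"] += 1              # increase by 1 per basket containing the pair
        c["qty_sum"] += q1 + q2        # ...or by the sum of the two quantities
        c["qty_max"] += max(q1, q2)    # ...or by the larger of the two

# A basket with 20 products would produce 20 * 19 / 2 = 190 pair updates.
count_basket([(101, 1), (205, 1), (318, 2)], time_slice=13, region=7)
```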
And you have to do something else: customers buy much more milk and bread than champagne and caviar. So even if they picked products at random, the pair milk-bread would of course get a higher count than the pair champagne-caviar. So when you analyze your data, you have to compare the actual count of a pair against the count you would expect anyway from how popular each product is on its own.
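One simple way to get such an estimated count is to assume the two products are bought independently of each other; here is a sketch of that baseline (the numbers are made up):

```python
# Expected pair count under the assumption that the two products are bought
# independently of each other: if product A appears in n_a baskets and
# product B in n_b baskets out of n_total baskets, we expect roughly
#   n_total * (n_a / n_total) * (n_b / n_total)
# baskets to contain both.
def expected_pair_count(n_a, n_b, n_total):
    return n_total * (n_a / n_total) * (n_b / n_total)

# Milk and bread are both very common, so their expected pair count is high:
print(expected_pair_count(400_000, 350_000, 1_000_000))   # 140000.0
# Champagne and caviar are rare, so even a handful of joint purchases
# may already be far above expectation:
print(expected_pair_count(2_000, 500, 1_000_000))         # 1.0
```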
Then, when all this is done, you run your data-mining query. You select the pair with the highest ratio of actual count to estimated count, and you select it from a database table with many billions of records. This might take hours to process, so think carefully about whether your query really asks what you want to know before you submit it!
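In code, the query boils down to ranking the pairs by that ratio, sketched here with the hypothetical counters and expected_pair_count from above (a real installation would run this inside the database, not in application code):

```python
# Rank the pairs by the ratio of actual count to estimated count and
# return the most surprising ones.
def top_pairs(counters, product_counts, n_baskets, limit=10):
    scored = []
    for (pair, time_slice, region), c in counters.items():
        expected = expected_pair_count(product_counts[pair[0]],
                                       product_counts[pair[1]],
                                       n_baskets)
        if expected > 0:
            scored.append((c["baskets"] / expected, pair, time_slice, region))
    # highest ratio of actual to expected first
    return sorted(scored, reverse=True)[:limit]
```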
You might find out that in rural areas, people on a Saturday afternoon buy much more beer together with diapers than you expected. So you place beer at one end of the shop and diapers at the other end, and that makes lots of people walk through your whole shop, where they see (and hopefully buy) many other things they wouldn't have seen (and bought) if beer and diapers were placed close together.
And remember: the costs of your data-mining process are covered only by those additional purchases your customers make!
Conclusion:
- You must store counts for pairs, triples, or even bigger tuples of items, which needs a lot of space. Because you don't know what you will find in the end, you have to store every possible combination!
- You must count those tuples
- You must compare counted values with estimated values