Data mining is a method that needs enormous amounts of storage space and enormous amounts of computing power.
Let me give you an example:
Imagine you are the boss of a big chain of supermarkets like Wal-Mart, and you want to find out how to place the products in your shops so that consumers spend lots of money when they enter them.
First of all, you need an idea. Your idea is to find products from different product groups that are often bought together. If you have such a pair of products, you should place them as far apart as possible. If a customer wants to buy both, he or she has to walk through your whole shop, and along that way you place other products that fit well with one of the pair but are not sold as often. Some of the customers will see these products and buy them, and the revenue from these additional products is the payoff of your data-mining process.
So you need lots of data. You have to store all the data you get from all the purchases of all your customers in all your shops. When a person buys a bottle of milk, a sausage and some bread, you need to store which goods were sold, in what quantity, and at what price. Every purchase needs its own ID if you want to be able to tell that the milk and the sausage were bought together.
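To make this concrete, here is a minimal sketch in Python of what one such purchase record could look like (the field names and IDs are purely illustrative, not any real retailer's schema):

```python
from dataclasses import dataclass
from datetime import datetime

# One line item of a purchase; the purchase_id ties all items of the same
# shopping basket together, so we can later see which products were bought together.
@dataclass
class LineItem:
    purchase_id: int      # same ID for everything bought in one checkout
    store_id: int         # which shop (later mapped to a region)
    timestamp: datetime   # when the purchase happened (later mapped to a time slice)
    product_id: int       # which product
    quantity: int         # how many units
    price: float          # price paid per unit

# The basket {milk, sausage, bread} becomes three records sharing one purchase_id:
basket = [
    LineItem(4711, 42, datetime(2024, 3, 16, 15, 30), product_id=101, quantity=1, price=1.29),  # milk
    LineItem(4711, 42, datetime(2024, 3, 16, 15, 30), product_id=205, quantity=1, price=2.49),  # sausage
    LineItem(4711, 42, datetime(2024, 3, 16, 15, 30), product_id=318, quantity=2, price=1.99),  # bread
]
```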
So you have a huge amount of purchase data. And you have a lot of different products. Let's say you sell 10,000 different products in your shops. Every product can be paired with every other, which makes 10,000 * 9,999 / 2 ≈ 50,000,000 (50 million) pairs. And for each of these possible pairs you have to find out whether it is contained in a purchase. But maybe you suspect that you have different customers on a Saturday afternoon than on a Wednesday late morning, so you have to store the time of the purchase too. Maybe you define 20 time slices across the week; this makes 50 million * 20 = 1 billion records. And because people in Memphis might buy different things than people in Beverly Hills, you also need the place in your data. Let's say you define 50 regions, so you end up with 50 billion records in your database.
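A quick back-of-the-envelope calculation with these assumed numbers shows where the 50 billion comes from:

```python
# Back-of-the-envelope estimate of the counter table size,
# using the numbers assumed above.
n_products = 10_000
n_pairs = n_products * (n_products - 1) // 2   # unordered pairs: 49,995,000 (~50 million)

n_time_slices = 20                             # e.g. a few slices per weekday
n_regions = 50

n_records = n_pairs * n_time_slices * n_regions
print(f"{n_pairs:,} pairs -> {n_records:,} counter records")
# 49,995,000 pairs -> 49,995,000,000 counter records (~50 billion)
```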
And then you process all your data. If a customer bought 20 products in one purchase, you have 20 * 19 / 2 = 190 pairs. For each of these pairs you increase the counter for the time and the place of that purchase in your database. But by how much should you increase the counter? Just by 1? Or by the quantity of the products bought? But you have a pair of two products: should you take the sum of both quantities, or the maximum? Better to keep more than one counter, so you can count it in every way you can think of.
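A minimal sketch of that counting step, keeping several counters per pair as suggested above (the data layout and names are my own illustration):

```python
from itertools import combinations
from collections import defaultdict

# One counter record per (pair, time slice, region); each record keeps
# several counts so the data can later be analyzed in different ways.
counters = defaultdict(lambda: {"baskets": 0, "qty_sum": 0, "qty_max": 0})

def count_basket(items, time_slice, region):
    """items: list of (product_id, quantity) for one purchase."""
    for (p1, q1), (p2, q2) in combinations(sorted(items), 2):
        c = counters[((p1, p2), time_slice, region)]
        c["baskets"] += 1              # increase by 1 per basket containing the pair
        c["qty_sum"] += q1 + q2        # ...or by the sum of the two quantities
        c["qty_max"] += max(q1, q2)    # ...or by the larger of the two

# A basket with 20 products would produce 20 * 19 / 2 = 190 pair updates.
count_basket([(101, 1), (205, 1), (318, 2)], time_slice=13, region=7)
```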
And you have to do something else: customers buy much more milk and bread than champagne and caviar. So even if they picked products at random, the pair milk-bread would of course get a higher count than the pair champagne-caviar. So when you analyze your data, you have to compare the actual count of a pair against the count you would expect anyway from how popular each product is on its own.
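One simple way to get such an estimated count is to assume the two products are bought independently of each other; here is a sketch of that baseline (the numbers are made up):

```python
# Expected pair count under the assumption that the two products are bought
# independently of each other: if product A appears in n_a baskets and
# product B in n_b baskets out of n_total baskets, we expect roughly
#   n_total * (n_a / n_total) * (n_b / n_total)
# baskets to contain both.
def expected_pair_count(n_a, n_b, n_total):
    return n_total * (n_a / n_total) * (n_b / n_total)

# Milk and bread are both very common, so their expected pair count is high:
print(expected_pair_count(400_000, 350_000, 1_000_000))   # 140000.0
# Champagne and caviar are rare, so even a handful of joint purchases
# may already be far above expectation:
print(expected_pair_count(2_000, 500, 1_000_000))         # 1.0
```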
Then, when all this is done, you run your data-mining query. You select the pair with the highest ratio of actual count to estimated count, and you select it from a database table with many billions of records. This might take hours to process, so think carefully about whether your query really asks what you want to know before you submit it!
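In code, the query boils down to ranking the pairs by that ratio, sketched here with the hypothetical counters and expected_pair_count from above (a real installation would run this inside the database, not in application code):

```python
# Rank the pairs by the ratio of actual count to estimated count and
# return the most surprising ones.
def top_pairs(counters, product_counts, n_baskets, limit=10):
    scored = []
    for (pair, time_slice, region), c in counters.items():
        expected = expected_pair_count(product_counts[pair[0]],
                                       product_counts[pair[1]],
                                       n_baskets)
        if expected > 0:
            scored.append((c["baskets"] / expected, pair, time_slice, region))
    # highest ratio of actual to expected first
    return sorted(scored, reverse=True)[:limit]
```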
You might find out that in rural areas, people on a Saturday afternoon buy much more beer together with diapers than you expected. So you place beer at one end of the shop and diapers at the other end, and that makes lots of people walk through your whole shop, where they see (and hopefully buy) many other things they wouldn't have seen (and bought) if beer and diapers were placed close together.
And remember: the costs of your data-mining process are covered only by those additional purchases your customers make!
Conclusion:
- You must store counts for pairs, triples, or even bigger tuples of items, which needs a lot of space. Because you don't know what you will find in the end, you have to store every possible combination!
- You must count those tuples
- You must compare counted values with estimated values