Efficient algorithm for association Obj1 to Obj2 based on a rule set

Question

I have a table containing millions of transaction records(Obj1) which looks like this

TransactionNum  Country   ZipCode State    TransactionAmount
1               USA        94002   CA         1000
2               USA        00023   FL         1000

I have another table containing Salesreps records(Obj2),again in hundreds of thousands.

SalesrepId   PersonNumber     Name
Srp001          123           Rohan
Srp002          124           Shetty

I have a few ruleset tables,where basically rules are defined as below

Rule Name : Rule 1
Qualifying criteria : Country = "USA" and (ZipCode = 94002 or State = "FL")
Credit receiving salesreps :
- Srp001 gets 70%
- Srp002 gets 30%

The qualifying criteria is for the transactions,which means if the transaction attributes match the criteria in the Rule then credits are assigned to the salesreps defined in the rule's credit receiver section.

Now,I need an algorithm which populates a result table as below

ResultId TransactionNumber SalesrepId  Credit
1        1                 Srp001        700
2        2                 Srp002        300

What is the efficient algorithm to do this?

Can you give a rough estimate of the numbers of transactions, sales reps and rules? An algorithm that's efficient for 3 rules and 1M transactions may be awful for 1000 rules and 2000 transactions. — ugoren, Jan 02 '12 at 19:57
Transaction records will be ~15 million.Rules will be in ~4000.Salesreps table will contain ~50000 — Rohan, Jan 02 '12 at 20:13
The format of your data is not at all clear. In the Salesreps records, you say that Attribute1 is an intValue, but in the Transaction records and the rules, Attribute1 is apparently a string. Also, how many attributes are there? Your transaction and salesrep records show two, but your sample rule shows three. Also, is "salesrep1" in your sample rule the same as SalesrepId == 001? From what you've told us so far, there is not enough information and there are too many ambiguities to get you anything like a good answer. — Jim Mischel, Jan 02 '12 at 20:19
How many possible values are there for each of the attributes? And how many attributes are there? — Jim Mischel, Jan 02 '12 at 20:23
Jim,Sorry for the confusion.I have edited the question to be more clear.Basically the criteria in the rules are for transaction attributes only. — Rohan, Jan 03 '12 at 05:22

score 1 · Accepted Answer · answered Jan 04 '12 at 04:43

So your real problem is how to quickly match transactions to potential rules. You can do this with an inverted index that says which rules match particular values for the attributes. For example, let's say you have these three rules:

Rule 1: if Country = "USA" and State = "FL"
            S1 gets 100%
Rule 2: if Country = "USA" and (State = "CO" or ZIP = 78640)
            S2 gets 60%
            S3 gets 40%
Rule 3: if Country = "UK"
            S3 gets 70%
            S2 gets 30%

Now, you process your rules and create output like this:

Country,USA,Rule1
State,FL,Rule1
Country,USA,Rule2
State,CO,Rule2
ZIP,78640,Rule2
Country,UK,Rule3

You then process that output (or you can do it while you're processing the rules) and build three tables. One maps Country values to rules, one maps State values to rules, and one maps ZIP values to rules. You end up with something like:

Countries:
    USA, {Rule1, Rule2}
    UK, {Rule3}
States:
    FL, {Rule1}
    CO, {Rule2}
    "*", {Rule3}
ZIP:
    78640, {Rule2}
    "*", {Rule1, Rule3}

The "*" value is a "don't care," which will match all rules that don't specifically mention that field. Whether this is required depends on how you've structured your rules.

The above indexes are constructed whenever your rules change. With 4000 rules, it shouldn't take any time at all, and the list size shouldn't be very large.

Now, given a transaction that has a Country value of "USA", you can look in the Countries table to find all the rules that mention that country. Call that list Country_Rules. Do the same thing for States and ZIP codes.

You can then do a list intersection. That is, build another list called Country_And_State_Rules that contains only those rules that exist in both the Country_Rules and State_Rules lists. That will typically be a small set of possible rules. You could then go through them one-by-one, testing country, state, and ZIP code, as required.

What you're building is essentially a search engine for rules. It should allow you to narrow the candidates from 4,000 to just a handful very quickly.

There are a few problems that you'll have to solve. Having conditional logic ("OR"), complicates things a little bit, but it's not intractable. Also, you have to determine how to handle ambiguity (what if two rules match?). Or, if no rules match the particular Country and State, then you have to back up and check for rules that only match the Country ... or only match the State. That's where the "don't care" comes in.

If your rules are sufficiently unambiguous, then in the vast majority of cases you should be able to pick the relevant rule very quickly. Some few cases will require you to search many different rules for some transactions. But those cases should be pretty rare. If they're frequent, then you need to consider re-examining your rule set.

Once you know which rule applies to a particular transaction, you can easily look up which salesperson gets how much, since the proportions are stored with the rules.

Thanks Jim for this approach.The problem of ambiguity can be solved by having these rules in a hierarchy and by determining the relative rank. — Rohan, Jan 04 '12 at 06:18

Efficient algorithm for association Obj1 to Obj2 based on a rule set

1 Answers1