
We build software that audits fees charged by banks to merchants that accept credit and debit cards. Our customers want us to tell them if the card processor is overcharging them. Per-transaction credit card fees are calculated like this:

fee = fixed + variable*transaction_price
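
For example, a scheme with fixed = 5 cents and variable = 2% would charge 5 + 0.02 * 200 = 9 cents on a 200-cent transaction (those numbers are made up purely for illustration).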

A "fee scheme" is the pair of (fixed, variable) used by a group of credit cards, e.g. "MasterCard business debit gold cards issued by First National Bank of Hollywood". We believe there are fewer than 10 different fee schemes in use at any time, but we aren't getting a complete nor current list of fee schemes from our partners. (yes, I know that some "fee schemes" are more complicated than the equation above because of caps and other gotchas, but our transactions are known to have only a + bx schemes in use).

Here's the problem we're trying to solve: we want to use per-transaction data about fees to derive the fee schemes in use. Then we can compare that list to the fee schemes that each customer should be using according to their bank.

The data we get about each transaction is a data tuple: (card_id, transaction_price, fee).

transaction_price and fee are in integer cents. The bank rolls over fractional cents for each transaction until the cumulative fraction exceeds one cent, and then a "rounding cent" is attached to the fee of that transaction. We cannot predict which transaction the "rounding cent" will be attached to.
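
To make that concrete, here is a small C# sketch of how I believe the carryover works; the 5-cent fixed / 2% variable scheme mentioned in the comments is made up for illustration:

using System;

// Illustrative only: simulates how I believe the bank carries fractional cents
// forward and attaches a "rounding cent" once the carry reaches a full cent.
class FeeSimulator
{
    private decimal carry = 0m;

    // priceInCents and the returned fee are integer cents, matching our data;
    // fixedCents and variableRate describe a hypothetical scheme, e.g. 5 and 0.02m.
    public int FeeInCents(int priceInCents, int fixedCents, decimal variableRate)
    {
        decimal exact = fixedCents + variableRate * priceInCents;
        int fee = (int)Math.Floor(exact);
        carry += exact - fee;                      // accumulate the fractional cent
        if (carry >= 1m) { fee++; carry -= 1m; }   // attach the rounding cent here
        return fee;
    }
}

With fixedCents = 5 and variableRate = 0.02m, a run of 125-cent transactions produces fees of 7, 8, 7, 8, ..., so any single fee can be a cent away from the exact value.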

card_id identifies a group of cards that share the same fee scheme. In a typical day of 10,000 transactions, there may be several hundred unique card_id's. Multiple card_id's will share a fee scheme.

The data we get looks like this, and what we want to figure out is the last two columns.

card_id    transaction_price       fee        fixed        variable
=======================================================================
12345      200                     22         ?            ?
67890      300                     21         ?            ?
56789      150                      8         ?            ?
34567      150                      8         ?            ?
34567      150    "rounding cent"-> 9         ?            ?
34567      150                      8         ?            ?

The end result we want is a short list, with 10 or fewer entries, showing the fee schemes that best fit our data, like this:

fee_scheme_id       fixed     variable
======================================
1                      22            0
2                      21            0
3                       ?            ?
4                       ?            ?
...

The average fee is about 8 cents. This means the rounding cents have a huge impact and the derivation above requires a lot of data.

The average transaction is 125 cents. Transaction prices are always on 5-cent boundaries.

We want a short list of fee schemes that "fit" 98%+ of the 3,000+ transactions each customer gets each day. If that's not enough data to achieve 98% confidence, we can use multiple days' data.

Because of the rounding cents applied somewhat arbitrarily to each transaction, this isn't a simple algebra problem. Instead, it's a kind of statistical clustering exercise that I'm not sure how to solve.

Any suggestions for how to approach this problem? The implementation can be in C# or T-SQL, whichever makes the most sense given the algorithm.

Justin Grant

4 Answers


Hough transform

Consider your problem in image terms: if you plot your input data on a diagram of price vs. fee, each scheme's entries would form a straight line (with rounding cents being noise). Consider the density map of your plot as an image, and the task is reduced to finding straight lines in an image, which is exactly the job of the Hough transform.

You would essentially approach this by plotting one line for each transaction into a diagram of possible fixed fee versus possible variable fee, adding the values of lines where they cross. At the points of real fee schemes, many lines will intersect and form a large local maximum. By detecting this maximum, you find your fee scheme, and even a degree of importance for the fee scheme.

This approach will surely work, but might take some time depending on the resolution you want to achieve. If computation time proves to be an issue, remember that a Voronoi diagram of a coarse Hough space can be used as a classifier - and once you have classified your points into fee schemes, simple linear regression solves your problem.
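
Here is a minimal C# sketch of that voting step, accumulating directly in (fixed, variable) space; the 0-50 cent and 0-5% ranges and the bin sizes are guesses you would tune to your own data:

using System;
using System.Collections.Generic;

class HoughFeeFinder
{
    // Accumulator over (fixed, variable): rows are the fixed fee in cents (0..50),
    // columns are the variable rate in steps of 0.05% (0..5%). Ranges are guesses.
    const int MaxFixed = 50;
    const double MaxVariable = 0.05, VariableStep = 0.0005;

    public static (int fixedCents, double variable) FindBestScheme(
        IEnumerable<(int price, int fee)> transactions)
    {
        int vBins = (int)Math.Round(MaxVariable / VariableStep) + 1;
        var votes = new int[MaxFixed + 1, vBins];

        foreach (var (price, fee) in transactions)
            for (int f = 0; f <= MaxFixed; f++)
            {
                // For each candidate fixed fee, the implied variable rate is
                // (fee - fixed) / price; vote for that cell if it is in range.
                double v = (fee - f) / (double)price;
                int vBin = (int)Math.Round(v / VariableStep);
                if (vBin >= 0 && vBin < vBins) votes[f, vBin]++;
            }

        // The strongest peak is the dominant fee scheme.
        int bestF = 0, bestV = 0;
        for (int f = 0; f <= MaxFixed; f++)
            for (int v = 0; v < vBins; v++)
                if (votes[f, v] > votes[bestF, bestV]) { bestF = f; bestV = v; }

        return (bestF, bestV * VariableStep);
    }
}

To recover the remaining schemes, remove (or down-weight) the transactions explained by the dominant peak and repeat, or pick several well-separated local maxima in one pass.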

thiton
  • Great answer. It was a toss-up between your answer which gave me the best conceptual overview of the problem space, and @dmckee's answer which provided more implementation detail, but in this case the conceptual help was good enough so you get the +15. Thanks! – Justin Grant Dec 31 '11 at 23:05

Considering that the processing's storage requirements are of the same order of magnitude as a day's worth of transaction data, I assume that such storage is not a problem, so:

  • First pass: Group the transactions for each card_id by transaction_price, keeping card_id, transaction_price, and average fee. This can easily be done in SQL. It assumes there are no outliers, but you can catch those after this stage if required. The resulting number of rows is guaranteed to be no higher than the number of raw data points.

  • Second pass: Per group, walk these new data points (with a cursor or in C#) and calculate the average value of b. Again, any outliers can be caught after this stage if desired.

  • Third pass: Per group, calculate the average value of a, now that b is known. This is basic SQL. Outliers can be handled as always.

If you decide to do the second step in a cursor you can stuff all that into a stored procedure.

Different card_id groups that use the same fee scheme can now be coalesced (sorry if this is the wrong word, non-English native) into fee schemes by rounding a and b to a sane precision and grouping again.
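
A rough C# / LINQ sketch of the three passes plus the final coalescing; note that it uses an ordinary least-squares fit per card_id rather than averaging pairwise slopes, and the rounding precisions in the last step are arbitrary choices:

using System;
using System.Collections.Generic;
using System.Linq;

record Txn(string CardId, int Price, int Fee);

static class SchemeDeriver
{
    public static IEnumerable<(double fixedCents, double variable, int cardCount)>
        Derive(IEnumerable<Txn> txns)
    {
        // Pass 1: average the fee per (card_id, price) to smooth out rounding cents.
        var points = txns
            .GroupBy(t => (t.CardId, t.Price))
            .Select(g => (g.Key.CardId, Price: (double)g.Key.Price,
                          Fee: g.Average(t => t.Fee)));

        // Passes 2 and 3: slope (b) and intercept (a) per card_id via least squares.
        // Cards with only one distinct price are skipped; they cannot pin down b.
        var perCard = points
            .GroupBy(p => p.CardId)
            .Where(g => g.Select(p => p.Price).Distinct().Count() >= 2)
            .Select(g =>
            {
                double mx = g.Average(p => p.Price), my = g.Average(p => p.Fee);
                double b = g.Sum(p => (p.Price - mx) * (p.Fee - my))
                         / g.Sum(p => (p.Price - mx) * (p.Price - mx));
                double a = my - b * mx;
                return (g.Key, a, b);
            });

        // Coalesce card_ids whose (a, b) round to the same values.
        return perCard
            .GroupBy(c => (A: Math.Round(c.a, 0), B: Math.Round(c.b, 3)))
            .Select(g => (g.Key.A, g.Key.B, g.Count()));
    }
}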

Eugen Rieck
  • Could you please elaborate on the second step? I see absolutely no way to calculate b trivially when a is unknown. – thiton Dec 22 '11 at 20:11
  • Between two data points you know the difference in price (delta x) and the difference in fee (delta y) - that easily translates to b in y=a+bx – Eugen Rieck Dec 22 '11 at 20:13
  • OK, fair enough, didn't really concentrate on the relatively many data values per `card_id`. So your second and third pass is simply a linear regression per `card_id`. – thiton Dec 22 '11 at 20:22
  • This is what the problem is about: Linear regression with a certain confidence for subsets of data that are known to belong to the same regression group, while later combining those regression groups that turned out to have the same solution into a solution group. – Eugen Rieck Dec 22 '11 at 20:25

The Hough transform is the most general answer, though I don't know how one would implement it in SQL (rather than pulling the data out and processing it in a general purpose language of your choice).

Alas, the naive version is known to be slow if you have a lot of input data (1000 points is kinda medium sized) and if you want high precision results (scales as size_of_the_input / (rho_precision * theta_precision)).

There is a faster approach based on 2^n-trees, but there are few implementations out on the web to just plug in. (I recently did one in C++ as a testbed for a project I'm involved in. Maybe I'll clean it up and post it somewhere.)


If there is some additional order to the data you may be able to do better (i.e. do the line segments form a piecewise function?).


Naive Hough transform

Define an accumulator in (theta, rho) space spanning [-pi, pi) and [0, max(hypotenuse(x, y))] as a 2D array.

Foreach point (x,y) in the input data
   Foreach bin in theta
      find the distance rho of the altitude from the origin to
      a line through (x,y) making angle theta with the horizontal:
      rho = x cos(theta) + y sin(theta)
      and increment the bin (theta,rho) in the accumulator
Find the maximum bin in the accumulator; this
represents the most line-like structure in the data
if (theta != 0) { a = rho/sin(theta); b = -1/tan(theta); }

Reliably getting multiple lines out of a single pass takes a little more bookkeeping, but it is not significantly harder.

You can improve the result a little by smoothing the data near the candidate peaks and fitting to get sub-bin precision, which should be faster than using smaller bins and should pick up the effect of the "rounding" cents fairly smoothly.
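
For the multi-line bookkeeping, one option (a sketch, not necessarily what was meant above) is greedy peak picking with neighbourhood suppression over the accumulator array; the vote threshold and suppression radius are arbitrary knobs:

using System;
using System.Collections.Generic;

static class HoughPeaks
{
    // Return up to maxPeaks (thetaBin, rhoBin) peaks from an accumulator,
    // zeroing a small neighbourhood around each accepted peak so that one
    // strong line is not reported several times.
    public static List<(int thetaBin, int rhoBin)> Extract(
        int[,] acc, int maxPeaks, int minVotes, int radius = 2)
    {
        var peaks = new List<(int thetaBin, int rhoBin)>();
        var work = (int[,])acc.Clone();
        int nTheta = work.GetLength(0), nRho = work.GetLength(1);

        for (int k = 0; k < maxPeaks; k++)
        {
            int bestT = -1, bestR = -1, best = minVotes - 1;
            for (int t = 0; t < nTheta; t++)
                for (int r = 0; r < nRho; r++)
                    if (work[t, r] > best) { best = work[t, r]; bestT = t; bestR = r; }
            if (bestT < 0) break;                    // nothing above threshold left

            peaks.Add((bestT, bestR));
            for (int t = Math.Max(0, bestT - radius); t <= Math.Min(nTheta - 1, bestT + radius); t++)
                for (int r = Math.Max(0, bestR - radius); r <= Math.Min(nRho - 1, bestR + radius); r++)
                    work[t, r] = 0;                  // non-maximum suppression
        }
        return peaks;
    }
}

Each returned (thetaBin, rhoBin) converts back to a fee scheme through the a = rho/sin(theta), b = -1/tan(theta) relation above.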

dmckee --- ex-moderator kitten

You're looking at the rounding cent as a significant source of noise in your calculations, so I'd focus on minimizing the noise due to that issue. The easiest way to do this IMO is to increase the sample size.

Instead of viewing your data as thousands of y = mx + b (+ rounding) equations, group your data into larger subsets:

If you combine X transactions with the same card_id and look at this as (sum of X fees) = (variable rate)*(sum of X transaction prices) + X*(fixed rate) (+ rounding), the rounding noise will likely fall by the wayside.

Get enough groups of size 'X' and you should be able to come up with a pretty close representation of the real numbers.
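
For instance, here is a C# sketch that splits one card_id's transactions into two pooled groups and solves the resulting 2x2 system; the half-and-half split is arbitrary, and it only works when the two groups end up with different average prices:

using System;
using System.Collections.Generic;
using System.Linq;

static class AggregateFit
{
    // Solve the system
    //   sumFee1 = variable * sumPrice1 + n1 * fixedFee
    //   sumFee2 = variable * sumPrice2 + n2 * fixedFee
    // formed by pooling two halves of the same card_id's transactions.
    // Pooling many transactions makes the rounding cents negligible.
    public static (double fixedCents, double variable)? Fit(
        IReadOnlyList<(int price, int fee)> txns)
    {
        if (txns.Count < 4) return null;
        var g1 = txns.Take(txns.Count / 2).ToList();
        var g2 = txns.Skip(txns.Count / 2).ToList();

        double p1 = g1.Sum(t => t.price), f1 = g1.Sum(t => t.fee), n1 = g1.Count;
        double p2 = g2.Sum(t => t.price), f2 = g2.Sum(t => t.fee), n2 = g2.Count;

        double det = p1 * n2 - p2 * n1;   // zero if both halves have the same average price
        if (Math.Abs(det) < 1e-9) return null;

        double variable = (f1 * n2 - f2 * n1) / det;
        double fixedCents = (p1 * f2 - p2 * f1) / det;
        return (fixedCents, variable);
    }
}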

MBirchmeier