5

I have a quick question. I am trying to make a fraud detection app in java, the app will be primarily based on Benford's law. Benford's law is super cool, it basically can be interpreted to say that in a real financial transaction the first digit is commonly a 1, 2, or 3 and very rarely an 8, 9. I haven't been able to get the Benford formula translated into code that can be run in Java.

http://www.mathpages.com/home/kmath302/kmath302.htm This link has more information about what the Benford law is and how it can be used.

I know that I will have to use the java math class to be able to use a natural log function, but I am not sure how to do that. Any help would be greatly appreciated.

Thanks so much!!

starblue
  • 55,348
  • 14
  • 97
  • 151
ohGosh
  • 539
  • 1
  • 7
  • 10
  • This is pretty incredible. After 20 minutes even an exceedingly trivial Java+Math question has not been answered. Soon I'll post a Delphi version of the desired snippet! – Andreas Rejbrand Oct 19 '11 at 00:20
  • It's not a reflection on Java; it says more about the level of interest in bothering to give an answer. All that Delphi power didn't make it the king o' languages. – duffymo Oct 19 '11 at 00:43
  • @duffymo: I was more thinking that since no one that knows Java seems interested in posting an answer, then I will have to post an answer in the only language I know well, that is, in Delphi... – Andreas Rejbrand Oct 19 '11 at 00:47
  • Now it's clear - you don't know anything about Java, including how to read javadocs. That would explain your erroneous down vote and explanation below. – duffymo Oct 19 '11 at 00:59
  • 1
    @duffymo: I do not know any Java at all. However, it is obvious that you have misunderstood the OP's question. – Andreas Rejbrand Oct 19 '11 at 01:02
  • @Andreas Rejbrand -- what is the point of posting an answer undesired by the question asker? They asked for it in Java. Therefore it should be answered in Java. To post otherwise is just asking others to down vote it. – Chris Aldrich Oct 19 '11 at 12:48
  • @Chris Aldrich: Of course, that is why I did never post a Delphi solution! I was just kidding. (I thought it was strange that such an easy question didn't get an answer for so long.) – Andreas Rejbrand Oct 19 '11 at 13:18

3 Answers3

8

@Rui has mentioned how to compute the probability distribution function, but that's not going to help you much here.

What you want to use is either the Kolmogorov-Smirnov test or the Chi-squared test. Both are for used for comparing data to a known probability distribution to determine whether the dataset is likely/unlikely to have that probability distribution.

Chi-squared is for discrete distributions, and K-S is for continuous.


For using chi-squared with Benford's law, you would just create a histogram H[N], e.g. with 9 bins N=1,2,... 9, iterate over the dataset to check the first digit to count # of samples for each of the 9 non-zero digits (or first two digits with 90 bins). Then run the chi-squared test to compare the histogram with the expected count E[N].

For example, let's say you have 100 pieces of data. E[N] can be computed from Benford's Law:

E[1] = 30.1030 (=100*log(1+1))
E[2] = 17.6091 (=100*log(1+1/2))
E[3] = 12.4939 (=100*log(1+1/3))
E[4] =  9.6910
E[5] =  7.9181
E[6] =  6.6946
E[7] =  5.7992
E[8] =  5.1152
E[9] =  4.5757

Then compute Χ2 = sum((H[k]-E[k])^2/E[k]), and compare to a threshold as specified in the test. (Here we have a fixed distribution with no parameters, so the number of parameters s=0 and p = s+1 = 1, and the # of bins n is 9, so the # of degrees of freedom = n-p = 8*. Then you go to your handy-dandy chi-squared table and see if the numbers look ok. For 8 degrees of freedom the confidence levels look like this:

Χ2 > 13.362: 10% chance the dataset still matches Benford's Law

Χ2 > 15.507: 5% chance the dataset still matches Benford's Law

Χ2 > 17.535: 2.5% chance the dataset still matches Benford's Law

Χ2 > 20.090: 1% chance the dataset still matches Benford's Law

Χ2 > 26.125: 0.1% chance the dataset still matches Benford's Law

Suppose your histogram yielded H = [29,17,12,10,8,7,6,5,6], for a Χ2 = 0.5585. That's very close to the expected distribution. (maybe even too close!)

Now suppose your histogram yielded H = [27,16,10,9,5,11,6,5,11], for a Χ2 = 13.89. There is less than a 10% chance that this histogram is from a distribution that matches Benford's Law. So I'd call the dataset questionable but not overly so.

Note that you have to pick the significance level (e.g. 10%/5%/etc.). If you use 10%, expect roughly 1 out of every 10 datasets that are really from Benford's distribution to fail, even though they're OK. It's a judgement call.

Looks like Apache Commons Math has a Java implementation of a chi-squared test:

ChiSquareTestImpl.chiSquare(double[] expected, long[] observed)


*note on degrees of freedom = 8: this makes sense; you have 9 numbers but they have 1 constraint, namely they all have to add up to the size of the dataset, so once you know the first 8 numbers of the histogram, you can figure out the ninth.


Kolmogorov-Smirnov is actually simpler (something I hadn't realized until I found a simple enough statement of how it works) but works for continuous distributions. The method works like this:

  • You compute the cumulative distribution function (CDF) for your probability distribution.
  • You compute an empirical cumulative distribution function (ECDF), which is easily obtained by putting your dataset in sorted order.
  • You find D = (approximately) the maximum vertical distance between the two curves.

enter image description here

Let's handle these more in depth for Benford's Law.

  1. CDF for Benford's Law: this is just C = log10 x, where x is in the interval [1,10), i.e. including 1 but excluding 10. This can be easily seen if you look at the generalized form of Benford's Law, and instead of writing it log(1+1/n), writing it as log(n+1)-log(n) -- in other words, to get the probability of each bin, they're subtracting successive differences of log(n), so log(n) must be the CDF

  2. ECDF: Take your dataset, and for each number, make the sign positive, write it in scientific notation, and set the exponent to 0. (Not sure what to do if you have a number that is 0; that seems to not lend itself to Benford's Law analysis.) Then sort the numbers in ascending order. The ECDF is the number of datapoints <= x for any valid x.

  3. Calculate maximum difference D = max(d[k]) for each d[k] = max(CDF(y[k]) - (k-1)/N, k/N - CDF(y[k]).

Here's an example: suppose our dataset = [3.02, 1.99, 28.3, 47, 0.61]. Then ECDF is represented by the sorted array [1.99, 2.83, 3.02, 4.7, 6.1], and you calculate D as follows:

D = max(
  log10(1.99) - 0/5, 1/5 - log10(1.99),
  log10(2.83) - 1/5, 2/5 - log10(2.83),
  log10(3.02) - 2/5, 3/5 - log10(3.02),
  log10(4.70) - 3/5, 4/5 - log10(4.70),
  log10(6.10) - 4/5, 5/5 - log10(6.10)
)

which = 0.2988 (=log10(1.99) - 0).

Finally you have to use the D statistic -- I can't seem to find any reputable tables online, but Apache Commons Math has a KolmogorovSmirnovDistributionImpl.cdf() function that takes a calculated D value as input and tells you the probability that D would be less than this. It's probably easier to take 1-cdf(D) which tells you the probability that D would be greater than or equal to the value you calculate: if this is 1% or 0.1% it probably means that the data doesn't fit Benford's Law, but if it's 25% or 50% it's probably a good match.

Jason S
  • 184,598
  • 164
  • 608
  • 970
  • could you show this in Java with the java.lang.Math class? or at least point out which operators the question asker wants? I know the point isn't to spoon feed, but they did ask for a Java example (similar to what Rui did). – Chris Aldrich Oct 19 '11 at 12:52
  • 1
    That may answer the literal question the OP tried to ask, but just calculating the PDF of Benford's Law isn't really helpful on its own; instead you need to use statistical tests. But I have edited to include references to Apache's Java implementations of some statistics functions that are relevant to the task. – Jason S Oct 19 '11 at 21:06
4

If I understand correctly, you want the Benford formula in Java syntax?

public static double probability(int i) {   
    return Math.log(1+(1/(double) i))/Math.log(10);
}

Remember to insert a

import java.lang.Math;

after your package declaration.

I find it suspicious no one answered this yet.... >_>

Rui Vieira
  • 5,253
  • 5
  • 42
  • 55
  • 1
    because it's like asking how do you represent c = a+b in Java. The poster have not tried anything, and if I know SF well, if you don't post what you've tried and failed, you sure might not get a reply. – Akintayo Olusegun Oct 19 '11 at 11:36
1

I think what you are looking for is something like this:

for(int i = (int)Math.pow(10, position-1); i <= (Math.pow(10, position)-1); i++)
        {
           answer +=  Math.log(1+(1/(i*10+(double) digit)));
        }

answer *= (1/Math.log(10)));