
I am new to Python and am writing a program that reads values from a .csv file and then displays a graph comparing the test results with the expected distribution under Benford's Law.

The .csv file contains loan values, which I need to read from the first column, as shown below:

Values  Leading Digit   Number of occurrences
170     1               88                   
900     9               62          
250     2               44          
450     4               51          
125     1               19          
.....

The main file, app.py:

...
filename = filedialog.askopenfilename(filetypes=(
    ("Excel files", "*.csv"), ("All files", "*.*")))
print(filename)
try:
    with open(filename, 'rt') as csvfile:
        reader = csv.reader(csvfile, delimiter=',')
        next(reader, None)  # skip the headers
        for row in reader:
            minutePriceCloses.append(row[0])
        # calculate the percentage distribution of leading digits
        benford_test_data_dist = calc.getBenfordDist(minutePriceChanges)
        ....
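
For comparison, here is a minimal sketch of reading the first column into a single list of numbers. Two things in the snippet above may matter: `csv.reader` yields strings, so `row[0]` is text until it is converted, and the loop appends to `minutePriceCloses` while `minutePriceChanges` is what gets passed on (whether that is intended cannot be told from the elided code). The helper name `read_first_column` and the sample data are assumptions made for illustration only.

```python
import csv
import io


def read_first_column(csvfile):
    """Return the first column of a CSV stream as floats, skipping the header."""
    reader = csv.reader(csvfile, delimiter=',')
    next(reader, None)  # skip the header row
    values = []
    for row in reader:
        if not row or not row[0].strip():
            continue  # ignore blank lines and empty cells
        values.append(float(row[0]))  # csv gives strings; convert explicitly
    return values


# Hypothetical sample that mirrors the column layout described in the question
sample = io.StringIO(
    "Values,Leading Digit,Number of occurrences\n"
    "170,1,88\n"
    "900,9,62\n"
    "250,2,44\n"
)
loan_values = read_first_column(sample)
print(loan_values)
```

With a real file, the same function could be called inside `with open(filename, 'rt', newline='') as csvfile:` and its result passed directly to the distribution function.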

in calc.py:

import numpy as np


def getBenfordDist(data):
    # set initial dist to zero
    dist = [0, 0, 0, 0, 0, 0, 0, 0, 0]
    # for each figure, check what the first non-zero digit is, hacky multiply
    # by 1000000 to handle small values
    for d in data:
        # sneaky multiply by 1000000 to ensure that the leading digit is unlikely to be zero
        # since benfords law is assumed to relate somehow to scale invariance, this *SHOULDN'T* make a difference
        # but it might, so this might all be wrong :-)
        s = str(np.abs(d) * 1000000)
        for i in range(0, 8):
            if(s.startswith(str(i + 1))):
                dist[i] = dist[i] + 1
                break
    # return fractions of the total for each digit
    percentDist = []
    # convert to % - todo, start using numpy vectors that allow scalar mult/div
    for count in dist:
        percentDist.append(float(count) / len(data))
        # print(float(count))
    return percentDist

The problem is that the graph does not display the correct proportions, i.e. the count of values for each leading digit divided by the total number of rows with values. For example, for values with a leading digit of 1, the proportion on the graph should be 0.25, and so on. There are 352 rows.
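
To make the intended calculation concrete, here is a minimal sketch of counting leading digits and dividing by the number of counted values. One detail worth noting about the loop in `calc.py`: `range(0, 8)` yields `i` from 0 to 7, so only the digits 1 to 8 are ever tested and a leading 9 is never counted. The names `leading_digit` and `benford_dist` are hypothetical, and this version reads the first non-zero digit from the string form instead of multiplying by 1000000; it is one possible approach, not necessarily the right fix for the full program.

```python
def leading_digit(value):
    """Return the first non-zero digit of a number, or None if it has none."""
    for ch in str(abs(float(value))):
        if ch in "123456789":
            return int(ch)
    return None  # e.g. the value is 0


def benford_dist(data):
    """Return the fraction of values starting with each digit 1..9."""
    counts = [0] * 9
    for value in data:
        digit = leading_digit(value)
        if digit is not None:
            counts[digit - 1] += 1  # digit 9 lands in index 8
    total = sum(counts)
    if total == 0:
        return [0.0] * 9
    return [count / total for count in counts]


# The five sample values from the table in the question
dist = benford_dist([170, 900, 250, 450, 125])
print(dist)
```

Here the fractions are taken over the values that actually have a leading digit, which keeps them summing to 1 even if the data contains zeros.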

Please help. Thanks

Ngoni X
  • "the percentage on graph should be 0.25 and so on" - and what are the results of running your code? They may very well be different from what you'd expect as probability theory is all about _probabilities_, so you won't get any exact results. – ForceBru May 06 '18 at 12:31
  • testing: 53, expected: 105.35000000000002; testing: 101, expected: 61.60000000000001; testing: 66, expected: 43.75; testing: 11, expected: 33.949999999999996; testing: 37, expected: 27.65; testing: 28, expected: 23.450000000000003; testing: 36, expected: 20.299999999999997; testing: 18, expected: 17.849999999999998; testing: 0, expected: 16.1 – Ngoni X May 06 '18 at 12:43
  • The first testing value shows 53, but it should be 88. – Ngoni X May 06 '18 at 12:44
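
As a cross-check on the expected figures quoted in the comments: the testing counts listed there sum to 350, and the expected values look consistent with Benford proportions scaled by that total. Benford's Law gives the probability of leading digit d as log10(1 + 1/d). The sketch below computes expected counts from that formula; the function name `benford_expected` is hypothetical, and the small differences from the quoted numbers (for example 105.36 versus 105.35) would be explained if the original program used proportions rounded to three decimals, which is an assumption.

```python
import math


def benford_expected(n):
    """Expected count for each leading digit 1..9 in a sample of size n."""
    return [n * math.log10(1 + 1 / d) for d in range(1, 10)]


expected = benford_expected(350)
for digit, count in enumerate(expected, start=1):
    print(digit, round(count, 2))
```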

0 Answers