
The script below is meant to show how average click-through rate (CTR) differs by keyword ranking position, highlighting queries/pages with underperforming CTRs.

Until recently it worked fine, but it now raises the ZeroDivisionError shown below.

import os
import sys
import math
from statistics import median
import numpy as np
import pandas as pd

in_file = 'data.csv'
thresh = 5

df = pd.read_csv(in_file)
# Round position to tenths
df = df.round({'position': 1})
# Restrict garbage 1 impression, 1 click, 100% CTR entries
df = df[df.clicks >= thresh]
df.head()

def apply_stats(row, df):

    if int(row['impressions']) > 5:

        ctr = float(row['ctr'])
        pos = row['position']

        # Median
        median_ctr = median(df.ctr[df.position==pos])
        # Mad
        mad_ctr = df.ctr[df.position==pos].mad()

        row['score'] = round(float( (1 * (ctr - median_ctr))/mad_ctr ), 3 ) 
        row['mad'] = mad_ctr
        row['median'] = median_ctr

    return row

df = df.apply(apply_stats, args=(df,), axis = 1)
df.to_csv('out2_' + in_file)
df.head()

The error I'm receiving is this:

-----------------------------------------
ZeroDivisionError                         Traceback (most recent call last)
<ipython-input-33-f1eef41d1c9a> in <module>()
----> 1 df = df.apply(apply_stats, args=(df,), axis = 1)
      2 df.to_csv('out2_' + in_file)
      3 df.head()

~\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, result_type, args, **kwds)
   6002                          args=args,
   6003                          kwds=kwds)
-> 6004         return op.get_result()
   6005 
   6006     def applymap(self, func):

~\Anaconda3\lib\site-packages\pandas\core\apply.py in get_result(self)
    140             return self.apply_raw()
    141 
--> 142         return self.apply_standard()
    143 
    144     def apply_empty_result(self):

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_standard(self)
    246 
    247         # compute the result using the series generator
--> 248         self.apply_series_generator()
    249 
    250         # wrap results

~\Anaconda3\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
    275             try:
    276                 for i, v in enumerate(series_gen):
--> 277                     results[i] = self.f(v)
    278                     keys.append(v.name)
    279             except Exception as e:

~\Anaconda3\lib\site-packages\pandas\core\apply.py in f(x)
     72         if kwds or args and not isinstance(func, np.ufunc):
     73             def f(x):
---> 74                 return func(x, *args, **kwds)
     75         else:
     76             f = func

<ipython-input-32-900a8cda8fce> in apply_stats(row, df)
     11         mad_ctr = df.ctr[df.position==pos].mad()
     12 
---> 13         row['score'] = round(float( (1 * (ctr - median_ctr))/mad_ctr ), 3 )
     14         row['mad'] = mad_ctr
     15         row['median'] = median_ctr

ZeroDivisionError: ('float division by zero', 'occurred at index 317')

In the CSV, clicks and impressions are integers, and ctr and position are floats.

Is there an error in the script, or is this more likely a data formatting issue?

U13-Forward
I doubt we'll be able to replicate this issue, since we don't have access to `data.csv`. Please supply a [mcve]. – jpp Jun 26 '18 at 10:17

2 Answers


It looks like you're getting a row where mad_ctr is zero, so just add a check for that case:

row['score'] = round(float( (1 * (ctr - median_ctr))/mad_ctr ), 3 ) if mad_ctr != 0 else 0

This will set score to zero if mad_ctr is zero. But you could also use None or some other default value if you prefer.
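
If you'd rather not scan the whole frame once per row, here is a minimal sketch of the same guard written with groupby/transform instead of the row-wise apply. The helper name `add_scores` and the sample values are made up for illustration, the mean absolute deviation is computed by hand rather than with `.mad()`, and the `impressions > 5` filter from the question is left out for brevity. It uses NaN as the default rather than 0, so rows that could not be scored stay distinguishable from rows that sit exactly on the median.

```python
import numpy as np
import pandas as pd

def add_scores(df):
    """Add per-position median, mad and score columns (hypothetical helper).

    score is left as NaN wherever the group's mad is zero, instead of
    dividing by zero.
    """
    out = df.copy()
    grouped = out.groupby("position")["ctr"]
    out["median"] = grouped.transform("median")
    # Mean absolute deviation around the mean, computed by hand
    out["mad"] = grouped.transform(lambda s: (s - s.mean()).abs().mean())
    # Swap zero deviations for NaN so the division cannot blow up
    safe_mad = out["mad"].replace(0, np.nan)
    out["score"] = ((out["ctr"] - out["median"]) / safe_mad).round(3)
    return out

# Made-up sample: position 1.0 has identical CTRs, position 3.2 has one row
sample = pd.DataFrame({
    "position": [1.0, 1.0, 2.5, 2.5, 3.2],
    "ctr": [0.25, 0.25, 0.125, 0.5, 0.0625],
})
scored = add_scores(sample)
print(scored)
```

Swap `np.nan` for `0` in the `replace` call if you want the same default as the one-liner above.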

Richard Inglis

If I read the error correctly, at some point you hit a row for which mad_ctr, the divisor used to compute the score, is equal to zero (it seems to happen for the row at index 317).

Since the mad function computes the mean absolute deviation, it is likely that all the ctr values at that row's position are the same (or that it is the only row at that position), so the deviation is zero.

It's a problem related to the data you have and the things you want to compute.
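
To check whether this is what is happening, you could list the positions whose deviation is zero. The snippet below is only a sketch on made-up values (the `sample` frame and the `mean_abs_dev` helper are not from the question); it computes the mean absolute deviation by hand, which also sidesteps the fact that `Series.mad` was deprecated in pandas 1.5 and removed in 2.0. Rounding position to tenths and dropping low-click rows both make single-row groups more likely, which may be why the script only started failing on newer data.

```python
import pandas as pd

def mean_abs_dev(s):
    """Mean absolute deviation around the mean, as Series.mad computed it."""
    return (s - s.mean()).abs().mean()

# Made-up sample: position 1.0 has identical CTRs, position 3.2 has one row
sample = pd.DataFrame({
    "position": [1.0, 1.0, 2.5, 2.5, 3.2],
    "ctr": [0.25, 0.25, 0.125, 0.5, 0.0625],
})

# One deviation value per position
mad_by_position = sample.groupby("position")["ctr"].apply(mean_abs_dev)

# Positions where dividing by the deviation would raise ZeroDivisionError
zero_mad_positions = mad_by_position[mad_by_position == 0].index.tolist()
print(zero_mad_positions)
```

Running the same groupby on your real frame should tell you which positions (and therefore which rows, such as index 317) need special handling.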

ayhan