
I have some code in Python which connects to SQL Server and returns a crosstab of data using pyodbc.

I then perform some statistical analysis on that data row by row, as each row contains statistical data unique to a vendor. Most of the code works fine: I can iterate through each column in each row and return lots of useful statistics. I've also got matplotlib working, creating a scatter plot and drawing an OLS regression line over the data. The last piece this script needs is the outliers, which I get using the following:

import statsmodels.api as sm

results = sm.OLS(y, sm.add_constant(x)).fit()
test = results.outlier_test()

The test data look like this:

[[ -1.66666636e-01   8.70954193e-01   1.00000000e+00]
[  1.85524023e-01   8.56527067e-01   1.00000000e+00]
[ -5.07693609e-01   6.22677469e-01   1.00000000e+00]
[ -5.22476578e-01   6.12716252e-01   1.00000000e+00]
[ -5.40267858e-01   6.00836859e-01   1.00000000e+00]
[ -5.61134260e-01   5.87059066e-01   1.00000000e+00]
[  1.11050592e+01   6.03423147e-07   7.84450092e-06]
[  1.97665390e-01   8.47267021e-01   1.00000000e+00]
[ -3.10806108e-01   7.62329771e-01   1.00000000e+00]
[ -2.02176433e-01   8.43832634e-01   1.00000000e+00]
[  4.36313403e-02   9.66057205e-01   1.00000000e+00]
[ -2.89236184e-01   7.78308296e-01   1.00000000e+00]
[ -5.49558759e-01   5.94681341e-01   1.00000000e+00]]

To iterate through this and determine outliers in these data:

outliers = ([x[i],y[i]] for i,t in enumerate(test) if t[2] < 0.5)

This runs in a for loop, `for row in rows`, where `rows` are the rows returned from the SQL query. So I'm performing this outlier test on each row of data, to find which data points are potential outliers in that particular result set (the crosstab is actually a calculated result set pulled from another query, but that's not important here). The `test` output shown above is for one row of data.

In some cases, there are multiple outliers in the data. However, I can't seem to find a way to iterate through every item in the `outliers` object. No matter what I try, I only get the first outlier result in `outliers`, even when there are multiple known outliers. It's not a problem with the data, because I've verified it through other methods. I just can't seem to iterate through the `outliers` generator object.

I'm fairly new to generator objects, but I have done some research on them. I have a decent understanding of how they work, but even code that I thought was working, wasn't.

Using something like

for i in outliers:
    print i

I only get the first outlier in the list: [6, 136.84]. (In the example data given above, this is the only outlier, but as stated elsewhere in this question, even when there are multiple known outliers, only the first one in the set is returned.)

Using

for i in list(outliers):
    print i

gives me the same results.

Using

for i in list(next(outliers)):
    print i

results in returning the two values, x and y on separate lines, as if iterating through the sublist [x, y] rather than the outliers generator.
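A minimal sketch of why this happens, assuming made-up `x`, `y`, and `test` values in place of the real query data: `next(outliers)` pulls a single `[x, y]` pair out of the generator, and `list(...)` of that pair is just a copy of the pair, so the loop walks over its two numbers.

```python
# Hypothetical stand-in data; the real x, y and test come from
# the SQL query and results.outlier_test().
x = [5, 6, 7]
y = [20.0, 136.84, 22.5]
test = [
    [-0.17, 0.87, 1.0],
    [11.1, 6.0e-07, 7.8e-06],  # the only row with t[2] < 0.5
    [0.20, 0.85, 1.0],
]

outliers = ([x[i], y[i]] for i, t in enumerate(test) if t[2] < 0.5)

first = next(outliers)      # one [x, y] pair, not the whole generator
printed = []
for value in list(first):   # iterates over the two numbers inside the pair
    printed.append(value)
    print(value)
```

Looping over `outliers` directly, without `next`, is what yields one pair per iteration.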

The last thing I've tried is various permutations of this

try:
    for i in next(outliers):
        print list(next(outliers))
except StopIteration:
    pass

I'll note that this particular code doesn't actually print anything.

I've also tried

try:
    if next(outliers, None) != None:
        print list(outliers)
except StopIteration:
    pass

which results in

[]
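The empty list is consistent with a generator being single-pass: the `next` call consumes the one matching pair, so nothing is left for `list` to collect. A small sketch, again with hypothetical stand-in data rather than the real result set:

```python
# Hypothetical stand-in data with exactly one row where t[2] < 0.5.
x = [5, 6, 7]
y = [20.0, 136.84, 22.5]
test = [[-0.17, 0.87, 1.0], [11.1, 6.0e-07, 7.8e-06], [0.20, 0.85, 1.0]]

outliers = ([x[i], y[i]] for i, t in enumerate(test) if t[2] < 0.5)

consumed = next(outliers, None)  # takes the single outlier out of the generator
remaining = list(outliers)       # the generator is now exhausted
print(consumed)
print(remaining)

# A list comprehension builds all pairs up front, so they can be
# checked for emptiness and iterated more than once.
outlier_list = [[x[i], y[i]] for i, t in enumerate(test) if t[2] < 0.5]
if outlier_list:
    print(outlier_list)
```

Testing the list for truthiness avoids spending an element on the emptiness check.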

I had something working yesterday but for some reason it was skipping some results and I couldn't figure out why. Unfortunately, I lost that code and now I'm back at square one unable to make any further progress.

EDIT:

Here is a test data set which contains multiple outliers:

[[-1.06904497  0.31017523  1.        ]
[        inf  0.          0.        ]
[-0.74947341  0.47083534  1.        ]
[-0.61974867  0.54928322  1.        ]
[-0.50178907  0.62667871  1.        ]
[-0.3917734   0.70344466  1.        ]
[-0.28680336  0.78011746  1.        ]
[-0.18448201  0.85732288  1.        ]
[-0.08262629  0.93577921  1.        ]
[ 0.02097215  0.98368044  1.        ]
[ 0.12880164  0.90006829  1.        ]
[ 0.24397502  0.8121827   1.        ]
[ 0.37079182  0.71852659  1.        ]]

EDIT OF THE EDIT:

This question can be closed. I must have been getting false positives on which data points were outliers yesterday, which resulted in my thinking the generator iteration wasn't working when I was only getting one result today. After using some test data which for certain contains outliers, my generator iteration is working correctly.
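For reference, a sanity check along these lines can be sketched as follows. The `test` rows are forced by hand (not real `outlier_test()` output), `find_outliers` is a hypothetical helper name, and the `0.5` cutoff is simply the one used in the question. As far as I understand, the third column of `outlier_test()` is the Bonferroni-corrected p-value.

```python
def find_outliers(x, y, test, alpha=0.5):
    """Return [x, y] pairs whose third test column is below alpha.

    Hypothetical helper; alpha=0.5 mirrors the cutoff in the question.
    """
    return [[xi, yi] for xi, yi, t in zip(x, y, test) if t[2] < alpha]


# Hand-made data with two rows forced below the cutoff.
x = [1, 2, 3, 4]
y = [10.0, 250.0, 12.0, -180.0]
test = [
    [-0.5, 0.62, 1.0],
    [9.8, 1.0e-06, 1.3e-05],   # forced outlier
    [0.1, 0.90, 1.0],
    [-8.7, 4.0e-06, 5.2e-05],  # forced outlier
]

found = find_outliers(x, y, test)
for pair in found:
    print(pair)
```

With more than one row under the cutoff, every matching pair is produced, which matches the behaviour described above.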

LegendaryDude
  • I think you need `outliers = [...]` instead of `outliers = (...)`. – dparpyani Apr 18 '14 at 15:23
  • Even using `outliers = [...]`, it still only returns the first `[x,y]` pairing in `outliers`, rather than every `[x,y]` pairing generated. – LegendaryDude Apr 18 '14 at 15:26
  • Oops, nevermind. I didn't know `(...)` created a generator. – dparpyani Apr 18 '14 at 15:27
  • What did you expect from `next`? `next` returns the following element in the iteration. `outliers` produces 2-element lists, so `next` returns *one* 2-element list, the call to `list` copies it into a list of two elements, and you are iterating over that. I really have no idea why you add a call to `next` if you want to iterate over the whole generator. – Bakuriu Apr 18 '14 at 15:44
  • Hi Bakuriu, I guess that illustrates my limited understanding of generators, or I'm having a hard time wrapping my head around the logic. What would you suggest? – LegendaryDude Apr 18 '14 at 15:47
  • In `test`, there is only one line that matches `t[2] < 0.5`, that's `[ 1.11050592e+01 6.03423147e-07 7.84450092e-06]`. Therefore `outliers` contains only 1 element (all the others have `t[2] == 1`). – njzk2 Apr 18 '14 at 15:48
  • njzk2, I know this -- the `test` data given in the example does not contain multiple outliers, however I do have other data that does. Let me add that to the post. – LegendaryDude Apr 18 '14 at 15:52
  • Your first loop is the right way to iterate over a generator. Your problem suggests that you have an error in the generator expression itself. Could you provide some test data with multiple known outliers where this still only gives one? – lvc Apr 18 '14 at 15:52
  • @lvc: Added a `test` set which I believe contains multiple outliers. – LegendaryDude Apr 18 '14 at 15:56
  • @devOpsEv that data still only has one row that matches t[2] < 0.5, which is row 1, `[inf 0. 0.]` - all other rows have t[2] being 1, which is bigger than 0.5. Are you sure that t[2] < 0.5 is the right test? – lvc Apr 18 '14 at 16:04
  • do `print list(outliers)` in a case where there is more than 1 outlier. – Scott Apr 18 '14 at 16:07
  • @lvc, Based on this, [Can scipy.stats identify and mask obvious outliers?](http://stackoverflow.com/questions/10231206/can-scipy-stats-identify-and-mask-obvious-outliers), which is where I got the code to create the generator, yes. Perhaps I was mistaken about that data set. I'm working with 8 or 9 different sets, and I can't recall which one passed the "multiple outliers" test. – LegendaryDude Apr 18 '14 at 16:08
  • Now I'm not certain about my data. Running a test against `test` where I force a few outliers, I get printed results for each outlier. I must have been doing something to my data yesterday to cause false positives on the outliers that I have since undone. Back to the drawing board... Thanks for the comments, everyone. At least this question served as a good sanity check for myself. – LegendaryDude Apr 18 '14 at 16:24
