I have some Python code that connects to SQL Server using pyodbc and returns a crosstab of data.
I then perform some statistical analysis on that data row by row, as each row contains statistical data unique to a vendor. I've got most of the code working just fine: I can iterate through each column in each row and return lots of useful statistical analysis. I've also got matplotlib working, creating a scatter plot and drawing an OLS regression line over the data. The last piece of this script I need is to get the outliers, which I get using the following:
import statsmodels.api as sm
results = sm.OLS(y, sm.add_constant(x)).fit()
test = results.outlier_test()
The test data look like this:
[[ -1.66666636e-01 8.70954193e-01 1.00000000e+00]
[ 1.85524023e-01 8.56527067e-01 1.00000000e+00]
[ -5.07693609e-01 6.22677469e-01 1.00000000e+00]
[ -5.22476578e-01 6.12716252e-01 1.00000000e+00]
[ -5.40267858e-01 6.00836859e-01 1.00000000e+00]
[ -5.61134260e-01 5.87059066e-01 1.00000000e+00]
[ 1.11050592e+01 6.03423147e-07 7.84450092e-06]
[ 1.97665390e-01 8.47267021e-01 1.00000000e+00]
[ -3.10806108e-01 7.62329771e-01 1.00000000e+00]
[ -2.02176433e-01 8.43832634e-01 1.00000000e+00]
[ 4.36313403e-02 9.66057205e-01 1.00000000e+00]
[ -2.89236184e-01 7.78308296e-01 1.00000000e+00]
[ -5.49558759e-01 5.94681341e-01 1.00000000e+00]]
To iterate through this array and pick out the outliers in the data, I use:
outliers = ([x[i],y[i]] for i,t in enumerate(test) if t[2] < 0.5)
This runs in a for loop, for row in rows, where rows are the rows returned from the SQL query. So I'm performing this outlier test on each row of data, to find which data points are potential outliers in that particular result set (the crosstab is actually a calculated result set pulled from another query, but that's not important here). This test output is for one row of the data.
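As a minimal sketch of what this filter does, the snippet below applies it to the test array shown above, written out as plain Python lists. According to the statsmodels documentation, the three columns returned by outlier_test() are the studentized residual, the unadjusted p-value, and the corrected p-value (Bonferroni by default), so t[2] is the corrected p-value. The x and y values here are placeholders, except index 6, which is set to match the outlier [6, 136.84] reported further down.

```python
# Rows copied from the test output above:
# [studentized residual, unadjusted p-value, corrected p-value]
test = [
    [-1.66666636e-01, 8.70954193e-01, 1.00000000e+00],
    [1.85524023e-01, 8.56527067e-01, 1.00000000e+00],
    [-5.07693609e-01, 6.22677469e-01, 1.00000000e+00],
    [-5.22476578e-01, 6.12716252e-01, 1.00000000e+00],
    [-5.40267858e-01, 6.00836859e-01, 1.00000000e+00],
    [-5.61134260e-01, 5.87059066e-01, 1.00000000e+00],
    [1.11050592e+01, 6.03423147e-07, 7.84450092e-06],
    [1.97665390e-01, 8.47267021e-01, 1.00000000e+00],
    [-3.10806108e-01, 7.62329771e-01, 1.00000000e+00],
    [-2.02176433e-01, 8.43832634e-01, 1.00000000e+00],
    [4.36313403e-02, 9.66057205e-01, 1.00000000e+00],
    [-2.89236184e-01, 7.78308296e-01, 1.00000000e+00],
    [-5.49558759e-01, 5.94681341e-01, 1.00000000e+00],
]

# Placeholder data (assumption): only index 6 reflects the question.
x = list(range(13))
y = [10.0] * 13
y[6] = 136.84

# Same generator expression as in the question.
outliers = ([x[i], y[i]] for i, t in enumerate(test) if t[2] < 0.5)
found = list(outliers)
print(found)  # [[6, 136.84]]
```

Only the row at index 6 has a corrected p-value below 0.5, so a single pair is the expected result for this array.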
In some cases, there are multiple outliers in the data. However, I can't seem to find a way to iterate through every item in the outliers object. No matter what I try, I only get the first outlier result in outliers, even when there are multiple known outliers. It's not a problem with the data, because I've verified it through other methods; I just can't seem to iterate through the outliers generator object.
I'm fairly new to generator objects, but I have done some research on them. I have a decent understanding of how they work, but even code that I thought was working turned out not to be.
Using something like
for i in outliers:
    print(i)
I only get the first outlier in the list: [6, 136.84]
(In the example data given above, this is the only outlier, but as stated elsewhere in this question, even when there are multiple known outliers, only the first one in the set is returned.)
Using
for i in list(outliers):
    print(i)
gives me the same results.
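For context, a small sketch with made-up pairs (not the real query data): a generator is single-pass, so looping over it directly and looping over list(outliers) visit exactly the same items, and any second pass over the same generator yields nothing.

```python
# Hypothetical outlier pairs, used only for illustration.
pairs = [[6, 136.84], [9, 212.5]]

outliers = (p for p in pairs)

first_pass = list(outliers)   # consumes every item in the generator
second_pass = list(outliers)  # the generator is now exhausted

print(first_pass)   # [[6, 136.84], [9, 212.5]]
print(second_pass)  # []
```

If the results are needed more than once, keeping the list (here first_pass) rather than the generator avoids the empty second pass.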
Using
for i in list(next(outliers)):
    print(i)
prints the two values, x and y, on separate lines, as if iterating through the sublist [x, y] rather than the outliers generator.
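That behaviour is what next() is expected to produce. A sketch with hypothetical pairs: next(outliers) returns a single item from the generator, which is itself an [x, y] list, so wrapping it in list() and looping over it walks the two numbers of that one pair.

```python
# Hypothetical outlier pairs, used only for illustration.
pairs = [[6, 136.84], [9, 212.5]]
outliers = (p for p in pairs)

first = next(outliers)                # one item from the generator: [6, 136.84]
values = [i for i in list(first)]     # iterates the pair itself: 6, then 136.84
remaining = list(outliers)            # whatever the generator had left

print(first)      # [6, 136.84]
print(values)     # [6, 136.84]
print(remaining)  # [[9, 212.5]]
```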
The last thing I've tried is various permutations of this:
try:
    for i in next(outliers):
        print(list(next(outliers)))
except StopIteration:
    pass
I'll note that this particular code doesn't actually print anything.
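A sketch of why nothing is printed when the generator holds only one pair, as with the example array: the outer next(outliers) consumes that pair, and the inner next(outliers) then raises StopIteration before anything is printed. The except clause swallows the exception silently. The printed list below is a stand-in for print so the effect can be inspected.

```python
printed = []  # stand-in for print output, so the result can be checked

# A generator holding the single outlier from the example data.
outliers = (p for p in [[6, 136.84]])

try:
    for i in next(outliers):                  # consumes the only pair
        printed.append(list(next(outliers)))  # generator is empty: StopIteration
except StopIteration:
    pass                                      # the error is silently discarded

print(printed)  # []
```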
I've also tried
try:
    if next(outliers, None) is not None:
        print(list(outliers))
except StopIteration:
    pass
which results in
[]
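The empty list is consistent with how next() works: calling next(outliers, None) as a check removes the first item from the generator, so list(outliers) only sees what comes after it. A sketch with a hypothetical helper, rest_after_check, and made-up pairs:

```python
def rest_after_check(pairs):
    """Hypothetical helper mirroring the attempt above."""
    outliers = (p for p in pairs)
    if next(outliers, None) is not None:  # this call consumes the first item
        return list(outliers)             # only the remaining items are left
    return None

one = rest_after_check([[6, 136.84]])
two = rest_after_check([[6, 136.84], [9, 212.5]])

print(one)  # []
print(two)  # [[9, 212.5]]
```

With a single outlier the check consumes it and nothing remains; with two, the first is dropped and only the second is returned.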
I had something working yesterday, but for some reason it was skipping some results and I couldn't figure out why. Unfortunately, I lost that code, and now I'm back at square one, unable to make any further progress.
EDIT:
Here is a test data set which contains multiple outliers:
[[-1.06904497 0.31017523 1. ]
[ inf 0. 0. ]
[-0.74947341 0.47083534 1. ]
[-0.61974867 0.54928322 1. ]
[-0.50178907 0.62667871 1. ]
[-0.3917734 0.70344466 1. ]
[-0.28680336 0.78011746 1. ]
[-0.18448201 0.85732288 1. ]
[-0.08262629 0.93577921 1. ]
[ 0.02097215 0.98368044 1. ]
[ 0.12880164 0.90006829 1. ]
[ 0.24397502 0.8121827 1. ]
[ 0.37079182 0.71852659 1. ]]
EDIT OF THE EDIT:
This question can be closed. I must have been getting false positives on which data points were outliers yesterday, which resulted in my thinking the generator iteration wasn't working when I was only getting one result today. After using some test data which for certain contains outliers, my generator iteration is working correctly.
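For anyone who wants to check the same thing, here is a minimal sketch. The rows, x, and y below are hypothetical values in the same three-column layout as the test output, chosen so that two rows fall below the 0.5 threshold; they are not taken from the real query.

```python
# Hypothetical rows: [studentized residual, unadjusted p-value, corrected p-value]
test = [
    [-0.5, 0.62, 1.0],
    [11.1, 6.0e-07, 7.8e-06],  # below the 0.5 threshold
    [0.2, 0.85, 1.0],
    [4.9, 3.0e-04, 3.9e-03],   # below the 0.5 threshold
    [-0.3, 0.76, 1.0],
]
x = [0, 1, 2, 3, 4]                   # hypothetical x values
y = [10.2, 136.84, 11.1, 98.6, 10.7]  # hypothetical y values

# Same generator expression as in the question.
outliers = ([x[i], y[i]] for i, t in enumerate(test) if t[2] < 0.5)

results = [pair for pair in outliers]  # a plain loop over the generator
print(results)  # [[1, 136.84], [3, 98.6]]
```

Both qualifying rows come back, which matches the conclusion that iterating the generator works when the data really contains more than one outlier.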