0

In Python 2.7, I'm connecting to an external data source using the following:

import pypyodbc
import pandas as pd
import datetime
import csv
import boto3
import os

# Connect to the DataSource
conn = pypyodbc.connect("DSN = FAKE DATA SOURCE; UID=FAKEID; PWD=FAKEPASSWORD")

# Specify the query we're going to run on it
script = ("SELECT * FROM table")

# Create a dataframe from the above query
df = pd.read_sql_query(script, conn)

I get the following error:

C:\Python27\python.exe "C:/Thing.py"
Traceback (most recent call last):
  File "C:/Thing.py", line 30, in <module>
   df = pd.read_sql_query(script,conn)
  File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 431, in read_sql_query
   parse_dates=parse_dates, chunksize=chunksize)
  File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 1608, in read_query
   data = self._fetchall_as_list(cursor)
  File "C:\Python27\lib\site-packages\pandas-0.18.1-py2.7-win32.egg\pandas\io\sql.py", line 1617, in _fetchall_as_list
   result = cur.fetchall()
  File "build\bdist.win32\egg\pypyodbc.py", line 1819, in fetchall
  File "build\bdist.win32\egg\pypyodbc.py", line 1871, in fetchone
ValueError: could not convert string to float: ?

It's seems to me that in one of the float columns, there is a '?' symbol for some reason. I've reached out to the owner of the data source, but they cannot change the underlying table.

Is there a way to replace incorrect data like this using pandas? I've tried using replace after the read_sql_query statement, but I get the same error.

user2752159
  • 1,182
  • 3
  • 13
  • 29
  • 2
    Please show the complete traceback. Using `replace` *after* `read_sql_query` won't help if you get the error already with the code you've shown. If you don't get the error already with the code you've shown, then you need to show the code that causes the error. – BrenBarn May 18 '16 at 18:09
  • Sorry! I've added the full traceback. The code above causes the error on it's own. – user2752159 May 18 '16 at 19:38
  • Can you successfully read the data from the table using pyodbc directly? The error doesn't seem to have anything to do with pandas. – BrenBarn May 19 '16 at 02:40
  • No, even when I use a SQL client on the database I get this error. Sorry, I think my title isn't specific enough. My intent was to ask if I can use pandas to ignore or replace the incorrect data that is coming into the dataframe in the `read_sql_query` statement – user2752159 May 19 '16 at 14:07
  • Let me know if I should edit either the Title or the question to make it more clear, sorry for the confusion. I'd like to use pandas to replace the incorrect data with pd.nan, but I'm open to any solution. Sorry for my newbish-ness. – user2752159 May 19 '16 at 15:00
  • 1
    You can't use pandas to do anything with the data if you can't even read the data into pandas. There is no data coming into the DataFrame because the error is happening during the reading of the data, so the data is never actually fully read. Whatever the problem is with your data, you'll need to resolve it some other way before you can make use of pandas to manipulate it. – BrenBarn May 19 '16 at 17:50
  • Thanks for your help, it's clear I should change the scope of the question. Would you recommend I edit it here, or start a new thread? My question should have been is there a way to handle this incorrect data in Python so that it can get into pandas in the first place. – user2752159 May 20 '16 at 17:55
  • Probably better to ask a new question, since almost none of what's in this question will be relevant there. In that new question, you'll want to show a code example that doesn't use pandas (and ideally doesn't use anything but a simple DB connection), and if possible sample data. However, you may have the same problem: if you can't read the data using pyodbc (or whatever database library), you likely can't modify it that way either. – BrenBarn May 20 '16 at 18:32

1 Answers1

0

Hard to know for certain without having your data obviously, but you could try setting coerce_float to False, i.e. replace your last line with

df = pd.read_sql_query(script, conn, coerce_float=False)

See the documentation of read_sql_query.

Nils Gudat
  • 13,222
  • 3
  • 39
  • 60
  • Thanks for the answer! Unfortunately running `df = pd.read_sql_query(script, conn, coerce_float=False)` yields the same error, full traceback is above. – user2752159 May 18 '16 at 19:42