
I am working with a client that has a 4D database. Tableau won't connect to it. (That's a whole other problem and if you know the answer to that let me know.) What we've decided to do is essentially keep two copies of the data. I am building a tool in Python that will take any arbitrary table from their database and store a copy of it in a MySQL database. It will then run periodically and update the data as new data is added.

I would prefer to use SQLAlchemy, but it does not support 4D. So I'm using pyodbc with pandas. I'm using

data_chunks = pandas.read_sql("SELECT * FROM table_name", con=pyodbc_connection, chunksize=100000)

Then I turn around and use

chunk_df.to_sql("table_name", con=sqlalchemy_mysql_connection, index=False, if_exists="append")

to write it to the MySQL database.
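Put together, my copy loop is essentially the following (sketched here with the chunk iterator and the writer stubbed out so the pattern is clear; `copy_table` and the fake chunks are illustrative names, not my actual code):

```python
import pandas as pd

def copy_table(chunks, write_chunk):
    """Stream DataFrame chunks from the source and hand each one to a writer."""
    total = 0
    for chunk_df in chunks:
        # in the real code this is chunk_df.to_sql(..., if_exists="append")
        write_chunk(chunk_df)
        total += len(chunk_df)
    return total

# Stand-in for pandas.read_sql(..., chunksize=100000), which yields DataFrames
fake_chunks = [pd.DataFrame({"id": [1, 2]}), pd.DataFrame({"id": [3]})]
written = []
rows_copied = copy_table(fake_chunks, written.append)
```

The failure happens while the chunk iterator itself is being consumed, before `write_chunk` is ever called.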

Unfortunately, some of the tables I'm reading contain corrupt data, and I get a ValueError: The year xxxxx is out of range.

The last call in the traceback was data = cursor.fetchmany(chunksize), which I believe is from pyodbc.

How can I read data from any arbitrary table and be able to handle the corrupt data gracefully and continue on?

ggrant
  • Sounds like you need to adjust the schema of the MySQL table to allow such "corrupt" data values. Year column may have a constraint of allowable values. Adjust for other columns. – Parfait Mar 06 '20 at 14:57
  • @Parfait It fails before I ever try to insert it into the MySQL table. I'm pretty sure it's failing when it tries to convert the SQL data types to Python data types. – ggrant Mar 06 '20 at 17:15
  • So you're getting the error when you execute the `read_sql`? – Gord Thompson Mar 06 '20 at 18:46
  • Please show fuller code not line snippets including full SQL statement for us to help. We need to see *year* object. – Parfait Mar 06 '20 at 19:23
  • Also, is the `xxxxx` year value really outside `range(1, 10000)`? If so, that sounds like a bug in the ODBC driver. – Gord Thompson Mar 06 '20 at 19:23
  • @GordThompson Yes the error occurs when I execute ```read_sql```. Also, the values were really out of range. Around 25600. Unfortunately, there is corrupt data in the 4D database and I can't really do anything about that. I just need to handle it the best I can. – ggrant Mar 09 '20 at 17:32
  • @Parfait I edited it to show the SQL query. It is simply a SELECT * FROM table query. There is no year object. The error happens when I run that one line with ```read_sql```. – ggrant Mar 09 '20 at 17:34
  • You might be able to use an [Output Converter function](https://github.com/mkleehammer/pyodbc/wiki/Using-an-Output-Converter-function) to work around the issue. If you can identify a specific row with bad data in a specific column then you could start by enabling an [ODBC trace](https://github.com/mkleehammer/pyodbc/wiki/Troubleshooting-%E2%80%93-Generating-an-ODBC-trace-log), run a query like`.read_sql("SELECT bad_col FROM table_name WHERE id_col=bad_id", ...` and see the ODBC data type of the offending column. – Gord Thompson Mar 09 '20 at 19:11

1 Answer


You could conceivably use a pyodbc Output Converter function to intercept the corrupted date values and "fix" them using code similar to this:

import datetime
import struct

def unpack_sql_type_timestamp(raw_bytes):
    # raw_bytes is a 16-byte SQL_TIMESTAMP_STRUCT: a signed short year,
    # five unsigned shorts (month, day, hour, minute, second), and an
    # unsigned int fraction (nanoseconds)
    y, m, d, h, n, s, f = struct.unpack("<h5HI", raw_bytes)
    # clamp out-of-range years into datetime's supported range (1..9999)
    if y > 9999:
        y = 9999
    elif y < 1:
        y = 1
    # datetime takes microseconds; the ODBC fraction is in nanoseconds
    return datetime.datetime(y, m, d, h, n, s, f // 1000)

pyodbc_connection = pyodbc.connect(connection_string)

pyodbc_connection.add_output_converter(
    pyodbc.SQL_TYPE_TIMESTAMP, 
    unpack_sql_type_timestamp
)

data_chunks = pandas.read_sql_query(
    "SELECT * FROM table_name", 
    con=pyodbc_connection, 
    chunksize=100000
)
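To sanity-check the converter without a live connection, you can feed it hand-packed bytes in the 16-byte SQL_TIMESTAMP_STRUCT layout assumed above (the clamping logic is repeated here, with the nanosecond fraction scaled to microseconds, so this snippet runs standalone):

```python
import datetime
import struct

def unpack_sql_type_timestamp(raw_bytes):
    y, m, d, h, n, s, f = struct.unpack("<h5HI", raw_bytes)
    # clamp the year into datetime's supported range (1..9999)
    y = min(max(y, 1), 9999)
    # ODBC fraction is nanoseconds; datetime wants microseconds
    return datetime.datetime(y, m, d, h, n, s, f // 1000)

# Hand-pack a timestamp with the kind of corrupt year the question describes
raw = struct.pack("<h5HI", 25600, 3, 6, 12, 0, 0, 0)
fixed = unpack_sql_type_timestamp(raw)
```

Here `fixed` comes back as a valid datetime with the year clamped to 9999 instead of raising ValueError.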
Gord Thompson