
I am working with a client that has a 4D database. Tableau won't connect to it. (That's a whole other problem and if you know the answer to that let me know.) What we've decided to do is essentially keep two copies of the data. I am building a tool in Python that will take any arbitrary table from their database and store a copy of it in a MySQL database. It will then run periodically and update the data as new data is added.

I would prefer to use SQLAlchemy, but it does not support 4D. So I'm using pyodbc with pandas. I'm using

data_chunks = pandas.read_sql("SELECT * FROM table_name", con=pyodbc_connection, chunksize=100000)

Then I turn around and use

chunk_df.to_sql("table_name", con=sqlalchemy_mysql_connection, index=False, if_exists="append")

to write it to the MySQL database.
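Put together, my copy loop is essentially the following (sketched here with the chunk iterator and the writer stubbed out so the pattern is clear; `copy_table` and the fake chunks are illustrative names, not my actual code):

```python
import pandas as pd

def copy_table(chunks, write_chunk):
    """Stream DataFrame chunks from the source and hand each one to a writer."""
    total = 0
    for chunk_df in chunks:
        # in the real code this is chunk_df.to_sql(..., if_exists="append")
        write_chunk(chunk_df)
        total += len(chunk_df)
    return total

# Stand-in for pandas.read_sql(..., chunksize=100000), which yields DataFrames
fake_chunks = [pd.DataFrame({"id": [1, 2]}), pd.DataFrame({"id": [3]})]
written = []
rows_copied = copy_table(fake_chunks, written.append)
```

The failure happens while the chunk iterator itself is being consumed, before `write_chunk` is ever called.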

Unfortunately, some of the tables I'm reading contain corrupt data, and I get a ValueError: The year xxxxx is out of range.

The last call in the traceback was data = cursor.fetchmany(chunksize), which I believe is from pyodbc.

How can I read data from any arbitrary table and be able to handle the corrupt data gracefully and continue on?

ggrant
  • Sounds like you need to adjust the schema of the MySQL table to allow such "corrupt" data values. Year column may have a constraint of allowable values. Adjust for other columns. – Parfait Mar 06 '20 at 14:57
  • @Parfait It fails before I ever try to insert it into the MySQL table. I'm pretty sure it's failing when it tries to convert the SQL data types to Python data types. – ggrant Mar 06 '20 at 17:15
  • So you're getting the error when you execute the `read_sql`? – Gord Thompson Mar 06 '20 at 18:46
  • Please show fuller code not line snippets including full SQL statement for us to help. We need to see *year* object. – Parfait Mar 06 '20 at 19:23
  • Also, is the `xxxxx` year value really outside `range(1, 10000)`? If so, that sounds like a bug in the ODBC driver. – Gord Thompson Mar 06 '20 at 19:23
  • @GordThompson Yes the error occurs when I execute ```read_sql```. Also, the values were really out of range. Around 25600. Unfortunately, there is corrupt data in the 4D database and I can't really do anything about that. I just need to handle it the best I can. – ggrant Mar 09 '20 at 17:32
  • @Parfait I edited it to show the SQL query. It is simply a SELECT * FROM table query. There is no year object. The error happens when I run that one line with ```read_sql```. – ggrant Mar 09 '20 at 17:34
  • You might be able to use an [Output Converter function](https://github.com/mkleehammer/pyodbc/wiki/Using-an-Output-Converter-function) to work around the issue. If you can identify a specific row with bad data in a specific column then you could start by enabling an [ODBC trace](https://github.com/mkleehammer/pyodbc/wiki/Troubleshooting-%E2%80%93-Generating-an-ODBC-trace-log), run a query like`.read_sql("SELECT bad_col FROM table_name WHERE id_col=bad_id", ...` and see the ODBC data type of the offending column. – Gord Thompson Mar 09 '20 at 19:11

1 Answer


You could conceivably use a pyodbc Output Converter function to intercept the corrupted date values and "fix" them using code similar to this:

import datetime
import struct

def unpack_sql_type_timestamp(raw_bytes):
    # raw_bytes is a 16-byte SQL_TIMESTAMP_STRUCT: a signed short year,
    # five unsigned shorts (month, day, hour, minute, second), and an
    # unsigned int fraction (nanoseconds)
    y, m, d, h, n, s, f = struct.unpack("<h5HI", raw_bytes)
    # clamp out-of-range years into datetime's supported range (1..9999)
    if y > 9999:
        y = 9999
    elif y < 1:
        y = 1
    # datetime takes microseconds; the ODBC fraction is in nanoseconds
    return datetime.datetime(y, m, d, h, n, s, f // 1000)

pyodbc_connection = pyodbc.connect(connection_string)

pyodbc_connection.add_output_converter(
    pyodbc.SQL_TYPE_TIMESTAMP, 
    unpack_sql_type_timestamp
)

data_chunks = pandas.read_sql_query(
    "SELECT * FROM table_name", 
    con=pyodbc_connection, 
    chunksize=100000
)
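To sanity-check the converter without a live connection, you can feed it hand-packed bytes in the 16-byte SQL_TIMESTAMP_STRUCT layout assumed above (the clamping logic is repeated here, with the nanosecond fraction scaled to microseconds, so this snippet runs standalone):

```python
import datetime
import struct

def unpack_sql_type_timestamp(raw_bytes):
    y, m, d, h, n, s, f = struct.unpack("<h5HI", raw_bytes)
    # clamp the year into datetime's supported range (1..9999)
    y = min(max(y, 1), 9999)
    # ODBC fraction is nanoseconds; datetime wants microseconds
    return datetime.datetime(y, m, d, h, n, s, f // 1000)

# Hand-pack a timestamp with the kind of corrupt year the question describes
raw = struct.pack("<h5HI", 25600, 3, 6, 12, 0, 0, 0)
fixed = unpack_sql_type_timestamp(raw)
```

Here `fixed` comes back as a valid datetime with the year clamped to 9999 instead of raising ValueError.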
Gord Thompson