
I'm new to Python and Pandas - please be gentle!

I'm using SQLAlchemy with pymssql to execute a SQL query against a SQL Server database and then convert the result set into a dataframe. I'm then attempting to write this dataframe as a Parquet file:

  import pandas as pd
  import sqlalchemy as sal

  engine = sal.create_engine(connectionString)

  conn = engine.connect()
  df = pd.read_sql(query, con=conn)
  df.to_parquet(outputFile)

The data I'm retrieving in the SQL query includes a uniqueidentifier column (i.e. a UUID) named rowguid. Because of this, I'm getting the following error on the last line above:

pyarrow.lib.ArrowInvalid: ("Could not convert UUID('92c4279f-1207-48a3-8448-4636514eb7e2') with type UUID: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column rowguid with type object')

Is there any way I can force all UUIDs to strings at any point in the above chain of events?

A few extra notes:

  • The goal for this portion of code was to receive the SQL query text as a parameter and act as a generic SQL-to-Parquet function.
  • I realise I can do something like df['rowguid'] = df['rowguid'].astype(str), but it relies on me knowing which columns have uniqueidentifier types. By the time it's a dataframe, everything is an object and each query will be different.
  • I also know I can convert it to a char(36) in the SQL query itself; however, I was hoping to do something more "automatic" so the person writing the query doesn't trip over this problem accidentally all the time / doesn't have to remember to always convert the datatype.
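To illustrate the kind of "automatic" behaviour I'm after, here's a rough sketch of a helper (stringify_uuids is just a name I made up) that walks every object-dtype column and stringifies any uuid.UUID it finds, without knowing the column names up front:

```python
import uuid

import pandas as pd


def stringify_uuids(df: pd.DataFrame) -> pd.DataFrame:
    """Convert uuid.UUID values to strings, column-agnostically."""
    for col in df.columns[df.dtypes == object]:
        df[col] = df[col].map(
            lambda v: str(v) if isinstance(v, uuid.UUID) else v
        )
    return df


# Example: a frame standing in for the query result
df = pd.DataFrame({"rowguid": [uuid.uuid4(), None], "name": ["a", "b"]})
df = stringify_uuids(df)
```

After this, df.to_parquet(outputFile) should succeed, since pyarrow can infer the column as plain strings (nulls are left untouched).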

Any ideas?

Chris Wood
  • You could inspect the table schema and modify the query on the fly to insert the casts, but that would be fairly involved. You could try looking at the first row to see if it is a UUID, but that will fail as soon as you hit a query with no rows or all nulls. Perhaps the simplest solution is to just cast everything with a dtype of `object` to str. I don't know enough about pymssql/SQLAlchemy to know which other types might manifest as `object`. If there are other objects, then I think you will need to inspect the schema of the table and then cast the matching columns in the dataframe. – Pace Sep 07 '21 at 20:44
  • I don't understand why this is so hard: custom serialization will be removed in pyarrow 2.0, and the documentation is very shallow on what to do instead, just pointing to an IPC functionality that does not deal with this anyhow. There is a description of how to define a UuidType and register it, but it says nothing about how to automatically map UUID to UuidType. The documentation needs to be improved. – seb May 15 '22 at 01:36
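For illustration, Pace's simplest suggestion above (cast every object-dtype column to str) might be sketched like this; the null guard is my own addition, because a blanket astype(str) would turn missing values into the literal string "None":

```python
import uuid

import pandas as pd

# A frame standing in for the query result
df = pd.DataFrame({"rowguid": [uuid.uuid4(), None], "n": [1, 2]})

# Cast every object-dtype column to str, leaving nulls untouched
for col in df.select_dtypes(include="object").columns:
    df[col] = df[col].where(df[col].isna(), df[col].astype(str))
```

With only strings and nulls left in the object columns, df.to_parquet should no longer raise ArrowInvalid; the caveat Pace notes still applies, though: any other Python objects hiding in object columns get stringified too.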

1 Answer


Try DuckDB

import duckdb
import pandas as pd
import sqlalchemy as sal

engine = sal.create_engine(connectionString)

conn = engine.connect()
df = pd.read_sql(query, con=conn)

# Close the database connection
conn.close()

# Create an in-memory DuckDB connection
duck_conn = duckdb.connect(':memory:')

# Write the DataFrame to a snappy-compressed Parquet file;
# DuckDB's replacement scan picks up the local variable `df`
duck_conn.execute("COPY (SELECT * FROM df) TO 'df-snappy.parquet' (FORMAT 'parquet')")


  • appreciate the workaround, but I feel like there should be a way that does not involve installing duckdb. – Alleo Aug 22 '23 at 06:28