I'm using Dask to transfer table data into Parquet files. Some of my columns are nullable, and therefore I need to provide a schema when reading from the database and writing to Parquet. Before reading the data with read_sql_table, I create an empty pandas DataFrame, specifying the list of columns and the index, so that each column is treated as object (nullable) instead of int64 (as the ID columns would otherwise be).
cols = [
"AFE_ID",
"COST_ENTITY_ID",
"COMMENTS",
"LOCATION",
"LOCATION_CODE",
"CREATED_BY_ID",
"CREATED_DATETIME",
"MODIFIED_BY_ID",
"MODIFIED_DATETIME",
"LOCATION_NAME",
"COMPANY_ID",
"MILLIS_DELETED",
"MILLIS_INACTIVATED",
"DEFAULT_FLAG",
"FREE_TEXT_FLAG",
"LOCATION_ID__1", ## Why is this needed?
]
type_df = pd.DataFrame(columns=cols,
index=pd.Index(name="LOCATION_ID", data=[], dtype="int64")
)
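For context, the empty meta frame really does come out with object-dtype columns; a minimal sketch using just two of the columns above:

```python
import pandas as pd

# Illustrative sketch (a subset of the real column list): constructing an
# empty DataFrame from a column list gives every column the object dtype,
# which is what lets the meta frame represent nullable columns, while the
# empty int64 index stands in for the primary-key column.
meta = pd.DataFrame(
    columns=["COMMENTS", "CREATED_BY_ID"],
    index=pd.Index(name="LOCATION_ID", data=[], dtype="int64"),
)
print(meta.dtypes)       # COMMENTS and CREATED_BY_ID both report object
print(meta.index.dtype)  # int64
```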
LOCATION_ID is the primary key of the table. Initially I did not include the last element of the cols list, LOCATION_ID__1. Subsequently, when the data was read
table_df = ddf.read_sql_table(table=query,
index_col="LOCATION_ID",
uri=replica_uri,
schema="XXXXX",
meta=type_df,
)
I then encountered the error:
ValueError: Metadata mismatch found in `from_delayed`.
Partition type: `pandas.core.frame.DataFrame`
+------------------+-------+----------+
| Column | Found | Expected |
+------------------+-------+----------+
| 'LOCATION_ID__1' | int64 | - |
+------------------+-------+----------+
There is no column in the table called LOCATION_ID__1; this is the "synthetic" column I'm referring to in the subject line. I first added this "synthetic" column to the start of cols, and that did not work, but adding it to the end did. When the table was written out to Parquet and I read it back in to verify, there was no LOCATION_ID__1 column, just LOCATION_ID as the index.
Does anyone know why this synthetic column exists? What purpose does it serve?
Versions: Dask 2021.4.1, Pandas 1.2.4, Numpy 1.20.3, Pyarrow 4.0.0, Dagster 0.11.8