I'm using Dask to transfer table data into Parquet files. Some of my columns are nullable, and therefore I need to provide a schema when reading from the database and writing to Parquet. Before reading the data with read_sql_table, I create an empty pandas DataFrame, specifying the list of columns and the index, so that each column is treated as object (nullable) instead of int64 (as the ID columns would otherwise be).
cols = [
"AFE_ID",
"COST_ENTITY_ID",
"COMMENTS",
"LOCATION",
"LOCATION_CODE",
"CREATED_BY_ID",
"CREATED_DATETIME",
"MODIFIED_BY_ID",
"MODIFIED_DATETIME",
"LOCATION_NAME",
"COMPANY_ID",
"MILLIS_DELETED",
"MILLIS_INACTIVATED",
"DEFAULT_FLAG",
"FREE_TEXT_FLAG",
"LOCATION_ID__1", ## Why is this needed?
]
type_df = pd.DataFrame(columns=cols,
index=pd.Index(name="LOCATION_ID", data=[], dtype="int64")
)
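For context, the empty meta frame really does come out with object-dtype columns; a minimal sketch using just two of the columns above:

```python
import pandas as pd

# Illustrative sketch (a subset of the real column list): constructing an
# empty DataFrame from a column list gives every column the object dtype,
# which is what lets the meta frame represent nullable columns, while the
# empty int64 index stands in for the primary-key column.
meta = pd.DataFrame(
    columns=["COMMENTS", "CREATED_BY_ID"],
    index=pd.Index(name="LOCATION_ID", data=[], dtype="int64"),
)
print(meta.dtypes)       # COMMENTS and CREATED_BY_ID both report object
print(meta.index.dtype)  # int64
```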
LOCATION_ID is the primary key of the table. Initially I did not include the last element of the cols list, LOCATION_ID__1. Subsequently, when the data was read
table_df = ddf.read_sql_table(table=query,
index_col="LOCATION_ID",
uri=replica_uri,
schema="XXXXX",
meta=type_df,
)
I then encountered the error:
ValueError: Metadata mismatch found in `from_delayed`.
Partition type: `pandas.core.frame.DataFrame`
+------------------+-------+----------+
| Column | Found | Expected |
+------------------+-------+----------+
| 'LOCATION_ID__1' | int64 | - |
+------------------+-------+----------+
There is no column in the table called LOCATION_ID__1; this is the "synthetic" column I'm referring to in the subject line. I first added this "synthetic" column to the start of cols, and that did not work, but adding it to the end did. When the table was written out to Parquet and I read it back in to verify, there was no LOCATION_ID__1 column, just LOCATION_ID as the index.
Does anyone know why this synthetic column exists? What purpose does it serve?
Versions: Dask 2021.4.1, Pandas 1.2.4, Numpy 1.20.3, Pyarrow 4.0.0, Dagster 0.11.8