This follows on from the question "dask dataframe read parquet schema difference".
However, the metadata returned by Dask does not indicate any differences between the dataframes. Here is my code, which parses the exception details to find mismatched dtypes; it finds none. There are up to 100 dataframes, each with 717 columns and about 100 MB in size.
import dask.dataframe as dd
import pandas as pd

try:
    df = dd.read_parquet(data_filenames, columns=list(cols_to_retrieve), engine='pyarrow')
except Exception as ex:
    # Parse the exception message to find the diff; this will break if dask changes its error message
    msgs = str(ex).split('\nvs\n')
    # Keep only the column listing of each schema, dropping the trailing metadata section
    cols1 = msgs[0].split('metadata')[0]
    cols1 = cols1.split('was different. \n')[1]
    cols2 = msgs[1].split('metadata')[0]
    # One row per "name: dtype" line of each schema
    df1_err = pd.DataFrame([sub.split(":") for sub in cols1.splitlines()])
    df1_err = df1_err.dropna()
    df2_err = pd.DataFrame([sub.split(":") for sub in cols2.splitlines()])
    df2_err = df2_err.dropna()
    # Keep only the rows that appear in one schema but not the other
    df_err = pd.concat([df1_err, df2_err]).drop_duplicates(keep=False)
    raise Exception('Mismatch dataframes - ' + str(df_err))
The exception I get back is:
'Mismatch dataframes - Empty DataFrame Columns: [0, 1] Index: []'
This error does not occur with the fastparquet engine, but that engine is so slow that it is unusable.
In an attempt to unify the dtypes by column, I added this to the code that creates the dataframes (they are saved with pandas to_parquet):
# Cast float16/float64 columns to float32 and move them to the end of the frame
df_float = df.select_dtypes(include=['float16', 'float64']).astype('float32')
df = df.drop(df_float.columns, axis=1)
df = pd.concat([df, df_float], axis=1)

# Upcast the smaller integer columns to int64 in the same way
try:
    df_int = df.select_dtypes(include=['int8', 'int16', 'int32']).astype('int64')
    df = df.drop(df_int.columns, axis=1)
    df = pd.concat([df, df_int], axis=1)
except ValueError as ve:
    print('Error with upcasting - ' + str(ve))
Judging by the empty diff above, the casting appears to work, yet read_parquet still raises. I cannot find out how the dataframes differ, because the exception thrown by dask's read_parquet does not tell me. Any ideas on how to determine what it considers different?
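
One way to look for the difference without depending on the wording of the exception is to read each file's schema directly and compare the column types. The following is a minimal sketch, assuming pyarrow is installed; `read_column_types` and `find_type_mismatches` are hypothetical helper names, not part of dask or pyarrow, and the file names in the example are made up. It compares only column names and Arrow types, so it would not reveal a difference that exists purely in the schema metadata.

```python
from collections import defaultdict


def read_column_types(path):
    """Return {column name: Arrow type as a string} for one Parquet file."""
    # Imported lazily so the comparison helper below works without pyarrow.
    import pyarrow.parquet as pq

    schema = pq.read_schema(path)  # reads the schema, not the column data
    return {name: str(typ) for name, typ in zip(schema.names, schema.types)}


def find_type_mismatches(types_by_file):
    """Map each column whose type differs between files to {type: [files]}.

    A column absent from a file is reported with the placeholder '<missing>'.
    """
    all_columns = set()
    for column_types in types_by_file.values():
        all_columns.update(column_types)

    mismatches = {}
    for column in sorted(all_columns):
        files_by_type = defaultdict(list)
        for filename, column_types in types_by_file.items():
            files_by_type[column_types.get(column, '<missing>')].append(filename)
        if len(files_by_type) > 1:  # more than one distinct type seen
            mismatches[column] = dict(files_by_type)
    return mismatches


# Hand-written type maps standing in for real files (hypothetical names).
example = {
    'a.parquet': {'x': 'float', 'y': 'int64'},
    'b.parquet': {'x': 'double', 'y': 'int64'},
    'c.parquet': {'x': 'float'},
}
report = find_type_mismatches(example)
for column, files_by_type in report.items():
    print(column, files_by_type)
```

With real files, the input could be built as `{f: read_column_types(f) for f in data_filenames}`. Grouping the file names by type makes it easy to see which of the ~100 files is the odd one out for a given column.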