
I am referring to this question: dask dataframe read parquet schema difference

But the metadata returned by Dask does not indicate any differences between the dataframes. Here is my code, which parses the exception details to find mismatched dtypes; it finds none. There are up to 100 dataframes with 717 columns each (each file is ~100MB).

    import dask.dataframe as dd
    import pandas as pd

    try:
        df = dd.read_parquet(data_filenames, columns=list(cols_to_retrieve), engine='pyarrow')
    except Exception as ex:
        # Parse the exception message to find the mismatched dtypes;
        # this will break if Dask changes its error message format.
        msgs = str(ex).split('\nvs\n')
        cols1 = msgs[0].split('metadata')[0]
        cols1 = cols1.split('was different. \n')[1]
        cols2 = msgs[1].split('metadata')[0]
        # Build a DataFrame of column:dtype pairs for each side of the diff.
        df1_err = pd.DataFrame([sub.split(":") for sub in cols1.splitlines()])
        df1_err = df1_err.dropna()
        df2_err = pd.DataFrame([sub.split(":") for sub in cols2.splitlines()])
        df2_err = df2_err.dropna()
        # Rows that appear on only one side are the actual mismatches.
        df_err = pd.concat([df1_err, df2_err]).drop_duplicates(keep=False)
        raise Exception('Mismatch dataframes - ' + str(df_err))

The exception I get back is:

'Mismatch dataframes - Empty DataFrame Columns: [0, 1] Index: []'

This error does not occur with fastparquet, but it is so slow that it is unusable.

I added the following when creating the dataframes (they are saved with pandas' to_parquet), in an attempt to unify the dtypes by column:

    # Downcast all float16/float64 columns to a uniform float32.
    df_float = df.select_dtypes(include=['float16', 'float64'])
    df = df.drop(df_float.columns, axis=1)

    for col in df_float.columns:
        df_float[col] = df_float.loc[:, col].astype('float32')

    df = pd.concat([df, df_float], axis=1)

    # Upcast all smaller integer columns to a uniform int64.
    df_int = df.select_dtypes(include=['int8', 'int16', 'int32'])

    try:
        for col in df_int.columns:
            df_int[col] = df_int.loc[:, col].astype('int64')
        df = df.drop(df_int.columns, axis=1)
        df = pd.concat([df, df_int], axis=1)
    except ValueError as ve:
        print('Error with upcasting - ' + str(ve))

This appears to work, according to my exception handler above. But I still cannot find out how the dataframes differ, because the exception thrown by Dask's read_parquet does not tell me. Any ideas on how to determine what it sees as different?
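One way to pin down the offending file without relying on Dask's error message is to compare the Parquet footers directly with pyarrow. This is a minimal diagnostic sketch; it assumes `data_filenames` is the same list passed to `read_parquet`, and a recent enough pyarrow for `Schema.field` to accept a column name:

    import pyarrow.parquet as pq

    # Read only the footer schema of each file and compare the Arrow
    # type of every column against the first file's schema.
    reference = pq.read_schema(data_filenames[0])

    for fn in data_filenames[1:]:
        schema = pq.read_schema(fn)
        for name in reference.names:
            if name not in schema.names:
                print(f'{fn}: column {name!r} is missing')
            elif schema.field(name).type != reference.field(name).type:
                print(f'{fn}: column {name!r} is {schema.field(name).type}, '
                      f'expected {reference.field(name).type}')

This prints the exact file and column that disagree, which the Dask exception does not.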

Olddave
  • For reading with fastparquet, depending on your version of dask, you may be able to pass `infer_divisions=False` to avoid scanning all the files ahead of time (a sketch follows these comments); or use fastparquet itself to create a metadata file one time only, thereafter reading that file quickly. – mdurant Jan 09 '19 at 15:57
  • All well and good, but I prefer to use pyarrow, if only its exception reporting were better... – Olddave Jan 09 '19 at 22:48
  • I may be wrong, but I think that the metadata file that fastparquet can make for you may also speed up pyarrow and squash this error. Worth a try (but yes, the situation should have been better anyway). – mdurant Jan 09 '19 at 22:55
  • I could not find any information on getting pyarrow to use fastparquet metadata. The closest I found was this - https://stackoverflow.com/questions/52122674/how-to-write-parquet-metadata-with-pyarrow?rq=1 – Olddave Jan 10 '19 at 12:32
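As a sketch of the `infer_divisions=False` suggestion in the first comment above (assuming a Dask version that still accepts the parameter):

    # Skip the up-front scan that infers divisions from every file's
    # statistics; the files are then only read lazily.
    df = dd.read_parquet(data_filenames, columns=list(cols_to_retrieve),
                         engine='fastparquet', infer_divisions=False)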

1 Answer


You could use the fastparquet function merge to create a metadata file from the many data files (this will take some time, since it scans all the files). Thereafter, pyarrow will use this metadata file, and that might be enough to get rid of the problem for you.
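A minimal sketch of that approach, assuming the data files sit in one directory and that the installed fastparquet exposes `fastparquet.writer.merge` (check the signature against your version; the directory name here is hypothetical):

    import glob
    import dask.dataframe as dd
    import fastparquet

    # One-time scan of every data file; writes a combined _metadata
    # file alongside them.
    data_filenames = sorted(glob.glob('my_data_dir/*.parquet'))
    fastparquet.writer.merge(data_filenames)

    # Subsequent reads can resolve the schema from _metadata instead
    # of opening every file individually.
    df = dd.read_parquet('my_data_dir', engine='pyarrow')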

mdurant
  • Unfortunately the merge just reports what I already know: schema differences. I have abandoned this approach and will use Dask to manage the dtypes when it appends each DataFrame I process. – Olddave Jan 11 '19 at 18:05