
I had a very simple idea: use Python Pandas (for convenience) to do some simple database operations on a moderate amount of data and write the result back to S3 in Parquet format. The data should then be exposed to Redshift as an external table so that it does not take up storage space on the actual Redshift cluster.

I found two ways to do that.


Given the data:

from datetime import date, datetime

import pandas as pd

data = {
    'int': [1, 2, 3, 4, None],
    'float': [1.1, None, 3.4, 4.0, 5.5],
    'str': [None, 'two', 'three', 'four', 'five'],
    'boolean': [True, None, True, False, False],
    'date': [
        date(2000, 1, 1),
        date(2000, 1, 2),
        date(2000, 1, 3),
        date(2000, 1, 4),
        None,
    ],
    'timestamp': [
        datetime(2000, 1, 1, 1, 1, 1),
        datetime(2000, 1, 1, 1, 1, 2),
        None,
        datetime(2000, 1, 1, 1, 1, 4),
        datetime(2000, 1, 1, 1, 1, 5),
    ]
}

df = pd.DataFrame(data)

df['int'] = df['int'].astype(pd.Int64Dtype())
df['date'] = df['date'].astype('datetime64[D]')
df['timestamp'] = df['timestamp'].astype('datetime64[s]')

The type casts at the end are necessary in both cases to ensure that Pandas' automatic type inference does not interfere.
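
Without those casts, Pandas infers the types from the mixed lists above on its own, which you can check quickly; a minimal sketch of what the casts guard against:

import pandas as pd
from datetime import date

# Without explicit casts, the None values make Pandas fall back to
# float64 for the 'int' column and a plain object dtype for 'date'.
raw = pd.DataFrame({'int': [1, None], 'date': [date(2000, 1, 1), None]})
print(raw.dtypes)  # int -> float64, date -> object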

With PyArrow:

Using PyArrow, you do it like this:

import pyarrow as pa

pyarrow_schema = pa.schema([
    ('int', pa.int64()),
    ('float', pa.float64()),
    ('str', pa.string()),
    ('boolean', pa.bool_()),
    ('date', pa.date64()),
    ('timestamp', pa.timestamp(unit='s'))
])

df.to_parquet(
    path='pyarrow.parquet',
    schema=pyarrow_schema,
    engine='pyarrow'
)
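
To double-check that the intended schema actually ended up in the file, you can read it back with PyArrow; a quick sketch using the file name from above:

import pyarrow.parquet as pq

# Read the file back and print the Parquet schema that was written.
table = pq.read_table('pyarrow.parquet')
print(table.schema)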

Why use PyArrow: Pandas' default engine for Parquet export is PyArrow, so you can expect good integration. Also, PyArrow provides extensive features and supports many data types.

With fastparquet:

Using fastparquet, you write out the data like this:

from fastparquet import write

write('fast.parquet', df, has_nulls=True, times='int96')

The important bit here is the 'times' parameter. See this post, where I found a remedy for the 'date' column.
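
To see what fastparquet actually wrote, you can read the file back and inspect the reconstructed dtypes; a small sketch using the file name from above:

from fastparquet import ParquetFile

# Read the file back into a DataFrame and inspect the resulting dtypes.
pf = ParquetFile('fast.parquet')
df_back = pf.to_pandas()
print(df_back.dtypes)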

Why use fastparquet: fastparquet is much more limited than PyArrow, especially when it comes to accepted data types. On the other hand, the package is much smaller.
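
Whichever engine you use, the resulting Parquet file still has to be uploaded to S3 before Redshift can see it. A minimal sketch with boto3; the bucket and prefix are placeholders and have to match the LOCATION of the external table below:

import boto3

# Upload the local Parquet file to the S3 prefix the external table will point at.
s3 = boto3.client('s3')
s3.upload_file('fast.parquet', '<your_bucket>', '<your_prefix>/fast.parquet')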

The external table:

Given that you have exported your data to Parquet and stored it in S3, you can then expose it to Redshift like this:

CREATE EXTERNAL TABLE "<your_external_schema>"."<your_table_name>" (
  "int" bigint,
  "float" float,
  "str" varchar(255),
  "boolean" bool,
  "date" date,
  "timestamp" timestamp)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://<your_bucket>/<your_prefix>/';
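
The statement assumes that the external schema already exists. If it does not, you first have to create one that is backed by the Glue Data Catalog and an IAM role allowed to read the bucket; a sketch with placeholder names:

CREATE EXTERNAL SCHEMA "<your_external_schema>"
FROM DATA CATALOG
DATABASE '<your_glue_database>'
IAM_ROLE 'arn:aws:iam::<your_account_id>:role/<your_spectrum_role>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

Afterwards you can query the external table like any other table:

SELECT * FROM "<your_external_schema>"."<your_table_name>" LIMIT 5;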

Final story and note:

When I started working with Pandas, Parquet and external Redshift tables in the context of AWS Lambda functions, everything was fine for a while, until I reached the point where the bundle for my Lambda package hit its allowed limit (deployment package size). Checking which of my dependencies accounted for all that, I found PyArrow, Pandas and NumPy (a dependency of Pandas) to be the culprits. While I definitely could not drop NumPy (for efficiency) and did not want to lose Pandas (convenience, again), I looked to replace PyArrow with something more lightweight. Et voilà: fastparquet. After some research and a lot of experimentation, I could make this work as well.

I hope some other people find this explanation and these resources helpful.

Werner

Comment from Ashish Kumar Mondal (Jul 02 '23): I tried the same with fastparquet. My parquet has both int and date_time. With times='int96', the Hive query can identify date_time, but the int32 column is not being identified. I always get this error: Failed with exception java.io.IOException: java.lang.IllegalStateException: Group type [message schema { optional binary tagname (UTF8); optional int96 date_time; optional binary val (UTF8); optional int32 quality; optional binary i_quality; optional binary i_source (UTF8); optional int96 i_time; optional binary cust; optional binary ast; optional binary date_stamp; }

1 Answer


The question already holds the answer. :)

Werner