
I'm trying to run a SageMaker batch transform job on a large (2GB) parquet file and keep running into issues. In my transformer, I had to specify split_type='Line' to avoid the following error, which appears even with max_payload=100:

Too much data for max payload size

Instead of the above error, I now get a different error when pd.read_parquet(data) is called:

sagemaker_containers._errors.ClientError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.

I also tried setting max_payload=0 instead of split_type='Line', but that consumes too much memory during the transformation, so I do want the benefits of splitting the data, even for parquet files.
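As far as I understand, parquet is a binary columnar format whose files begin and end with the 4-byte magic sequence b'PAR1', so splitting the stream on newline bytes hands input_fn an arbitrary slice that has no footer. A quick local check (using a stand-in file example.parquet) illustrates this:

# A valid parquet file must start and end with the magic bytes b'PAR1';
# an arbitrary newline-delimited slice of it almost never will.
with open('example.parquet', 'rb') as f:
    raw = f.read()

print(raw[:4], raw[-4:])      # b'PAR1' b'PAR1' for an intact file

chunk = raw.split(b'\n')[0]   # roughly what split_type='Line' produces
print(chunk[-4:] == b'PAR1')  # False in practice: the footer is gone

Which would explain why pyarrow reports "Parquet magic bytes not found in footer" on the chunks.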

Here is my code:

from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name=model_name,
    instance_type='ml.m5.4xlarge',
    instance_count=1,
    output_path=output_path,
    accept='application/x-parquet',
    strategy='MultiRecord',
    max_payload=100,
)
transformer.transform(
    data=data,
    content_type='application/x-parquet',
    split_type='Line',
)
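One workaround I've been considering (untested) is to pre-split the file myself rather than letting SageMaker split it: write the 2GB parquet out as many small parquet files, upload them under a single S3 prefix, and run the transform with no split_type so that each file arrives at input_fn whole. A sketch with pyarrow, where large.parquet, the batch size, and the part-*.parquet names are all placeholders:

import pyarrow as pa
import pyarrow.parquet as pq

# Break the big file into many small, self-contained parquet files,
# each comfortably under the max_payload limit.
parquet_file = pq.ParquetFile('large.parquet')
for i, batch in enumerate(parquet_file.iter_batches(batch_size=200_000)):
    table = pa.Table.from_batches([batch])
    pq.write_table(table, 'part-{:05d}.parquet'.format(i))

The transform would then point data at the S3 prefix holding the parts, and memory stays bounded by the part size instead of the full 2GB.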

And in the model:

from io import BytesIO

import pandas as pd


def input_fn(input_data, content_type):
    if content_type == 'application/x-parquet':
        data = BytesIO(input_data)
        df = pd.read_parquet(data)
        return df
    else:
        raise ValueError("{} not supported by script!".format(content_type))


def output_fn(prediction, accept):
    if accept == "application/x-parquet":
        buffer = BytesIO()
        prediction.to_parquet(buffer)
        return buffer.getvalue()
    else:
        raise Exception("Requested unsupported ContentType in Accept: " + accept)
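
For debugging, I could also guard input_fn so the failure is clearer than pyarrow's footer error. A sketch only, assuming input_data arrives as raw bytes:

def input_fn(input_data, content_type):
    if content_type == 'application/x-parquet':
        # Hypothetical guard: a complete parquet payload starts and ends
        # with b'PAR1'; anything else was split or truncated in transit.
        if input_data[:4] != b'PAR1' or input_data[-4:] != b'PAR1':
            raise ValueError('Payload is not a complete parquet file; '
                             'it may have been split in transit.')
        return pd.read_parquet(BytesIO(input_data))
    raise ValueError("{} not supported by script!".format(content_type))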

I have looked into the answers here, and none of them fixes the problem I'm having.

Is there any way that I can use split_type with the parquet file and not run into this error?

  • I am not sure split_type='Line' is appropriate for parquet; it typically works for CSV or JSONL. It probably doesn't deliver the parquet intact to your input_fn(). Can you try running print(data) and print(data.decode('utf-8')) inside it and see what they give as a result? – Giuseppe La Gualano Nov 05 '22 at 10:31
  • print(data) just gives a bunch of binary (things like \x00), and print(data.decode('utf-8')) raises an error; a corrected version of it still gives an undecipherable result with special characters. I don't think parquet can be decoded as utf-8. – Jonathon K Nov 07 '22 at 15:08
  • I am currently facing the same issue. Have you found a solution? – gilbertocunha May 18 '23 at 14:44
  • I didn't end up using a transform job on parquet files, and frankly we didn't end up using batch transform at all, because using CSV is quite slow. I think this is a limitation of parquet files: being binary, they can't easily be split. Though this is just from memory; I haven't worked with this stuff in a while. – Jonathon K May 19 '23 at 15:30
