I'm trying to run a SageMaker batch transform job on a large (2 GB) parquet file, and I keep running into issues. In my transformer I have had to specify split_type='Line', otherwise I get the following error even with max_payload=100:
Too much data for max payload size
With the split in place, I instead get a different error when pd.read_parquet(data) is called:
sagemaker_containers._errors.ClientError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
I also tried using max_payload=0 instead of split_type='Line', but that consumes too much memory during the transformation, so I really do want the benefits of splitting the data for parquet files.
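To illustrate why I suspect the newline splitting itself is the problem, here's a minimal local sketch (assuming pandas with pyarrow installed) that reproduces the same footer error by reading only a slice of a valid parquet byte stream, which is roughly what I imagine a 'Line' split hands to the container:

from io import BytesIO

import pandas as pd

# Build a small parquet file in memory.
buf = BytesIO()
pd.DataFrame({"x": range(100)}).to_parquet(buf)
parquet_bytes = buf.getvalue()

# Reading the complete byte stream works fine.
print(pd.read_parquet(BytesIO(parquet_bytes)).shape)

# Reading only a prefix of the bytes (roughly what a newline split
# would deliver) raises the same "magic bytes not found in footer" error.
try:
    pd.read_parquet(BytesIO(parquet_bytes[: len(parquet_bytes) // 2]))
except Exception as exc:
    print(exc)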
Here is my code:
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name=model_name,
    instance_type='ml.m5.4xlarge',
    instance_count=1,
    output_path=output_path,
    accept='application/x-parquet',
    strategy='MultiRecord',
    max_payload=100,  # payload limit in MB
)

transformer.transform(
    data=data,
    content_type='application/x-parquet',
    split_type='Line',
)
And in the model's inference script:
from io import BytesIO

import pandas as pd


def input_fn(input_data, content_type):
    if content_type == 'application/x-parquet':
        data = BytesIO(input_data)
        df = pd.read_parquet(data)
        return df
    else:
        raise ValueError("{} not supported by script!".format(content_type))
def output_fn(prediction, accept):
    if accept == "application/x-parquet":
        buffer = BytesIO()
        prediction.to_parquet(buffer)
        return buffer.getvalue()
    else:
        raise Exception("Requested unsupported ContentType in Accept: " + accept)
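For reference, this is the kind of quick local round-trip I use to sanity-check the handlers on a whole, unsplit file (a sketch only; 'sample.parquet' is just a placeholder path, and the predictions stand-in is not the real model output):

# Local sanity check of the handlers on an unsplit parquet file.
with open('sample.parquet', 'rb') as f:
    raw_bytes = f.read()

df = input_fn(raw_bytes, 'application/x-parquet')
predictions = df  # stand-in for whatever the model actually returns
out_bytes = output_fn(predictions, 'application/x-parquet')
print(len(out_bytes))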
I have looked into the answers here, and none of them fix the problem that I'm having.
Is there any way I can use split_type with a parquet file without running into this error?