
I'm a newbie in AWS Glue and Spark, and I'm building my ETL with them. When I connect to my S3 bucket and try to read files of approximately 200 MB, Glue fails to read them. The error is:

An error was encountered:
An error occurred while calling o99.toDF.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 10.0 failed 1 times, most recent failure: Lost task 1.0 in stage 10.0 (TID 16) (91ec547edca7 executor driver): com.amazonaws.services.glue.util.NonFatalException: Record larger than the Split size: 67108864
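
For context, a minimal Glue script along the following lines sets up the kind of read that fails; the bucket and file name are placeholders, not taken from the question. The o99.toDF in the traceback corresponds to the toDF() call at the end:

    # Minimal sketch of the failing read; paths are placeholders.
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    sc = SparkContext.getOrCreate()
    glue_context = GlueContext(sc)

    # Glue reads the S3 JSON into a DynamicFrame in 64 MB (67108864-byte)
    # splits; a single record larger than one split triggers the
    # NonFatalException shown above.
    dyf = glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://my-bucket/big-file.json"]},
        format="json",
    )
    df = dyf.toDF()  # the failure in the traceback surfaces here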

Update 1: When I split my JSON file (200 MB) with jq into two parts, AWS Glue reads both parts normally.

My current solution is a Lambda that splits the file, but I want to know how the AWS Glue split works. Thanks and regards.
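
One workaround sketch (untested here, and assuming the standard Glue job boilerplate provides spark and glue_context): since the NonFatalException comes from Glue's own reader, read the file with Spark's native JSON source instead, then wrap the result back into a DynamicFrame if the rest of the job needs one. The path is a placeholder:

    from awsglue.dynamicframe import DynamicFrame

    # Spark's own JSON reader; "multiLine" is needed if the file is a
    # single top-level array or object rather than JSON Lines.
    df = (
        spark.read
        .option("multiLine", "true")
        .json("s3://my-bucket/big-file.json")
    )

    # Wrap back into a DynamicFrame for the rest of the Glue job.
    dyf = DynamicFrame.fromDF(df, glue_context, "big_json")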

  • This is a little bit of a guess here... but I think it's complaining that you have a *record* that exceeds the *file* split size. What type of files are these? Maybe you specified the wrong delimiter on a CSV? – Bob Haffner May 21 '22 at 17:49
  • It's a JSON file. I thought AWS split large files automatically. – Vitualizz Uzumaki May 22 '22 at 03:27
  • Glue/Spark will split files, but not records, which I think is the issue. Perhaps there's a way to increase the max split size to accommodate these large records. Or perhaps there's a format issue with your JSON (see the sketch after these comments). – Bob Haffner May 22 '22 at 03:56
  • Uhmm, for example my JSON has 40K records, so the problem is the JSON format? But with small files (50 MB) it's all good :/ – Vitualizz Uzumaki May 22 '22 at 04:28
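
Following up on the format point in the comments: if the 200 MB file is a single top-level JSON array, the reader can end up treating a large span of it as one record. A common fix is to convert the file to JSON Lines (one record per line) so splits can fall on record boundaries. A minimal sketch, run once outside Glue with placeholder file names, assuming the top level is an array that fits in memory:

    import json

    # Rewrite a single JSON array as JSON Lines: one record per line.
    with open("big-file.json") as src, open("big-file.jsonl", "w") as dst:
        for record in json.load(src):
            dst.write(json.dumps(record) + "\n")

With 40K records in 200 MB, the average record is only a few kilobytes, so after this conversion no single record should come anywhere near the 64 MB split limit.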

0 Answers