dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",  # coalesce small files within a partition
        "groupSize": "100000"         # target group size in bytes
    },
    format_options={
        "withHeader": True,
        "separator": ","
    }
)

It takes 45 seconds to read from S3. Is there any way to optimize the read time?

sheetal_158

1 Answer


You could try the optimizePerformance option if you're using Glue 3.0. It batches records to reduce IO; see the AWS Glue documentation for more details.

dyf_pagewise_word_count = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={
        "paths": ["s3://somefile.csv/"],
        "recurse": True,
        "groupFiles": "inPartition",
        "groupSize": "100000"
    },
    format_options={
        "withHeader": True,
        "separator": ",",
        "optimizePerformance": True  # enables the vectorized CSV reader in Glue 3.0
    }
)

Also, could you convert the CSV to something like Parquet upstream of the read?
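For example, a one-time conversion job along these lines (a minimal sketch; the bucket paths are placeholders, and glueContext is assumed to be set up as in your script):

dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    format="csv",
    connection_options={"paths": ["s3://your-input-bucket/"], "recurse": True},
    format_options={"withHeader": True, "separator": ","}
)

# Write the same data back out as Parquet; later jobs read
# s3://your-output-bucket/parquet/ instead of the raw CSV.
glueContext.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://your-output-bucket/parquet/"},
    format="parquet"
)

Parquet is columnar and compressed, so subsequent reads typically scan far fewer bytes than the equivalent CSV.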

Bob Haffner