9

I want to use a Glue ETL job to read data from S3, since with ETL jobs I can set the DPU count to hopefully speed things up.

But how do I do it? I tried

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read the CSV files from S3 into a DynamicFrame
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://pinfare-glue/testing-csv"]},
    format="csv")

# Write the DynamicFrame back to S3 as Parquet
outputGDF = glueContext.write_dynamic_frame.from_options(
    frame=inputGDF,
    connection_type="s3",
    connection_options={"path": "s3://pinfare-glue/testing-output"},
    format="parquet")

But it appears there is nothing written. My folder looks like:

(screenshot of the S3 folder, showing date-named subfolders such as 2018-09-26)

What's incorrect? My output S3 path only has a file like: testing_output_$folder$

  • Why aren't you using a crawler-based approach? – Kishore Bharathy Nov 02 '18 at 16:39
  • @KishoreBharathy correct me if I am wrong, but I think that to convert CSV into Parquet with crawlers, I need one crawler to crawl the CSV into the Data Catalog, one ETL job to convert the data in the catalog into Parquet in S3, and then another crawler to crawl these Parquet files into another catalog table for querying (a sketch of this catalog-based flow follows these comments). This seems very inefficient. Also, it appears crawlers do not support bookmarks, so I need to crawl my entire data set every time? – Jiew Meng Nov 03 '18 at 01:46
  • Yes, you are right, for data files with a varying schema, mostly having columns appended at the end! – Kishore Bharathy Nov 05 '18 at 15:23
  • @JiewMeng did you manage to solve this? I'm trying something very similar (converting S3 JSON files to CSV), and I used your code as a base. https://stackoverflow.com/questions/56244413/how-to-convert-json-files-stored-in-s3-to-csv-using-glue – fsakiyama May 21 '19 at 20:36
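
For reference, here is a minimal, hedged sketch of the catalog-based flow discussed in these comments, reusing the SparkContext/GlueContext/Job setup from the question's script. The database and table names below (pinfare_db, testing_csv) are placeholders, not names taken from the question, and a crawler is assumed to have already created the table.

# Hedged sketch of the crawler/catalog approach; sc, glueContext and job are set up as in the question.
# "pinfare_db" and "testing_csv" are hypothetical names a CSV crawler might have created.
inputGDF = glueContext.create_dynamic_frame.from_catalog(
    database="pinfare_db",          # hypothetical Data Catalog database
    table_name="testing_csv",       # hypothetical table built by the crawler
    transformation_ctx="inputGDF")  # transformation_ctx is what job bookmarks key on

outputGDF = glueContext.write_dynamic_frame.from_options(
    frame=inputGDF,
    connection_type="s3",
    connection_options={"path": "s3://pinfare-glue/testing-output"},
    format="parquet")

job.commit()  # needed at the end of the run for job bookmarks to advance

Whether the extra crawlers are worth it mostly depends on whether you also want the tables queryable (for example from Athena); for a pure CSV-to-Parquet conversion, reading straight from S3 as in the question is simpler.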

2 Answers

9

I believe the issue here is that you have subfolders within the testing-csv folder, and since you did not set recurse to true, Glue is not able to find the files in the 2018-09-26 subfolder (or, in fact, in any other subfolder).

You need to add the recurse option as follows:

# "recurse": True tells Glue to also read files in subfolders under each path
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3", format="csv",
    connection_options={"paths": ["s3://pinfare-glue/testing-csv"], "recurse": True})

Also, regarding your question about crawlers in the comments: they help infer the schema of your data files. In your case a crawler would not change anything, since you are creating the DynamicFrame directly from S3.

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-extensions-glue-context.html#aws-glue-api-crawler-pyspark-extensions-glue-context-create_dynamic_frame_from_options
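
For completeness, a minimal end-to-end sketch of the corrected job: it uses the paths from the question, the recurse option from above, and adds a job.commit() at the end, which the original snippet omits and which Glue uses to finalize the run and advance job bookmarks.

import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Read every CSV under the prefix, including subfolders such as 2018-09-26/
inputGDF = glueContext.create_dynamic_frame_from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://pinfare-glue/testing-csv"], "recurse": True},
    format="csv")

# Convert to Parquet under the output prefix
glueContext.write_dynamic_frame.from_options(
    frame=inputGDF,
    connection_type="s3",
    connection_options={"path": "s3://pinfare-glue/testing-output"},
    format="parquet")

job.commit()  # not in the original snippet; finalizes the run and job bookmarks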

0

It worked for me when I removed the recurse setting.

(screenshot of the working script, not transcribed)

  • 1
    Welcome to Stack Overflow. [Please don't post screenshots of text](https://meta.stackoverflow.com/a/285557/354577). They can't be searched or copied, or even consumed by users of adaptive technologies like screen readers. Instead, paste the code as text directly into your answer. If you select it and click the `{}` button or Ctrl+K the code block will be indented by four spaces, which will cause it to be rendered as code. – ChrisGPT was on strike May 23 '22 at 18:46