
I have a Python script that cleans some data from a CSV file using MySQL. I want to turn that script into something that Glue can use, but I am not sure what SQL syntax can be used with Glue.

This is the statement from my python / mysql script that I need to adapt:

update_sql = """ UPDATE """ + my_table_name + """
        SET identity_LineItemId=NULLIF(identity_LineItemId, ''),
            identity_TimeInterval=NULLIF(identity_TimeInterval, ''),
            bill_InvoiceId=NULLIF(bill_InvoiceId, ''),
            bill_BillingEntity=NULLIF(bill_BillingEntity, ''),
            bill_BillType=NULLIF(bill_BillType, ''),
            Engagement=NULLIF(Engagement, '') ; """

Also how would I select the database in my Glue script? Does standard SQL work?

Any advice I could get on how to convert this update statement from MySQL to a syntax that Glue would understand would be appreciated! I am really new to Glue.

bluethundr
  • Glue (Spark) only supports create, select and insert statements. However, there can be alternate ways to achieve that. What is your end goal? From what I understand, you want to load a CSV into a MySQL table and then clean that data using an update command. You don't need a Glue job for performing a simple update operation in an RDBMS. – Harsh Bafna Sep 23 '19 at 13:53
  • Thank you. My end goal is to clean some data. There is a field in the DB called 'engagement' which is used to track customers in AWS. The engagement tag is supposed to be mandatory but it's not enforced. So some resources have the tag, some have a blank engagement tag, some have bogus info (eg 12345678910) and some of the engagement tags become old and outdated and need to be replaced with the current engagement code. I want to import the bill containing this info from S3 into glue, replace the values I need, and then send it back to S3 for reading by Athena. – bluethundr Sep 23 '19 at 17:15
  • Okay. This is definitely achievable through Glue. However, your Glue job will generate new output files rather than overwrite your source file. What is the size of the data you want to process? Using Athena might be much simpler and more cost-effective for this task. – Harsh Bafna Sep 24 '19 at 02:24
  • Ok thanks. The parquet file that I'm reading is 5GB with 12 million rows. I am planning to send the data to a new S3 bucket that's different from the original file. I thought that Athena can't alter data, and that's why I was using Glue. Please correct me if I'm wrong. – bluethundr Sep 24 '19 at 15:49
  • It's true that Athena can't alter the source data. However, you can use the "Create Table As Select" command, which will allow you to perform the required changes, write the output to a new S3 location in the required format, and create a new table on top of it. Also, it will be pure SQL, unlike Glue where you will need to go through Spark transformations. Furthermore, Athena is way faster and more cost-effective compared to Glue. – Harsh Bafna Sep 24 '19 at 17:15
  • Ok thanks. Since we've started corresponding I've already gotten most of my transformations working in Glue. So I think I'll stick to that for now, unless I run into any major hurdles. Thank you for your input! – bluethundr Sep 24 '19 at 17:29
  • Ok. That is great to hear. Do let me know in case you need any help with Glue transformation. – Harsh Bafna Sep 24 '19 at 17:31
  • Ok thanks. Actually I was struggling a bit with how to translate this query into something Glue will understand: https://pastebin.com/Tc9v3K4d Would you mind providing a spark example of that statement? Thanks – bluethundr Sep 24 '19 at 17:37
  • Hey! Actually I worked out that SQL to Spark query on my own. Though it is hella ugly, it does work! lol Getting a working prototype going for now. Hoping to revisit my code later and make it a bit more elegant somewhere down the line. Thanks for the offer of help. – bluethundr Sep 24 '19 at 18:29
  • This is great. Congratulations. And yeah spark code can get ugly, :D – Harsh Bafna Sep 25 '19 at 03:03
  • Awesome, thank you! Yes I agree it can get ugly. Appreciate your input! Thank you. – bluethundr Sep 25 '19 at 14:00
  • Added a sample code in answer as well :-) – Harsh Bafna Sep 25 '19 at 14:21
  • OK, really nice! Thank you! – bluethundr Sep 25 '19 at 14:49
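For reference, the Athena "Create Table As Select" alternative mentioned in the comments above could look roughly like the sketch below, here submitted through boto3. This is only an illustration: billing_raw, billing_clean, my_database, and the S3 locations are placeholder names, and only the six columns from the question are listed, so any other columns you want to keep would need to be selected as well.

import boto3

# Placeholder setup: adjust the database, table names, and S3 paths to your environment
athena = boto3.client('athena')

ctas_query = """
    CREATE TABLE billing_clean
    WITH (format = 'PARQUET', external_location = 's3://my-clean-bucket/billing/') AS
    SELECT
        NULLIF(identity_LineItemId, '')   AS identity_LineItemId,
        NULLIF(identity_TimeInterval, '') AS identity_TimeInterval,
        NULLIF(bill_InvoiceId, '')        AS bill_InvoiceId,
        NULLIF(bill_BillingEntity, '')    AS bill_BillingEntity,
        NULLIF(bill_BillType, '')         AS bill_BillType,
        NULLIF(Engagement, '')            AS Engagement
    FROM billing_raw
"""

# Run the CTAS query; Athena writes the cleaned data to the external_location above
athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'}
)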

1 Answer


These would be the general steps to perform the task:

  • Create a table in the Glue catalog pointing to your source CSV data. You can use either Athena or a Glue crawler for this task.
  • Create a dynamic frame in your Glue script from the above Glue catalog table.
  • Convert the dynamic frame to a Spark dataframe.
  • Register the above dataframe as an in-memory temporary table in Spark.
  • Execute the required SQL query against the temporary table to transform your data.
  • Write the final dataframe back to S3.

Sample code:

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark.sql import SQLContext

# Initialise the Glue and Spark contexts
glueContext = GlueContext(SparkContext.getOrCreate())
spark_session = glueContext.spark_session
sqlContext = SQLContext(spark_session.sparkContext, spark_session)

# Create a dynamic frame from the Glue catalog table and convert it to a Spark dataframe
dyF = glueContext.create_dynamic_frame.from_catalog(database='{{database}}', table_name='{{table_name}}')
df = dyF.toDF()

# Register the dataframe as a temporary table and run the transformation SQL against it
df.registerTempTable('{{name}}')
df = sqlContext.sql("{{sql_query_on_above_created_table}}")

# Write the transformed dataframe back to S3 as parquet
df.write.format('parquet').save('{{s3_location}}')

Spark SQL nullif : https://spark.apache.org/docs/2.3.0/api/sql/index.html#nullif
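To make that concrete, here is a minimal sketch of the kind of query that could be passed to the sqlContext.sql() call above, assuming the temporary table was registered as my_table. Only the six columns from the question are listed, so any other columns you want to keep would need to be selected as well.

nullif_query = """
    SELECT
        NULLIF(identity_LineItemId, '')   AS identity_LineItemId,
        NULLIF(identity_TimeInterval, '') AS identity_TimeInterval,
        NULLIF(bill_InvoiceId, '')        AS bill_InvoiceId,
        NULLIF(bill_BillingEntity, '')    AS bill_BillingEntity,
        NULLIF(bill_BillType, '')         AS bill_BillType,
        NULLIF(Engagement, '')            AS Engagement
    FROM my_table
"""
# Returns a new dataframe with empty strings replaced by NULLs
df = sqlContext.sql(nullif_query)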

Harsh Bafna
  • Thank you very much! This correspondence and your examples have been a great help! However, I have run into a new issue! The script works as is for small data sets. But when I use my full data set (4 parquet files of about 100GB each) the script fails with this error: https://pastebin.com/4WLhh8Gu I think what I have to do is break up the data into smaller chunks and then maybe iterate over them. Is there any way you can help me past this hurdle? – bluethundr Sep 28 '19 at 16:28
  • Are you running your script on a dev endpoint or Glue ETL? And what's the number of DPUs you are using? – Harsh Bafna Sep 29 '19 at 04:41
  • I've actually resolved the issue with this line of code: `partitioned_dynamicframe = partitioned_dynamicframe.repartition(700)`. Before that the ETL job would fail after 10 hours. Adding that line gets it to complete successfully in under an hour! Thanks for your reply. I think I am good now. I was developing the script in Glue ETL. I need to start using the Dev endpoints. Also for my own information what are DPUs in this context? – bluethundr Sep 29 '19 at 12:54
  • Data Processing Units (or DPUs) are used to run your ETL job. A single Data Processing Unit (DPU) provides 4 vCPU and 16 GB of memory. You are charged based on the number of DPUs used and the time taken to complete the ETL job. I would recommend using Glue ETL only for your finished script. You should use Dev Endpoints for development and testing purposes only. – Harsh Bafna Sep 29 '19 at 13:02
  • OK thanks for the info! I'm a lot clearer on how to work with glue at this point. Really appreciate your correspondence! – bluethundr Sep 29 '19 at 13:31
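For context, a rough sketch of where the repartition call from the comments above could sit in the sample job: it goes on the transformed dataframe just before the write. The value 700 is simply what worked for this particular data set, not a general recommendation.

# Spread the rows across more, smaller tasks before writing the output
df = df.repartition(700)
df.write.format('parquet').save('{{s3_location}}')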