0

running spark sql in aws glue returns the column name in the queries

data:

product,price,quantityinKG

mango,100,1

apple,200,3

peach,200,2

mango,200,2

My Test Query

         eg : select product,sum(price)

              from myDataSource

              group by product

The output of the query should be

            product, sum(price)

            mango, 300

but output is :

            product, "sum(price)"

            mango,

There is nothing in the sum(price) column it only has the product nane , please can you help me with this behaviour of glue

bigDataArtist
  • 141
  • 1
  • 12
  • How did you get this data? What type of source was it? If using a crawler perhaps the column names were grabbed as part of the data, which can happen if your settings don't align to your source data file. – jonlegend Jun 29 '21 at 14:12
  • I crawled the data from s3 bucket actually using a glue crawler. have you worked with AWS GLUE – bigDataArtist Jun 29 '21 at 16:39
  • I'm very familiar with glue. What type of data are you crawling? CSV? Can you provide some example data? Are you using a classifier on your crawler? If so, what are the settings on it? – jonlegend Jun 29 '21 at 17:38
  • @jonlegend I have mentioned the dataset in the first line of my question. Yes I am using an inbuilt csv classifier. I didn't get the last part . what are the setting on ? Please can you elaborate the last part – bigDataArtist Jun 29 '21 at 17:57
  • Does your data have "data:" as the first line as in the code block above? – jonlegend Jun 29 '21 at 18:04
  • @jonlegend it starts from " product,price,quantityinKG" that's the first the line , Thanks so much for helping – bigDataArtist Jun 29 '21 at 19:13

1 Answers1

0

First of all, create or replace a local temporary view with your dataframe and then use the sql query

data.createOrReplaceTempView('data_table')
spark.sql("select product, sum(price) as sum_price from data_table group by product").show()

If you are using the glue dynamic dataframe, then first convert it into a spark dataframe using toDF() function before creating the temp view.

  • Its actually an AWS GLUE problem , have you worked with GLUE or is it s general answer related to spark? – bigDataArtist Jun 29 '21 at 16:40
  • I work on AWS Glue, do you think GLUE has a different process to execute the spark sql? – Mahesh Nema Jun 29 '21 at 16:46
  • Apologies , I didn't mean to question your expertise , I feel its more of crawler things , that's what I am guessing. let me paste me script """" def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame: for alias, frame in mapping.items(): frame.toDF().createOrReplaceTempView(alias) result = spark.sql(query) return DynamicFrame.fromDF(result, glueContext, transformation_ctx) SqlQuery0 = ''' select product,sum(price) as total from myDataSource group by product ''' – bigDataArtist Jun 29 '21 at 16:57
  • I actually wrote the script in glue studio as it has a sparkSQL tab where you can only enter sparkSQL code not scripts nothing. – bigDataArtist Jun 29 '21 at 17:04
  • it starts from " product,price,quantityinKG" that's the first the line , Thanks so much for helping – bigDataArtist Jun 29 '21 at 18:12
  • If you think it's a crawler issue I would suggest executing the same query in Athena. – Mahesh Nema Jun 30 '21 at 13:28
  • by running the query in the Athena you can verify whether your crawler crawled the data properly – Mahesh Nema Jun 30 '21 at 15:54
  • Thanks I think why its a crawler issue because the Data was read by it only and that was the only previous step involved.ok ill try to do that . thanks a lot . Will post of there is some more issue. – bigDataArtist Jul 01 '21 at 08:16
  • can you please help me out with this glue issue? Its a diff issue where we write our custom script. – bigDataArtist Jul 08 '21 at 17:41