Executing spark sql in aws glue returns the column name in the queries rather than values

Question

running spark sql in aws glue returns the column name in the queries

data:

product,price,quantityinKG

mango,100,1

apple,200,3

peach,200,2

mango,200,2

My Test Query

         eg : select product,sum(price)

              from myDataSource

              group by product

The output of the query should be

            product, sum(price)

            mango, 300

but output is :

            product, "sum(price)"

            mango,

There is nothing in the sum(price) column it only has the product nane , please can you help me with this behaviour of glue

How did you get this data? What type of source was it? If using a crawler perhaps the column names were grabbed as part of the data, which can happen if your settings don't align to your source data file. — jonlegend, Jun 29 '21 at 14:12
I crawled the data from s3 bucket actually using a glue crawler. have you worked with AWS GLUE — bigDataArtist, Jun 29 '21 at 16:39
I'm very familiar with glue. What type of data are you crawling? CSV? Can you provide some example data? Are you using a classifier on your crawler? If so, what are the settings on it? — jonlegend, Jun 29 '21 at 17:38
@jonlegend I have mentioned the dataset in the first line of my question. Yes I am using an inbuilt csv classifier. I didn't get the last part . what are the setting on ? Please can you elaborate the last part — bigDataArtist, Jun 29 '21 at 17:57
Does your data have "data:" as the first line as in the code block above? — jonlegend, Jun 29 '21 at 18:04
@jonlegend it starts from " product,price,quantityinKG" that's the first the line , Thanks so much for helping — bigDataArtist, Jun 29 '21 at 19:13

score 0 · Answer 1 · answered Jun 29 '21 at 13:42

0

First of all, create or replace a local temporary view with your dataframe and then use the sql query

data.createOrReplaceTempView('data_table')
spark.sql("select product, sum(price) as sum_price from data_table group by product").show()

If you are using the glue dynamic dataframe, then first convert it into a spark dataframe using toDF() function before creating the temp view.

answered Jun 29 '21 at 13:42

Mahesh Nema

31
5

Its actually an AWS GLUE problem , have you worked with GLUE or is it s general answer related to spark? – bigDataArtist Jun 29 '21 at 16:40
I work on AWS Glue, do you think GLUE has a different process to execute the spark sql? – Mahesh Nema Jun 29 '21 at 16:46
Apologies , I didn't mean to question your expertise , I feel its more of crawler things , that's what I am guessing. let me paste me script """" def sparkSqlQuery(glueContext, query, mapping, transformation_ctx) -> DynamicFrame: for alias, frame in mapping.items(): frame.toDF().createOrReplaceTempView(alias) result = spark.sql(query) return DynamicFrame.fromDF(result, glueContext, transformation_ctx) SqlQuery0 = ''' select product,sum(price) as total from myDataSource group by product ''' – bigDataArtist Jun 29 '21 at 16:57
I actually wrote the script in glue studio as it has a sparkSQL tab where you can only enter sparkSQL code not scripts nothing. – bigDataArtist Jun 29 '21 at 17:04
it starts from " product,price,quantityinKG" that's the first the line , Thanks so much for helping – bigDataArtist Jun 29 '21 at 18:12
If you think it's a crawler issue I would suggest executing the same query in Athena. – Mahesh Nema Jun 30 '21 at 13:28
by running the query in the Athena you can verify whether your crawler crawled the data properly – Mahesh Nema Jun 30 '21 at 15:54
Thanks I think why its a crawler issue because the Data was read by it only and that was the only previous step involved.ok ill try to do that . thanks a lot . Will post of there is some more issue. – bigDataArtist Jul 01 '21 at 08:16
can you please help me out with this glue issue? Its a diff issue where we write our custom script. – bigDataArtist Jul 08 '21 at 17:41

Executing spark sql in aws glue returns the column name in the queries rather than values

1 Answers1