
I've created a little test table using pyspark

query="""
CREATE EXTERNAL TABLE IF NOT EXISTS test1
(
c1 INT,
c2 INT,
c3 INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://mybucket/myfolder/'
"""
spark.sql(query)

This works fine, producing the following output:

spark.sql("select * from test1").show()

+---+---+---+
| c1| c2| c3|
+---+---+---+
|  1|  2|  3|
|  4|  5|  6|
+---+---+---+

My problem is trying to do an insert now. According to my reading of the Athena documentation, I should be able to do the following, but I'm getting an error message:

query="""
insert into test1(c1,c2,c3) select c1,c2,c3 from test1
"""
spark.sql(query)


Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 767, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 73, in deco
    raise ParseException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.ParseException: "\nmismatched input 'c1' expecting {'(', 'SELECT', 'FROM', 'VALUES', 'TABLE', 'INSERT', 'MAP', 'REDUCE'}(line 2, pos 21)\n\n== SQL ==\n\ninsert into test1(c1,c2,c3) select c1,c2,c3 from test1\n---------------------^^^\n"

However, the following INSERT works as expected:

query="""
insert into test1 select c1,c2,c3 from test1
"""

spark.sql(query)

If anyone can see what I'm doing wrong, it would be appreciated.

user2699504
  • This might help: https://stackoverflow.com/questions/54102101/spark-sql-insert-select-with-a-column-list (Spark SQL does not accept a column list in Insert SQL statement.) – Prabhatika Vij May 17 '23 at 03:05

1 Answer


As per the AWS documentation, you don't need to pass the column names along with the destination table. The correct query would be:

insert into test1 select c1,c2,c3 from test1
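
In other words, Spark SQL matches the SELECT columns to the target table's columns by position, not by name, so a column list after the table name is rejected by the parser. If your source column names differ from the target's, ordering them correctly in the SELECT is enough. A minimal stand-in sketch of this positional behaviour, using Python's built-in sqlite3 rather than Spark (the `source_table` name and its `a`, `b`, `c` columns are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Target table with columns c1, c2, c3 and a source table with different names.
cur.execute("CREATE TABLE test1 (c1 INT, c2 INT, c3 INT)")
cur.execute("CREATE TABLE source_table (a INT, b INT, c INT)")
cur.execute("INSERT INTO source_table VALUES (1, 2, 3), (4, 5, 6)")

# No column list on the target: a, b, c are matched to c1, c2, c3 by position.
cur.execute("INSERT INTO test1 SELECT a, b, c FROM source_table")

print(cur.execute("SELECT * FROM test1").fetchall())  # [(1, 2, 3), (4, 5, 6)]
```

The same positional INSERT ... SELECT (without a target column list) is what Spark SQL accepts, as in the working query above.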

Reference: Athena INSERT INTO documentation

Harsh Bafna
  • If you go a bit further down the page in the documentation link you provided you will see the following example : INSERT INTO cities_usa (city,state) SELECT city,state FROM cities_world WHERE country='usa' What would you do if your source column names were different from your target column names? – user2699504 Oct 07 '19 at 12:34
  • That is for insert into...values, not insert into...select. – Harsh Bafna Oct 07 '19 at 12:49
  • You are mistaken. The example I gave in my comment IS an INSERT INTO ... SELECT ... NOT an INSERT ... VALUES and is taken directly from the documentation – user2699504 Oct 10 '19 at 15:00