
I have an existing table, which I'd like to append two columns to. I create a Spark dataframe:

spark_df = spark.createDataFrame(df)

Then I'd like to use MERGE INTO like so:

spark.sql(f"""MERGE INTO x.y AS m
USING {spark_df} AS s
ON m.id = s.id
WHEN MATCHED THEN
 UPDATE SET m.id = s.id
WHEN NOT MATCHED 
 THEN INSERT (id, colnew_1) VALUES (id, spark_df[["myCol"]])""")

I get a syntax error when the statement tries to parse spark_df. Is this functionality possible? I understand that a Delta table has to be created first, so that the MERGE operation is supported. However, I'm a bit confused about the sequence of events. For example, I can create a Delta table like so:

CREATE TABLE x.y_delta (id bigint, colnew_1 bigint) USING delta

However, this table is empty. I suppose an intermediate step is to copy the original table in its entirety into this new Delta table, and then use the Delta table from that point on. Though I'm not convinced that this is right either.
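For example, something like CREATE TABLE ... AS SELECT should copy the contents in one step (the name x.y_delta_copy is just for illustration):

CREATE TABLE x.y_delta_copy USING delta AS SELECT * FROM x.y

though the copy would then only have the original columns, and the new columns would presumably still have to be added (e.g. via ALTER TABLE ... ADD COLUMNS) before a MERGE could populate them.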

pymat

1 Answer


As suggested by @blackbishop, create a temp view for the DataFrame.

df12.createOrReplaceTempView("temp_table1")

I followed the same suggestion and it works fine. Follow the steps below:

Code:

Sample DataFrame df12:

from pyspark.sql import types as T

# Sample data: id is the join key, myCol holds the new values to merge in
df12 = spark.createDataFrame(
    [
        (1, "vam", 400),
        (2, "gov", 456)
    ],
    T.StructType(
        [
            T.StructField("id", T.IntegerType(), True),
            T.StructField("col1", T.StringType(), True),
            T.StructField("myCol", T.IntegerType(), True)
        ]
    ),
)
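
For reference, df12.show() should print:

+--+----+-----+
|id|col1|myCol|
+--+----+-----+
| 1| vam|  400|
| 2| gov|  456|
+--+----+-----+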

Create the Delta table:

spark.sql("CREATE TABLE x.y_delta2 (id int, col1 string, myCol int) USING delta")
spark.sql("insert into x.y_delta2 values (1,'govind',123),(3,'deep',456)")

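In place of the original screenshot, the table contents after the insert can be checked with a quick query (row order may differ):

spark.sql("select * from x.y_delta2").show()

+--+------+-----+
|id|  col1|myCol|
+--+------+-----+
| 1|govind|  123|
| 3|  deep|  456|
+--+------+-----+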

Create the temp view:

df12.createOrReplaceTempView("temp_table1")

Merge operation:

spark.sql(f"""MERGE INTO x.y_delta2 AS m
USING temp_table1 AS s
ON m.id = s.id
WHEN MATCHED THEN
 UPDATE SET m.id = s.id
WHEN NOT MATCHED 
 THEN INSERT (m.id,m.col1,m.myCol) VALUES (s.id,s.col1,s.myCol)""")

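The matched row (id 1) is left effectively unchanged, since the UPDATE only reassigns id to itself; the unmatched row (id 2) is inserted. Querying the table again should show something like (row order may differ):

spark.sql("select * from x.y_delta2").show()

+--+------+-----+
|id|  col1|myCol|
+--+------+-----+
| 1|govind|  123|
| 2|   gov|  456|
| 3|  deep|  456|
+--+------+-----+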

B. B. Naga Sai Vamsi
  • thank you for this, I'm currently working through this and testing it. Just one comment so far: why use "insert into x.y_delta1..."? (I don't see the usage of x.y_delta_1) – pymat Feb 24 '23 at 08:51
  • That is inserting data into `x.y_delta2`; I think I misplaced the code in the picture while reproducing it, but I have updated the code now. – B. B. Naga Sai Vamsi Feb 24 '23 at 08:56
  • ok I thought as much, however instead of inserting the actual values explicitly, this delta table needs to mirror exactly the contents of another table (with two extra columns appended, for example resulting from your df12). Btw, this link (https://docs.databricks.com/delta/tutorial.html) states that "All tables created on Databricks use Delta Lake by default." So I'm wondering if this step is indeed necessary. – pymat Feb 24 '23 at 09:01
  • yeah, you are right, but I just did it for this sample demo. Coming to df12, createOrReplaceTempView registers the dataframe as a temporary view, which you can then reference in your SQL code. This step is done because we want to use the dataframe inside the SQL statement. – B. B. Naga Sai Vamsi Feb 24 '23 at 09:27
  • I'm having issues with the INSERT INTO. I have "INSERT INTO x.y_delta2 VALUES x.y" (the x.y_delta2 has 6 columns, the x.y has 4 columns). I'd like to have the table 'y' saved as delta, therefore all contents in y_delta2. Is this method still relevant? – pymat Feb 24 '23 at 09:43
  • This will not work, because the number of columns in the source table (`x.y`) and the target table (`x.y_delta2`) do not match. The extra columns may end up as null after the merge. – B. B. Naga Sai Vamsi Feb 24 '23 at 09:56
  • yes, I get "Cannot write to table due to mismatched user specified column size(4) and data column size(1)". I also tried something like "INSERT INTO x.y_delta2 (col1, col2) SELECT (col1, col2) FROM x.y". What do you recommend? Please note that the intention of this post is to add extra columns to an existing table, using a generated df (therefore the ids have to match) – pymat Feb 24 '23 at 10:01
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/252101/discussion-between-pymat-and-saivamsi). – pymat Feb 24 '23 at 10:10
  • One thing to add here is that when using "insert into" there needs to be compatibility with the number of columns. – pymat Feb 28 '23 at 09:56
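
For anyone hitting the column-count error above: an INSERT with an explicit column list should work (on Spark/Databricks versions that support it), provided the SELECT list is not wrapped in parentheses; Spark treats a parenthesized list such as (col1, col2) as a single struct column, which would explain the "data column size(1)" part of the error. A minimal sketch, assuming x.y has columns named id and col1:

spark.sql("INSERT INTO x.y_delta2 (id, col1) SELECT id, col1 FROM x.y")

Columns of x.y_delta2 that are not listed should end up as NULL, assuming they are nullable.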