I am new to AWS Glue and Spark. I have to union data from a CSV source and a table, then load the result into another table using AWS Glue PySpark code. I tried writing the code in Apache Zeppelin and it works fine there, but when I move it into a Glue PySpark job it gives me the error below. Accessing the CSV table's catalog data and schema from Apache Zeppelin works fine.
Error:
AnalysisException: "Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 18 columns;;\n'Union\n:- Project\n: +- SubqueryAlias
csvtable
\n: +- LogicalRDD false\n+- LogicalRDD [policyid#38L, statecode#39, county#40, eq_site_limit#41, hu_site_limit#42, fl_site_limit#43, fr_site_limit#44, tiv_2011#45, tiv_2012#46, eq_site_deductible#47L, hu_site_deductible#48, fl_site_deductible#49L, fr_site_deductible#50L, point_latitude#51, point_longitude#52, line#53, construction#54, point_granularity#55L], false\n"
Source code:
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import *
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
# Create a Glue context
glueContext = GlueContext(SparkContext())
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Create a DynamicFrame using the tables
Source_DynamicFrame = glueContext.create_dynamic_frame.from_catalog(database="star_test_db", table_name="sulfl_insurance_sample_csv", transformation_ctx="Source_DynamicFrame")
Destination_DynamicFrame = glueContext.create_dynamic_frame.from_catalog(database="star_test_db", table_name="test_dbo_fl_insurance_sample_csv", transformation_ctx="Destination_DynamicFrame")
Source_DataFrame = Source_DynamicFrame.toDF()
Destination_DataFrame = Destination_DynamicFrame.toDF()
Source_DataFrame.createOrReplaceTempView("csvtable")
tempview_table = spark.sql("SELECT * from csvtable")
spark_data_frame_union = tempview_table.union(Destination_DataFrame)
glue_dynamic_frame_union = DynamicFrame.fromDF(spark_data_frame_union, glueContext, "glue_dynamic_frame_union")
datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = glue_dynamic_frame_union, catalog_connection = "star_rds_con", connection_options = {"dbtable": "dbo.fl_insurance_sample_csv_join", "database": "test"}, transformation_ctx = "datasink1")
job.commit()
Here I created a temp view for the source CSV catalog table, but I have also tried unioning the DataFrames directly without the temp view, with the same result.