1

Iam new to AWS-glue and spark. I have to union data from a csv source and a table then load into another table using AWS glue Pyspark code. i tried by writting a code in Apache Zeppelin, its works fine here. but when i place the code and modify it to Glue Pyspark. it giving me an error. but when i access the csv table catalog data and schema in apache zeppelin its working fine.

Error:-

AnalysisException: "Union can only be performed on tables with the same number of columns, but the first table has 0 columns and the second table has 18 columns;;\n'Union\n:- Project\n: +- SubqueryAlias csvtable\n: +- LogicalRDD false\n+- LogicalRDD [policyid#38L, statecode#39, county#40, eq_site_limit#41, hu_site_limit#42, fl_site_limit#43, fr_site_limit#44, tiv_2011#45, tiv_2012#46, eq_site_deductible#47L, hu_site_deductible#48, fl_site_deductible#49L, fr_site_deductible#50L, point_latitude#51, point_longitude#52, line#53, construction#54, point_granularity#55L], false\n"

Source_code:-

import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.transforms import *
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Create a Glue context
glueContext = GlueContext(SparkContext())

spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Create a DynamicFrame using the tables

Source_DynamicFrame = glueContext.create_dynamic_frame.from_catalog(database="star_test_db",table_name="sulfl_insurance_sample_csv",transformation_ctx = "Source_DynamicFrame")
Destination_DynamicFrame = glueContext.create_dynamic_frame.from_catalog(database="star_test_db",table_name="test_dbo_fl_insurance_sample_csv",transformation_ctx = "Destination_DynamicFrame")

Source_DataFrame = Source_DynamicFrame.toDF()
Destination_DataFrame = Destination_DynamicFrame.toDF();

Source_DataFrame.createOrReplaceTempView("csvtable")
tempview_table = spark.sql("SELECT * from csvtable")

spark_data_frame_union = tempview_table.union(Destination_DataFrame)

glue_dynamic_frame_union = DynamicFrame.fromDF(spark_data_frame_union, glueContext, "glue_dynamic_frame_union")

datasink1 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = glue_dynamic_frame_union, catalog_connection = "star_rds_con", connection_options = {"dbtable": "dbo.fl_insurance_sample_csv_join", "database": "test"}, transformation_ctx = "datasink1")

job.commit()

Here i created a tempview for source csv table catalog but i tried with direct dataframes also.

  • can you add tempview_table.show() in your glue code?. I think that data frame has no column. – Mahesh Gupta Dec 11 '19 at 09:28
  • HI @Syed and welcome to StackOverflow. Seems like your ``tempview_table``value is useless since you retrieve all your data from ``Source_DataFrame ``. Can you print your schemas for ``Source_DataFrame`` and ``Destination_DataFrame`` ? – baitmbarek Dec 11 '19 at 09:29
  • try `Source_DataFrame.union(Destination_DataFrame)`. For union, both frames must have same no of columns. So check it before doing union on them. – Sandeep Fatangare Dec 12 '19 at 06:26

0 Answers0