I am having trouble writing a dynamic frame built from a table that I've read from the Data Catalog. The table is stored as parquet. When I call writeFrame, AWS Glue complains about a column that I am not selecting into the dynamic frame to be written, and that does not even exist in my table schema.
The error is:
py4j.protocol.Py4JJavaError: An error occurred while calling o135.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 129) (172.35.181.16 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file glue-d-workspace-db-t-test-minorur-tbgg327-oper-cred-regressao-m-r://<mybucketname>/<mysourcetablename>/anomesdia=20230406/legacy-3e4c49c4-75da-4d84-8b9b-f4e55f729f82.parquet. Column: [CDSITRIS], Expected: double, Found: BINARY
Please, could somebody help me with this case?
More details below. P.S.: sorry for the long text; I tried to detail as much as possible and masked some sensitive data.
The parquet files in the table location contain many columns. These parquet files were previously created by a legacy system.
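To double-check what the legacy system actually writes, my plan is to inspect the physical parquet schema of one of these files with pyarrow (a diagnostic sketch only, assuming pyarrow is available and a sample file has been copied locally):
import pyarrow.parquet as pq
# Read only the footer metadata of a locally copied legacy file;
# this should reveal the physical type of each column (e.g. CDSITRIS as BINARY)
schema = pq.read_schema("sample-legacy-file.parquet")
print(schema)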
When I call create_dynamic_frame.from_catalog and then printSchema(), the output shows all the fields generated by the legacy system:
root
|-- COD_OPER_CRED: string
|-- COD_PROD_CRED_ITAU: string
|-- NUM_CPF_CNPJ: long
|-- COD_TIPO_PESS: string
|-- SIG3STM: string
|-- ID_OPERACAO: long
|-- COD_TRSF_EXCS_GARA: string
|-- CODMOE: long
|-- ICSTATUS: string
|-- PERCGARF: string
|-- CDSITRIS: double
|-- DTATSITR: double
|-- COD_OPER_RISC_IBBA: long
|-- COD_PRIN_ENCR_OPER: string
|-- QTD_PARC_OPER: long
|-- DAT_TRSF_OPER: timestamp
|-- DAT_LIQU_OPER: timestamp
|-- DAT_ABET_OPER: timestamp
|-- DAT_VCTO_OPER: timestamp
|-- IND_ORIG_OPER_ATAC: string
|-- IND_CORR_MONE_OPER: string
|-- IND_ATUI_OPER: string
|-- DES_MODA_OPER_ATAC: string
|-- DES_TIPO_TXA_OPER: string
|-- DES_INDD_ECOO_OPER_ATAC: string
|-- DES_TIPO_VLRZ_OPER_ATAC: string
|-- QTD_COTA_UTLA_OPER: long
|-- PCT_IND_ECOO: decimal
|-- NUM_BOLE_ORIG: long
|-- NUM_BOLE_ATIV_EFET: long
|-- COD_OPER_ORIG_IBBA: string
|-- COD_ORIG_OPER: string
|-- COD_PESS_ATAC: long
|-- DAT_ALTE_RGTO: timestamp
|-- COD_RSPL_ALTE_RGTO: string
|-- DES_OBS_OPER: string
|-- COD_TIPO_ORIG_OPER_CRED: long
|-- COD_SITU_OPER_CRED: long
|-- IND_ADMP_OPER: string
|-- VLR_CNTR_OPER: double
|-- VLR_PRIN_OPER: double
|-- VLR_SALD_DEVE_OPER: double
|-- VLR_JURO_OPER: double
|-- VLR_VARI_OPER: double
|-- VLR_ATUL_OPER_MOED_ORIG: double
|-- VLR_ATUL_OPER_MOED_DOLA: double
|-- VLR_SFCA_DFCC_OPER: double
|-- VLR_SPRD_OPER: double
|-- anomesdia: string
And here is the SHOW CREATE TABLE output of the aforementioned Data Catalog table:
CREATE EXTERNAL TABLE `workspace_db.test_minorur_tbgg327_oper_cred_regressao`(
`cod_oper_cred` string COMMENT '',
`cod_prod_cred_itau` string COMMENT '',
`num_cpf_cnpj` bigint COMMENT '',
`cod_tipo_pess` string COMMENT '',
`sig3stm` string COMMENT '',
`id_operacao` int COMMENT '',
`cod_trsf_excs_gara` string COMMENT '',
`codmoe` int COMMENT '',
`icstatus` string COMMENT '',
`cod_oper_risc_ibba` bigint COMMENT '',
`cod_prin_encr_oper` string COMMENT '',
`qtd_parc_oper` int COMMENT '',
`dat_trsf_oper` date COMMENT '',
`dat_liqu_oper` date COMMENT '',
`dat_abet_oper` date COMMENT '',
`dat_vcto_oper` date COMMENT '',
`ind_orig_oper_atac` string COMMENT '',
`ind_corr_mone_oper` string COMMENT '',
`ind_atui_oper` string COMMENT '',
`des_moda_oper_atac` string COMMENT '',
`des_tipo_txa_oper` string COMMENT '',
`des_indd_ecoo_oper_atac` string COMMENT '',
`des_tipo_vlrz_oper_atac` string COMMENT '',
`qtd_cota_utla_oper` bigint COMMENT '',
`pct_ind_ecoo` decimal(7,4) COMMENT '',
`num_bole_orig` int COMMENT '',
`num_bole_ativ_efet` int COMMENT '',
`cod_oper_orig_ibba` string COMMENT '',
`cod_orig_oper` string COMMENT '',
`cod_pess_atac` int COMMENT '',
`dat_alte_rgto` date COMMENT '',
`cod_rspl_alte_rgto` string COMMENT '',
`des_obs_oper` string COMMENT '',
`cod_tipo_orig_oper_cred` int COMMENT '',
`cod_situ_oper_cred` int COMMENT '',
`ind_admp_oper` string COMMENT '')
PARTITIONED BY (
`anomesdia` int COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://<mybucketname>/<mysourcetablename>/'
TBLPROPERTIES (
)
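Note that the CDSITRIS column from the error message does not appear in this DDL at all. To confirm, the catalog definition can also be fetched programmatically (a sketch using the boto3 Glue client, with the database and table names shown above):
import boto3
glue = boto3.client("glue")
table = glue.get_table(
    DatabaseName="workspace_db",
    Name="test_minorur_tbgg327_oper_cred_regressao"
)
columns = [c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]]
print("cdsitris" in columns)  # False: the column is absent from the catalog schema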
Here is the code snippet that I used to try to resolve this case:
import datetime as dt
import sys
from pyspark.sql import DataFrame
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, lit, regexp_extract, to_date
import boto3
from pyspark.sql.types import StringType
# Creating a dynamic frame from existing table in catalog
dyf = self.glueContext.create_dynamic_frame.from_catalog(
database="workspace_db",
table_name="<mysourcetablename>",
transformation_ctx="dyf_oper_cred",
push_down_predicate="anomesdia=20230406"
)
print("Dyf schema:\n")
dyf.printSchema()
# Transforming DynamicFrame to DataFrame to use select expressions
df = dyf.toDF()
# Selecting only the desired columns
df_prep = df.selectExpr(
"DAT_ABET_OPER",
"COD_TIPO_PESS",
"NUM_CPF_CNPJ",
"COD_OPER_CRED",
"COD_PROD_CRED_ITAU",
"SIG3STM"
)
print("df_prep schema:\n")
df_prep.printSchema()
print("testing show df_prep")
df_prep.show()
# Adding a partition column
df_partitioned = df_prep.withColumn("anomesdia", lit(20230406))
print("df_partitioned schema:\n")
df_partitioned.printSchema()
print("testing show df_partitioned")
df_partitioned.show(5)
# Transforming dataframe in dynamicframe
dyf_final = DynamicFrame.fromDF(df_partitioned, self.glueContext, "dyf_final")
# Configuring datasink
data_sink = self.glueContext.getSink(
path="s3://<mybucketname>/<myfinaltablename>",
connection_type="s3",
updateBehavior="UPDATE_IN_DATABASE",
partitionKeys=["anomesdia"],
compression="snappy",
enableUpdateCatalog=True,
transformation_ctx="data_sink",
)
# Adding entry for data catalog
data_sink.setCatalogInfo(
catalogDatabase="<mydbname>",
catalogTableName="<myfinaltablename>"
)
data_sink.setFormat(self.args["DATA_FORMAT"], useGlueParquetWriter=True)
# Writing in catalog
data_sink.writeFrame(dyf_final)
Notes:
For my data analysis I only need some fields, so I use selectExpr on the dataframe to select just the desired columns.
(The selected fields will later be used to enrich our Spec Layer through joins with other tables.)
I then convert this dataframe back to a dynamic frame so that I can write the results to a table in my Data Catalog.
However, when the writeFrame method is called, it fails with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o135.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 129) (172.35.181.16 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file glue-d-workspace-db-t-test-minorur-tbgg327-oper-cred-regressao-m-r://<mybucketname>/<mysourcetablename>/anomesdia=20230406/legacy-3e4c49c4-75da-4d84-8b9b-f4e55f729f82.parquet. Column: [CDSITRIS], Expected: double, Found: BINARY
I was a bit confused: why is AWS Glue complaining about a column that I am not even using when writing my dynamic frame?
So, I tried an approach using DynamicFrame resolveChoice.
Below is the snippet that I inserted just after the create_dynamic_frame.from_catalog call:
dyf_resolved = dyf.resolveChoice(choice="make_cols")
print("schema after resolvChoice is:\n")
dyf_resolved.printSchema()
I was expecting the CDSITRIS field to appear twice: once as double and once as binary. However, the printSchema output was the same as before, and the same error persists.
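For completeness, I also considered targeting the column with an explicit spec instead of a blanket choice (a sketch only; "cast:string" is one of the documented resolveChoice actions, though I am unsure it applies here since CDSITRIS is not in the catalog schema):
# Hypothetical variant: resolve only the problematic column explicitly
dyf_resolved = dyf.resolveChoice(specs=[("CDSITRIS", "cast:string")])
dyf_resolved.printSchema()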
I am not able to modify the parquet file generation in the legacy system, because other teams use these files for their own data analyses.
I would really like to learn an approach to overcome this problem: I only need a few fields from this table, yet AWS Glue complains about a field that does not even appear in my table schema.
I tried writeFrame, resolveChoice, and selectExpr. I was expecting these to ignore the CDSITRIS field, but they did not work, and I am still unable to write my dynamic frame to the Data Catalog.
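One untested idea I am considering is to bypass the catalog read entirely and rely on plain Spark, which prunes parquet columns at the reader level, so a column that is never selected should never be decoded. A minimal sketch (the partition path is illustrative):
# Untested workaround: read the partition directly with Spark so that
# column pruning skips the mismatched CDSITRIS column entirely
spark = self.glueContext.spark_session
df = (spark.read
      .parquet("s3://<mybucketname>/<mysourcetablename>/anomesdia=20230406/")
      .select("DAT_ABET_OPER", "COD_TIPO_PESS", "NUM_CPF_CNPJ",
              "COD_OPER_CRED", "COD_PROD_CRED_ITAU", "SIG3STM"))
dyf_final = DynamicFrame.fromDF(df, self.glueContext, "dyf_final")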
Please, could somebody help me with this case?