I am having trouble writing a dynamic frame built from a table that I've read from the Data Catalog. The table is stored as parquet. When I call writeFrame, AWS Glue complains about a column that I am not selecting into the dynamic frame to be written, and that does not even exist in my table schema.
The error is:
py4j.protocol.Py4JJavaError: An error occurred while calling o135.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 129) (172.35.181.16 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file glue-d-workspace-db-t-test-minorur-tbgg327-oper-cred-regressao-m-r://<mybucketname>/<mysourcetablename>/anomesdia=20230406/legacy-3e4c49c4-75da-4d84-8b9b-f4e55f729f82.parquet. Column: [CDSITRIS], Expected: double, Found: BINARY
Please, could somebody help me with this case?
More details below. P.S.: sorry for the long text; I tried to detail as much as possible and masked some sensitive data.
The parquet files in the table location contain many columns. These parquet files were previously created by a legacy system.
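To double-check what the legacy system actually writes, my plan is to inspect the physical parquet schema of one of these files with pyarrow (a diagnostic sketch only, assuming pyarrow is available and a sample file has been copied locally):
import pyarrow.parquet as pq
# Read only the footer metadata of a locally copied legacy file;
# this should reveal the physical type of each column (e.g. CDSITRIS as BINARY)
schema = pq.read_schema("sample-legacy-file.parquet")
print(schema)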
When I call create_dynamic_frame.from_catalog and then printSchema(), the output shows all the fields generated by the legacy system:
root
|-- COD_OPER_CRED: string
|-- COD_PROD_CRED_ITAU: string
|-- NUM_CPF_CNPJ: long
|-- COD_TIPO_PESS: string
|-- SIG3STM: string
|-- ID_OPERACAO: long
|-- COD_TRSF_EXCS_GARA: string
|-- CODMOE: long
|-- ICSTATUS: string
|-- PERCGARF: string
|-- CDSITRIS: double
|-- DTATSITR: double
|-- COD_OPER_RISC_IBBA: long
|-- COD_PRIN_ENCR_OPER: string
|-- QTD_PARC_OPER: long
|-- DAT_TRSF_OPER: timestamp
|-- DAT_LIQU_OPER: timestamp
|-- DAT_ABET_OPER: timestamp
|-- DAT_VCTO_OPER: timestamp
|-- IND_ORIG_OPER_ATAC: string
|-- IND_CORR_MONE_OPER: string
|-- IND_ATUI_OPER: string
|-- DES_MODA_OPER_ATAC: string
|-- DES_TIPO_TXA_OPER: string
|-- DES_INDD_ECOO_OPER_ATAC: string
|-- DES_TIPO_VLRZ_OPER_ATAC: string
|-- QTD_COTA_UTLA_OPER: long
|-- PCT_IND_ECOO: decimal
|-- NUM_BOLE_ORIG: long
|-- NUM_BOLE_ATIV_EFET: long
|-- COD_OPER_ORIG_IBBA: string
|-- COD_ORIG_OPER: string
|-- COD_PESS_ATAC: long
|-- DAT_ALTE_RGTO: timestamp
|-- COD_RSPL_ALTE_RGTO: string
|-- DES_OBS_OPER: string
|-- COD_TIPO_ORIG_OPER_CRED: long
|-- COD_SITU_OPER_CRED: long
|-- IND_ADMP_OPER: string
|-- VLR_CNTR_OPER: double
|-- VLR_PRIN_OPER: double
|-- VLR_SALD_DEVE_OPER: double
|-- VLR_JURO_OPER: double
|-- VLR_VARI_OPER: double
|-- VLR_ATUL_OPER_MOED_ORIG: double
|-- VLR_ATUL_OPER_MOED_DOLA: double
|-- VLR_SFCA_DFCC_OPER: double
|-- VLR_SPRD_OPER: double
|-- anomesdia: string
And here is the SHOW CREATE TABLE output of the aforementioned Data Catalog table:
CREATE EXTERNAL TABLE `workspace_db.test_minorur_tbgg327_oper_cred_regressao`(
`cod_oper_cred` string COMMENT '',
`cod_prod_cred_itau` string COMMENT '',
`num_cpf_cnpj` bigint COMMENT '',
`cod_tipo_pess` string COMMENT '',
`sig3stm` string COMMENT '',
`id_operacao` int COMMENT '',
`cod_trsf_excs_gara` string COMMENT '',
`codmoe` int COMMENT '',
`icstatus` string COMMENT '',
`cod_oper_risc_ibba` bigint COMMENT '',
`cod_prin_encr_oper` string COMMENT '',
`qtd_parc_oper` int COMMENT '',
`dat_trsf_oper` date COMMENT '',
`dat_liqu_oper` date COMMENT '',
`dat_abet_oper` date COMMENT '',
`dat_vcto_oper` date COMMENT '',
`ind_orig_oper_atac` string COMMENT '',
`ind_corr_mone_oper` string COMMENT '',
`ind_atui_oper` string COMMENT '',
`des_moda_oper_atac` string COMMENT '',
`des_tipo_txa_oper` string COMMENT '',
`des_indd_ecoo_oper_atac` string COMMENT '',
`des_tipo_vlrz_oper_atac` string COMMENT '',
`qtd_cota_utla_oper` bigint COMMENT '',
`pct_ind_ecoo` decimal(7,4) COMMENT '',
`num_bole_orig` int COMMENT '',
`num_bole_ativ_efet` int COMMENT '',
`cod_oper_orig_ibba` string COMMENT '',
`cod_orig_oper` string COMMENT '',
`cod_pess_atac` int COMMENT '',
`dat_alte_rgto` date COMMENT '',
`cod_rspl_alte_rgto` string COMMENT '',
`des_obs_oper` string COMMENT '',
`cod_tipo_orig_oper_cred` int COMMENT '',
`cod_situ_oper_cred` int COMMENT '',
`ind_admp_oper` string COMMENT '')
PARTITIONED BY (
`anomesdia` int COMMENT '')
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://<mybucketname>/<mysourcetablename>/'
TBLPROPERTIES (
)
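Note that the CDSITRIS column from the error message does not appear in this DDL at all. To confirm, the catalog definition can also be fetched programmatically (a sketch using the boto3 Glue client, with the database and table names shown above):
import boto3
glue = boto3.client("glue")
table = glue.get_table(
    DatabaseName="workspace_db",
    Name="test_minorur_tbgg327_oper_cred_regressao"
)
columns = [c["Name"] for c in table["Table"]["StorageDescriptor"]["Columns"]]
print("cdsitris" in columns)  # False: the column is absent from the catalog schema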
Here is the code snippet that I used to try to resolve this case:
import datetime as dt
import sys
from pyspark.sql import DataFrame
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, lit, regexp_extract, to_date
import boto3
from pyspark.sql.types import StringType
# Creating a dynamic frame from existing table in catalog
dyf = self.glueContext.create_dynamic_frame.from_catalog(
database="workspace_db",
table_name="<mysourcetablename>",
transformation_ctx="dyf_oper_cred",
push_down_predicate="anomesdia=20230406"
)
print("Dyf schema:\n")
dyf.printSchema()
# Transforming DynamicFrame to DataFrame to use select expressions
df = dyf.toDF()
# Selecting only the desired columns
df_prep = df.selectExpr(
"DAT_ABET_OPER",
"COD_TIPO_PESS",
"NUM_CPF_CNPJ",
"COD_OPER_CRED",
"COD_PROD_CRED_ITAU",
"SIG3STM"
)
print("df_prep schema:\n")
df_prep.printSchema()
print("testing show df_prep")
df_prep.show()
# Adding a partition column
df_partitioned = df_prep.withColumn("anomesdia", lit(20230406))
print("df_partitioned schema:\n")
df_partitioned.printSchema()
print("testing show df_partitioned")
df_partitioned.show(5)
# Transforming dataframe in dynamicframe
dyf_final = DynamicFrame.fromDF(df_partitioned, self.glueContext, "dyf_final")
# Configuring datasink
data_sink = self.glueContext.getSink(
path="s3://<mybucketname>/<myfinaltablename>",
connection_type="s3",
updateBehavior="UPDATE_IN_DATABASE",
partitionKeys=["anomesdia"],
compression="snappy",
enableUpdateCatalog=True,
transformation_ctx="data_sink",
)
# Adding entry for data catalog
data_sink.setCatalogInfo(
catalogDatabase="<mydbname>",
catalogTableName="<myfinaltablename>"
)
data_sink.setFormat(self.args["DATA_FORMAT"], useGlueParquetWriter=True)
# Writing in catalog
data_sink.writeFrame(dyf_final)
Notes:
For my data analysis I only need some fields, so I use selectExpr on the dataframe to select just the desired columns.
(The selected fields will later be used to enrich our Spec Layer through joins with other tables.)
I then convert this dataframe back to a dynamic frame so that I can write the results to a table in my Data Catalog.
However, when the writeFrame method is called, it fails with the following error:
py4j.protocol.Py4JJavaError: An error occurred while calling o135.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 129) (172.35.181.16 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file glue-d-workspace-db-t-test-minorur-tbgg327-oper-cred-regressao-m-r://<mybucketname>/<mysourcetablename>/anomesdia=20230406/legacy-3e4c49c4-75da-4d84-8b9b-f4e55f729f82.parquet. Column: [CDSITRIS], Expected: double, Found: BINARY
I was a bit confused: why is AWS Glue complaining about a column that I am not even using when writing my dynamic frame?
So, I tried an approach using DynamicFrame resolveChoice.
Below is the snippet that I inserted just after the create_dynamic_frame.from_catalog call:
dyf_resolved = dyf.resolveChoice(choice="make_cols")
print("schema after resolvChoice is:\n")
dyf_resolved.printSchema()
I was expecting the CDSITRIS field to appear twice: once as double and once as binary. However, the printSchema output was the same as before, and the same error persists.
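For completeness, I also considered targeting the column with an explicit spec instead of a blanket choice (a sketch only; "cast:string" is one of the documented resolveChoice actions, though I am unsure it applies here since CDSITRIS is not in the catalog schema):
# Hypothetical variant: resolve only the problematic column explicitly
dyf_resolved = dyf.resolveChoice(specs=[("CDSITRIS", "cast:string")])
dyf_resolved.printSchema()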
I am not able to modify the parquet file generation in the legacy system, because other teams use these files for their own data analyses.
I would really like to learn an approach to overcome this problem: I only need a few fields from this table, yet AWS Glue complains about a field that does not even appear in my table schema.
I tried writeFrame, resolveChoice, and selectExpr. I was expecting these to ignore the CDSITRIS field, but they did not work, and I am still unable to write my dynamic frame to the Data Catalog.
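One untested idea I am considering is to bypass the catalog read entirely and rely on plain Spark, which prunes parquet columns at the reader level, so a column that is never selected should never be decoded. A minimal sketch (the partition path is illustrative):
# Untested workaround: read the partition directly with Spark so that
# column pruning skips the mismatched CDSITRIS column entirely
spark = self.glueContext.spark_session
df = (spark.read
      .parquet("s3://<mybucketname>/<mysourcetablename>/anomesdia=20230406/")
      .select("DAT_ABET_OPER", "COD_TIPO_PESS", "NUM_CPF_CNPJ",
              "COD_OPER_CRED", "COD_PROD_CRED_ITAU", "SIG3STM"))
dyf_final = DynamicFrame.fromDF(df, self.glueContext, "dyf_final")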
Please, could somebody help me with this case?