
I am having trouble writing a DynamicFrame built from a table that I read from the Data Catalog. The table is stored in Parquet format. When I call writeFrame, AWS Glue complains about a column that I am not selecting in the DynamicFrame to be written and that does not even exist in my table schema.

The error is:

py4j.protocol.Py4JJavaError: An error occurred while calling o135.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 129) (172.35.181.16 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file glue-d-workspace-db-t-test-minorur-tbgg327-oper-cred-regressao-m-r://<mybucketname>/<mysourcetablename>/anomesdia=20230406/legacy-3e4c49c4-75da-4d84-8b9b-f4e55f729f82.parquet. Column: [CDSITRIS], Expected: double, Found: BINARY

Please, could somebody help me with this case?


More details below. PS: sorry for the long text; I tried to detail as much as possible and also masked some sensitive data.


The Parquet files in the table location contain many columns. These files were previously created by a legacy system.

When I call create_dynamic_frame.from_catalog and then printSchema(), the output shows all the fields that are generated by the legacy system.
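
For reference, a condensed version of that call is below (database and table names are masked; the full snippet appears further down):

dyf = self.glueContext.create_dynamic_frame.from_catalog(
    database="workspace_db",
    table_name="<mysourcetablename>",
    push_down_predicate="anomesdia=20230406"
)
dyf.printSchema()  # shows all the fields generated by the legacy system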

Full schema:
root
|-- COD_OPER_CRED: string
|-- COD_PROD_CRED_ITAU: string
|-- NUM_CPF_CNPJ: long
|-- COD_TIPO_PESS: string
|-- SIG3STM: string
|-- ID_OPERACAO: long
|-- COD_TRSF_EXCS_GARA: string
|-- CODMOE: long
|-- ICSTATUS: string
|-- PERCGARF: string
|-- CDSITRIS: double
|-- DTATSITR: double
|-- COD_OPER_RISC_IBBA: long
|-- COD_PRIN_ENCR_OPER: string
|-- QTD_PARC_OPER: long
|-- DAT_TRSF_OPER: timestamp
|-- DAT_LIQU_OPER: timestamp
|-- DAT_ABET_OPER: timestamp
|-- DAT_VCTO_OPER: timestamp
|-- IND_ORIG_OPER_ATAC: string
|-- IND_CORR_MONE_OPER: string
|-- IND_ATUI_OPER: string
|-- DES_MODA_OPER_ATAC: string
|-- DES_TIPO_TXA_OPER: string
|-- DES_INDD_ECOO_OPER_ATAC: string
|-- DES_TIPO_VLRZ_OPER_ATAC: string
|-- QTD_COTA_UTLA_OPER: long
|-- PCT_IND_ECOO: decimal
|-- NUM_BOLE_ORIG: long
|-- NUM_BOLE_ATIV_EFET: long
|-- COD_OPER_ORIG_IBBA: string
|-- COD_ORIG_OPER: string
|-- COD_PESS_ATAC: long
|-- DAT_ALTE_RGTO: timestamp
|-- COD_RSPL_ALTE_RGTO: string
|-- DES_OBS_OPER: string
|-- COD_TIPO_ORIG_OPER_CRED: long
|-- COD_SITU_OPER_CRED: long
|-- IND_ADMP_OPER: string
|-- VLR_CNTR_OPER: double
|-- VLR_PRIN_OPER: double
|-- VLR_SALD_DEVE_OPER: double
|-- VLR_JURO_OPER: double
|-- VLR_VARI_OPER: double
|-- VLR_ATUL_OPER_MOED_ORIG: double
|-- VLR_ATUL_OPER_MOED_DOLA: double
|-- VLR_SFCA_DFCC_OPER: double
|-- VLR_SPRD_OPER: double
|-- anomesdia: string
And here is the SHOW CREATE TABLE output for the aforementioned Data Catalog table:
CREATE EXTERNAL TABLE `workspace_db.test_minorur_tbgg327_oper_cred_regressao`(
  `cod_oper_cred` string COMMENT '', 
  `cod_prod_cred_itau` string COMMENT '', 
  `num_cpf_cnpj` bigint COMMENT '', 
  `cod_tipo_pess` string COMMENT '', 
  `sig3stm` string COMMENT '', 
  `id_operacao` int COMMENT '', 
  `cod_trsf_excs_gara` string COMMENT '', 
  `codmoe` int COMMENT '', 
  `icstatus` string COMMENT '', 
  `cod_oper_risc_ibba` bigint COMMENT '', 
  `cod_prin_encr_oper` string COMMENT '', 
  `qtd_parc_oper` int COMMENT '', 
  `dat_trsf_oper` date COMMENT '', 
  `dat_liqu_oper` date COMMENT '', 
  `dat_abet_oper` date COMMENT '', 
  `dat_vcto_oper` date COMMENT '', 
  `ind_orig_oper_atac` string COMMENT '', 
  `ind_corr_mone_oper` string COMMENT '', 
  `ind_atui_oper` string COMMENT '', 
  `des_moda_oper_atac` string COMMENT '', 
  `des_tipo_txa_oper` string COMMENT '', 
  `des_indd_ecoo_oper_atac` string COMMENT '', 
  `des_tipo_vlrz_oper_atac` string COMMENT '', 
  `qtd_cota_utla_oper` bigint COMMENT '', 
  `pct_ind_ecoo` decimal(7,4) COMMENT '', 
  `num_bole_orig` int COMMENT '', 
  `num_bole_ativ_efet` int COMMENT '', 
  `cod_oper_orig_ibba` string COMMENT '', 
  `cod_orig_oper` string COMMENT '', 
  `cod_pess_atac` int COMMENT '', 
  `dat_alte_rgto` date COMMENT '', 
  `cod_rspl_alte_rgto` string COMMENT '', 
  `des_obs_oper` string COMMENT '', 
  `cod_tipo_orig_oper_cred` int COMMENT '', 
  `cod_situ_oper_cred` int COMMENT '', 
  `ind_admp_oper` string COMMENT '')
PARTITIONED BY ( 
  `anomesdia` int COMMENT '')
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://<mybucketname>/<mysourcetablename>/'
TBLPROPERTIES (
)
Here is the code snippet that I used to try to resolve this case:
import datetime as dt
import sys
from pyspark.sql import DataFrame
from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import col, lit, regexp_extract, to_date
import boto3
from pyspark.sql.types import StringType

# Creating a dynamic frame from existing table in catalog
dyf = self.glueContext.create_dynamic_frame.from_catalog(
    database="workspace_db",
    table_name="<mysourcetablename>",
    transformation_ctx="dyf_oper_cred",
    push_down_predicate="anomesdia=20230406"
    )

print("Dyf schema:\n")
dyf.printSchema()

# Transforming DynamicFrame to DataFrame to use select expressions
df = dyf.toDF()

# Selecting only the desired columns
df_prep = df.selectExpr(
    "DAT_ABET_OPER",
    "COD_TIPO_PESS",
    "NUM_CPF_CNPJ",
    "COD_OPER_CRED",
    "COD_PROD_CRED_ITAU",
    "SIG3STM"
)

print("df_prep schema:\n")
df_prep.printSchema()

print("testing show df_prep")
df_prep.show()

# Adding a partition column
df_partitioned = df_prep.withColumn("anomesdia", lit(20230406))

print("df_partitioned schema:\n")
df_partitioned.printSchema()

print("testing show df_partitioned")
df_partitioned.show(5)

# Converting the DataFrame back to a DynamicFrame
dyf_final = DynamicFrame.fromDF(df_partitioned, self.glueContext, "dyf_final")

# Configuring datasink
data_sink = self.glueContext.getSink(
            path="s3://<mybucketname>/<myfinaltablename>",
            connection_type="s3",
            updateBehavior="UPDATE_IN_DATABASE",
            partitionKeys="anomesdia",
            compression="snappy",
            enableUpdateCatalog=True,
            transformation_ctx="data_sink",
        )

# Adding an entry to the Data Catalog
data_sink.setCatalogInfo(
    catalogDatabase="<mydbname>",
    catalogTableName="<myfinaltablename>"
)

data_sink.setFormat(self.args["DATA_FORMAT"], useGlueParquetWriter=True)

# Writing to the catalog
data_sink.writeFrame(dyf_final)

Notes:

For my data analysis I only need a few fields, so I am using selectExpr on my DataFrame to select just the desired ones. (These fields will later be used to enrich our Spec Layer by joining with other tables.)
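
For reference, here is a DynamicFrame-level sketch of the same projection using select_fields; I assume it behaves equivalently, but I have not confirmed that it avoids the error:

# Hypothetical alternative: project the fields on the DynamicFrame itself
dyf_selected = dyf.select_fields([
    "DAT_ABET_OPER",
    "COD_TIPO_PESS",
    "NUM_CPF_CNPJ",
    "COD_OPER_CRED",
    "COD_PROD_CRED_ITAU",
    "SIG3STM"
])
dyf_selected.printSchema()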

Then I convert this DataFrame back to a DynamicFrame so that I can write the results to a table in my Data Catalog.

However, when the writeFrame method is called, it fails with the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling o135.pyWriteDynamicFrame.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 7.0 failed 4 times, most recent failure: Lost task 6.3 in stage 7.0 (TID 129) (172.35.181.16 executor 1): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file glue-d-workspace-db-t-test-minorur-tbgg327-oper-cred-regressao-m-r://<mybucketname>/<mysourcetablename>/anomesdia=20230406/legacy-3e4c49c4-75da-4d84-8b9b-f4e55f729f82.parquet. Column: [CDSITRIS], Expected: double, Found: BINARY

I was a bit confused: why is AWS Glue complaining about a column that I am not even using when writing my DynamicFrame?

So I tried an approach using DynamicFrame's resolveChoice. Below is the snippet I inserted just after the create_dynamic_frame.from_catalog call:

dyf_resolved = dyf.resolveChoice(choice="make_cols")
print("schema after resolvChoice is:\n")
dyf_resolved.printSchema()

I was expecting the CDSITRIS field to appear twice: once as double and once as binary. However, the printSchema output was the same as before, and the same error persists.
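
To clarify what I expected, this is the kind of per-column resolution I was hoping resolveChoice would let me express (the "cast:string" target here is only illustrative, and I have not confirmed it handles the double/BINARY mismatch):

# Hypothetical specs-based resolveChoice; the cast target is only a guess
dyf_resolved_specs = dyf.resolveChoice(specs=[("CDSITRIS", "cast:string")])
dyf_resolved_specs.printSchema()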


I am not able to modify how the Parquet files are generated by the legacy system, because other teams use these files for their own data analysis.

I would really like to learn an approach to overcome this problem: I only need a few fields from this table, yet AWS Glue is complaining about a field that does not even appear in my table schema.

I tried writeFrame, resolveChoice and selectExpr, expecting that the CDSITRIS field would be ignored, but it did not work and I am unable to write my DynamicFrame to the Data Catalog.

Please, could somebody help me with this case?
