
I have to validate fixed-width files that I am reading from S3 into Glue. I have the length of each column, and I have to write a Glue job to validate these files.

How do I efficiently check the lengths of every row to filter out the records which don't have the correct total_length?

What is the best way to read such files?

I tried reading the file as CSV into a single column (col0) of the Dynamic Frame and filtering on length with Filter, but this gives me a dictionary:

bad_length_DF = dynamicFramerawtxt.filter(lambda x: len(x['col0']) != total_row_len)

How do I remove the records with the wrong lengths from my Dynamic Frame and create an ERROR Dynamic Frame?

Krokomot

2 Answers


My general recommendation is to use a Spark DataFrame instead of a Glue DynamicFrame unless you need the built-in transformations (doc) or Glue job bookmarks (doc).
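If you do need a DynamicFrame at some step (for example for job bookmarks), you can convert between the two representations. A minimal sketch, assuming a GlueContext named glueContext and an existing DynamicFrame named raw_dyf (both names are hypothetical here):

from awsglue.dynamicframe import DynamicFrame

# Convert the DynamicFrame to a Spark DataFrame to do the length checks...
df = raw_dyf.toDF()

# ...and wrap the result back into a DynamicFrame when a later step needs one
result_dyf = DynamicFrame.fromDF(df, glueContext, "result_dyf")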

Below is a complete PySpark script for your scenario.

from pyspark.sql import SparkSession
from pyspark.sql.functions import length

spark = SparkSession.builder.getOrCreate()

# Two sample rows: the first matches the expected width, the second is one character too long
data = [
    {"col0": "0123456789"},
    {"col0": "01234567890"}
]
total_length = 10

df = spark.createDataFrame(data)

# Compare each row's length against the expected total width
df_bad = df.filter(length(df.col0) != total_length)
df_good = df.filter(length(df.col0) == total_length)
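
As for reading such files: one common approach is to read every line as raw text and slice the fixed-width columns out afterwards with substring. A sketch, where col_specs, the field names, and the S3 path are placeholders for your actual layout:

from pyspark.sql import SparkSession
from pyspark.sql.functions import length, substring

spark = SparkSession.builder.getOrCreate()

# Hypothetical layout: (field name, 1-based start position, width); adjust to your spec
col_specs = [("field_a", 1, 10), ("field_b", 11, 5)]
total_row_len = 15

# Read each line of the raw fixed-width file into a single 'value' column
raw_df = spark.read.text("s3://<bucket_name>/<subfolders>/")

# Rows whose total length is wrong go to the error DataFrame
df_good = raw_df.filter(length(raw_df.value) == total_row_len)
df_bad = raw_df.filter(length(raw_df.value) != total_row_len)

# Slice each fixed-width field out of the valid rows
parsed_df = df_good.select(
    *[substring("value", start, width).alias(name) for name, start, width in col_specs]
)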
lsc

If you want to use a Dynamic Frame, you can do it as below. Note that the column type must be string.

from awsglue.context import GlueContext
from awsglue.transforms import Filter
from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

# Read the raw files from S3; note that the 'paths' option takes a list
raw_dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://<bucket_name>/<subfolders>/"]},
    format="<data_type>")

# Split records into good and bad by row length
good_length_dyf = Filter.apply(frame=raw_dyf, f=lambda x: len(x['<column_name>']) == total_row_len)
bad_length_dyf = Filter.apply(frame=raw_dyf, f=lambda x: len(x['<column_name>']) != total_row_len)
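
To keep the rejected records as an error output, the bad Dynamic Frame can then be written to its own location; a sketch, assuming a hypothetical errors/ prefix:

# Persist the rejected records to a separate S3 prefix as the error output
glueContext.write_dynamic_frame.from_options(
    frame=bad_length_dyf,
    connection_type="s3",
    connection_options={"path": "s3://<bucket_name>/errors/"},
    format="csv")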

Thanks.

Eren Sakarya