In my opinion, the best approach is to create an external table over the CSV files and then load the data into a second table, stored in an S3 bucket in Parquet format. You will not have to write any script in that case, just a few SQL queries.
-- External table mapped directly onto the raw CSV files in S3
CREATE EXTERNAL TABLE databasename.CSV_EXT_Module(
    recordType BIGINT,
    servedIMSI BIGINT,
    ggsnAddress STRING,
    chargingID BIGINT,
    ...
    ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://module/input/csv_files/'
TBLPROPERTIES ("skip.header.line.count"="1");  -- skip the CSV header row
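You can sanity-check the mapping before converting anything; the query below is just an illustrative check, not part of the conversion:

SELECT * FROM databasename.CSV_EXT_Module LIMIT 10;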
The table above is only an external table mapped onto the CSV files; queries against it still scan the raw text. Create another table on top of it if you want queries to run faster:
-- CTAS: in Hive, STORED AS and LOCATION must come before AS SELECT
CREATE TABLE databasename.RAW_Module
STORED AS PARQUET
LOCATION 's3://module/raw/parquet_files/'
AS
SELECT
    recordType,
    servedIMSI,
    ggsnAddress,
    chargingID,
    ...
    regexp_extract(INPUT__FILE__NAME, '(.*)/(.*)', 2) AS filename
FROM databasename.CSV_EXT_Module;
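From here on, query databasename.RAW_Module instead of the external table. As a quick illustrative check that the filename column came through, you could run something like:

SELECT filename, COUNT(*) AS rows_per_file
FROM databasename.RAW_Module
GROUP BY filename;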
Adjust the regexp_extract pattern and group index to extract the part of the input file name you need.
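For example, if you only want the base name without the .csv extension, a pattern like the one below should work (the exact pattern is an assumption, adjust it to your file naming):

-- hypothetical variant: captures 'abc' out of '.../csv_files/abc.csv'
regexp_extract(INPUT__FILE__NAME, '([^/]+)\\.csv$', 1) AS filename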