In my opinion, the best approach is to create an external table over the CSV files and then load the data into a second table, stored in an S3 bucket in Parquet format. You will not have to write any script in that case, just a few SQL queries.
-- External table mapped directly onto the raw CSV files in S3
CREATE EXTERNAL TABLE databasename.CSV_EXT_Module(
    recordType BIGINT,
    servedIMSI BIGINT,
    ggsnAddress STRING,
    chargingID BIGINT,
    ...
    ...
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://module/input/csv_files/'
TBLPROPERTIES ("skip.header.line.count"="1");  -- skip the CSV header row
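You can sanity-check the mapping before converting anything; the query below is just an illustrative check, not part of the conversion:

SELECT * FROM databasename.CSV_EXT_Module LIMIT 10;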
The table above is only an external table mapped onto the CSV files; queries against it still scan the raw text. Create another table on top of it if you want queries to run faster:
-- CTAS: in Hive, STORED AS and LOCATION must come before AS SELECT
CREATE TABLE databasename.RAW_Module
STORED AS PARQUET
LOCATION 's3://module/raw/parquet_files/'
AS
SELECT
    recordType,
    servedIMSI,
    ggsnAddress,
    chargingID,
    ...
    regexp_extract(INPUT__FILE__NAME, '(.*)/(.*)', 2) AS filename
FROM databasename.CSV_EXT_Module;
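From here on, query databasename.RAW_Module instead of the external table. As a quick illustrative check that the filename column came through, you could run something like:

SELECT filename, COUNT(*) AS rows_per_file
FROM databasename.RAW_Module
GROUP BY filename;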
Adjust the regexp_extract pattern and group index to extract the part of the input file name you need.
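For example, if you only want the base name without the .csv extension, a pattern like the one below should work (the exact pattern is an assumption, adjust it to your file naming):

-- hypothetical variant: captures 'abc' out of '.../csv_files/abc.csv'
regexp_extract(INPUT__FILE__NAME, '([^/]+)\\.csv$', 1) AS filename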