
I have an HDFS file containing a list of CSV files, all with the same format. I need to be able to LOAD them together with Pig. E.g.:

/path/to/files/2013/01-01/qwe123.csv
/path/to/files/2013/01-01/asd123.csv
/path/to/files/2013/01-01/zxc321.csv
/path/to/files/2013/01-02/ert435.csv
/path/to/files/2013/01-02/fgh987.csv
/path/to/files/2013/01-03/vbn764.csv

They cannot be globbed, as their names are "random" hashes and their directories might contain other CSV files.

ddinchev

2 Answers


You aren't restricted to simple wildcard globbing; you can enumerate the exact files you want inside a brace pattern:

data = LOAD '/path/to/files/2013/01-{01/qwe123,01/asd123,01/zxc321,02/ert435,02/fgh987,03/vbn764}.csv';

reo katoa
  • Parsing the file names would be kind of a burden; also, I do not have control over the paths I get in the file. They might change. – ddinchev Aug 16 '13 at 16:08
  • Construct a string like `{path1,path2,path3}` and pass it in as a parameter. – reo katoa Aug 16 '13 at 16:21
  • @Veseliq run a shell script to ls the file names you're interested in and concat them to a string like winnie mentions above. – jtravaglini Aug 16 '13 at 16:22
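
Building on these comment suggestions, here is a minimal sketch of constructing such a string and passing it in as a parameter, assuming the wanted paths are already listed one per line in an HDFS file (file_list.txt is only an illustrative name):

# Join the listed paths with commas and wrap the result in braces,
# so the whole thing can be handed to Pig as one path expression.
FLIST="{$(hdfs dfs -cat file_list.txt | paste -sd, -)}"
pig -param flist="$FLIST" script.pig

Inside script.pig the parameter is then referenced as '$flist' in the LOAD statement, as shown in the answer below.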

As suggested in the comments, you can do this by pre-processing the file list. Suppose your HDFS file is called file_list.txt; then you can do the following:

pig -param flist=`hdfs dfs -cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}'` script.pig

The awk code gets rid of the newline characters and uses commas to separate the file names.
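
For example, if file_list.txt contained the six paths from the question, the command substitution would pass a single comma-separated string to the script:

/path/to/files/2013/01-01/qwe123.csv,/path/to/files/2013/01-01/asd123.csv,/path/to/files/2013/01-01/zxc321.csv,/path/to/files/2013/01-02/ert435.csv,/path/to/files/2013/01-02/fgh987.csv,/path/to/files/2013/01-03/vbn764.csv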

In your script (called script.pig in my example), you should use parameter substitution to load the data:

data = LOAD '$flist';
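
Since the files are CSV, the LOAD will usually also specify a comma delimiter and, optionally, a schema. A minimal sketch of script.pig, where the field names and types are illustrative assumptions only:

-- script.pig: $flist is supplied on the command line via -param
-- PigStorage(',') splits each line on commas; the schema below is hypothetical.
data = LOAD '$flist' USING PigStorage(',') AS (id:chararray, value:int);
DUMP data;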
cabad