
I have an HDFS file containing a list of CSV files, all with the same format. I need to be able to LOAD them together with Pig. E.g.:

/path/to/files/2013/01-01/qwe123.csv
/path/to/files/2013/01-01/asd123.csv
/path/to/files/2013/01-01/zxc321.csv
/path/to/files/2013/01-02/ert435.csv
/path/to/files/2013/01-02/fgh987.csv
/path/to/files/2013/01-03/vbn764.csv

They cannot be globbed, as their names are "random" hashes and their directories might contain other CSV files.

ddinchev

2 Answers


You aren't restricted to simple wildcard globbing; you can enumerate the exact files you want inside a brace pattern:

data = LOAD '/path/to/files/2013/01-{01/qwe123,01/asd123,01/zxc321,02/ert435,02/fgh987,03/vbn764}.csv';

reo katoa
  • Parsing the file names would be kind of a burden; also, I do not have control over the paths I get in the file. They might change. – ddinchev Aug 16 '13 at 16:08
  • Construct a string like `{path1,path2,path3}` and pass it in as a parameter. – reo katoa Aug 16 '13 at 16:21
  • @Veseliq run a shell script to ls the file names you're interested in and concat them to a string like winnie mentions above. – jtravaglini Aug 16 '13 at 16:22
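
Building on these comment suggestions, here is a minimal sketch of constructing such a string and passing it in as a parameter, assuming the wanted paths are already listed one per line in an HDFS file (file_list.txt is only an illustrative name):

# Join the listed paths with commas and wrap the result in braces,
# so the whole thing can be handed to Pig as one path expression.
FLIST="{$(hdfs dfs -cat file_list.txt | paste -sd, -)}"
pig -param flist="$FLIST" script.pig

Inside script.pig the parameter is then referenced as '$flist' in the LOAD statement, as shown in the answer below.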

As suggested in the comments, you can do this by pre-processing the file list. Suppose your HDFS file is called file_list.txt; then you can do the following:

pig -param flist=`hdfs dfs -cat file_list.txt | awk 'BEGIN{ORS="";}{if (NR == 1) print; else print ","$0;}'` script.pig

The awk code gets rid of the newline characters and uses commas to separate the file names.
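
For example, if file_list.txt contained the six paths from the question, the command substitution would pass a single comma-separated string to the script:

/path/to/files/2013/01-01/qwe123.csv,/path/to/files/2013/01-01/asd123.csv,/path/to/files/2013/01-01/zxc321.csv,/path/to/files/2013/01-02/ert435.csv,/path/to/files/2013/01-02/fgh987.csv,/path/to/files/2013/01-03/vbn764.csv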

In your script (called script.pig in my example), you should use parameter substitution to load the data:

data = LOAD '$flist';
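
Since the files are CSV, the LOAD will usually also specify a comma delimiter and, optionally, a schema. A minimal sketch of script.pig, where the field names and types are illustrative assumptions only:

-- script.pig: $flist is supplied on the command line via -param
-- PigStorage(',') splits each line on commas; the schema below is hypothetical.
data = LOAD '$flist' USING PigStorage(',') AS (id:chararray, value:int);
DUMP data;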
cabad