
I'm trying to look at the files in my HDFS and assess which ones are older than a certain date. I'd like to do an HDFS ls and pass its output into a Pig LOAD command.

In an answer to How Can I Load Every File In a Folder Using PIG?, @DonaldMiner includes a shell script that outputs the filenames; I borrowed this to pass in a list of filenames. However, I don't want to load the contents of the files; I just want to load the output of the ls command and treat the filenames as text.

Here is myfirstscript.pig:

test = LOAD '$files' as (moddate:chararray, modtime:chararray, filename:chararray);

illustrate test;

which I call like this:

pig -p files="`./filesysoutput.sh`" myfirstscript.pig 

where filesysoutput.sh contains:

hadoop fs -ls -R /hbase/imagestore | grep '\-rw' | awk 'BEGIN { FS = ",[ \t]*|[ \t]+" } {print $6, $7, $8}' | tr '\n' ','

This generates output like:

2012-07-27 17:56 /hbase/imagestore/.tableinfo.0000000001,2012-04-23 19:27 /hbase/imagestore/08e36507d743367e1de57c359360b64c/.regioninfo,2012-05-10 12:13 /hbase/imagestore/08e36507d743367e1de57c359360b64c/0/7818124910159371133,2012-05-10 15:01 /hbase/imagestore/08e36507d743367e1de57c359360b64c/1/5537238047267916113,2012-05-09 19:40 /hbase/imagestore/08e36507d743367e1de57c359360b64c/2/6836317764645542272,2012-05-10 07:04 /hbase/imagestore/08e36507d743367e1de57c359360b64c/3/7276147895747401630,...

Since all I want is the date, time, and file name, I only include those fields in the output fed to the Pig script. But when I try to run this, it's definitely trying to load the actual files into the test alias - LOAD treats the comma-separated parameter as a list of input paths, and the spaces between date, time, and path break the URI parsing:

 $ pig -p files="`./filesysoutput.sh`" myfirstscript.pig 
2013-05-29 17:40:10.773 java[50830:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:10.827 java[50830:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:20,570 [main] INFO  org.apache.pig.Main - Logging error messages to: /Users/username/Environment/pig-0.9.2-cdh4.0.1/scripts/test/pig_1369863620569.log
2013-05-29 17:40:20,769 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://stage-hadoop101.cluster:8020
2013-05-29 17:40:20,771 [main] WARN  org.apache.hadoop.conf.Configuration - mapred.used.genericoptionsparser is deprecated. Instead, use mapreduce.client.genericoptionsparser.used
2013-05-29 17:40:20,773 [main] WARN  org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2013-05-29 17:40:20.836 java[50847:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:20.879 java[50847:1203] Unable to load realm info from SCDynamicStore
2013-05-29 17:40:21,138 [main] WARN  org.apache.hadoop.conf.Configuration - fs.default.name is deprecated. Instead, use fs.defaultFS
2013-05-29 17:40:21,452 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1200: Pig script failed to parse: 
<file myfirstscript.pig, line 3, column 7> pig script failed to validate: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: 2012-07-27 17:56%20/hbase/imagestore/.tableinfo.0000000001
Details at logfile: /Users/username/Environment/pig-0.9.2-cdh4.0.1/scripts/test/pig_1369863620569.log
barclay
  • Please help with this: http://stackoverflow.com/questions/38706919/funtion-to-convert-specific-date-range-to-hdfs-glob-pattern – user2924175 Aug 02 '16 at 20:48
  • Sorry - I haven't been using pig for about 3 years, so I don't know how much help I can be. – barclay Aug 03 '16 at 22:49

2 Answers


You could try an alternative approach - use a dummy.txt input file (with a single line), then use the STREAM alias THROUGH command to process the output of the hadoop fs -ls pipeline you already have:

grunt> dummy = load '/tmp/dummy.txt';   
grunt> fs -cat /tmp/dummy.txt;
dummy
grunt> files = STREAM dummy THROUGH 
    `hadoop fs -ls -R /hbase/imagestore | grep '\-rw' | awk 'BEGIN { OFS="\t"; FS = ",[ \t]*|[ \t]+" } {print $6, $7, $8}'` 
    AS (moddate:chararray, modtime:chararray, filename:chararray);
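
From there, pulling out the files older than a given date should just be a FILTER on the date field. This is untested and the cutoff below is only an example, but a plain string comparison works because the dates are formatted yyyy-MM-dd:

grunt> old_files = FILTER files BY moddate < '2012-06-01';
grunt> dump old_files;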

Note the above is untested - I mocked up something similar with local-mode Pig and it worked (note I added the OFS option to awk and had to change the grep slightly):

grunt> files = STREAM dummy THROUGH \
    `ls -l | grep "\\-rw" | awk 'BEGIN { OFS = "\t"; FS = ",[ \t]*|[ \t]+" } {print $6, $7, $9}'` \
     AS (month:chararray, day:chararray, file:chararray);

grunt> dump files;

(Dec,31,CTX.DAT)
(Dec,23,examples.desktop)
(Feb,8,installer.img.gz)
(Feb,8,install.py)
(Apr,25,mapred-site.xml)
(Apr,14,sqlite)
Chris White
  • It's a really cool idea, but after playing around with it a lot I just don't think this works. I am able to get it to run and not fail; but while it claims it "Successfully stored records in: 'file:/tmp/temp168676408/tmp-77338624'", that location doesn't exist on my system. The same outcome occurs in local or mapreduce mode. I think I am going to have to try creating my own external script using Perl. – barclay May 31 '13 at 14:02
  • YES! - I didn't understand that whatever content is in dummy.txt gets streamed through the script (or whatever is within the backticks). I had an empty dummy.txt file, so nothing was there to stream through the script, and nothing came out of it to be assigned to the alias. I did see that you had content in yours, but I didn't understand why that was important, so I didn't include it. Once I put it in, I got content out of my script. Hope I'm making sense. Thanks @ChrisWhite, a winnar is you! – barclay May 31 '13 at 14:33

How about using embedded Pig, based on Java or Python?

Embedded pig
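
For example, here's a rough sketch of the Java route with PigServer - untested, and the listing file path and cutoff date are just placeholders:

import java.util.Iterator;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;

public class OldFiles {
    public static void main(String[] args) throws Exception {
        // local mode for testing; use ExecType.MAPREDUCE to run against the cluster
        PigServer pig = new PigServer(ExecType.LOCAL);
        // placeholder file holding the output of the hadoop fs -ls pipeline
        pig.registerQuery("files = LOAD '/tmp/listing.txt' AS (moddate:chararray, modtime:chararray, filename:chararray);");
        // placeholder cutoff; yyyy-MM-dd strings compare correctly as text
        pig.registerQuery("old = FILTER files BY moddate < '2012-06-01';");
        Iterator<Tuple> it = pig.openIterator("old");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}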

satish