
I want to combine the HDFS directory listing with awk. Is this workable? I mean the directory name, not the file name. Here is my awk command, which works fine locally:

awk 'NR <= 1000 && FNR == 1{print FILENAME}' ./* 

And then I want to combine it with hadoop fs -ls like this:

hadoop fs -ls xxx/* | xargs awk 'NR <= 1000 && FNR == 1{print FILENAME}'

but it shows me:

awk: cmd. line:2: fatal: cannot open file `-rwxrwxrwx' for reading (No such file or directory)

I have also tried variants like:

awk 'NR <= 1000 && FNR == 1{print FILENAME}' < hadoop fs -ls xxx/*
awk 'NR <= 1000 && FNR == 1{print FILENAME}' < $(hadoop fs -ls xxx/*)
awk 'NR <= 1000 && FNR == 1{print FILENAME}' $(hadoop fs -ls xxx/*)

These all failed, unsurprisingly. My understanding is that when awk is given files in a directory, it needs to read each file itself, unlike file content, which can be streamed to it. Am I right? Can someone give me a workable solution? Thanks in advance.

Yang Xu
  • Can you try this? `awk 'NR <= 1000 && FNR == 1{print FILENAME}' <(hadoop fs -ls xxx/*)` – BarathVutukuri Jul 20 '21 at 09:58
  • What is the output of `hadoop fs -ls xxx/*`? – James Brown Jul 20 '21 at 10:04
  • _How to list only the file names in HDFS_ https://stackoverflow.com/questions/21569172/how-to-list-only-the-file-names-in-hdfs and especially this looks promising: https://stackoverflow.com/a/38740023/4162356 – James Brown Jul 20 '21 at 10:06
  • @BarathVutukuri That showed `/dev/fd/63`; it didn't work – Yang Xu Jul 20 '21 at 10:07
  • @JamesBrown A lot of part files – Yang Xu Jul 20 '21 at 10:08
  • @JamesBrown I don't mean to list file names directly, I need to execute the logic in my awk, and then list file names. – Yang Xu Jul 20 '21 at 10:49
  • I understand. The error you get implies that `hadoop fs -ls` outputs more file info than just names, hence you need to get rid of that extra output. I can't be sure as you didn't show the actual output of the command, and therefore I can only offer pointers to solving the problem. Good luck! – James Brown Jul 20 '21 at 11:04
  • Sorry, I mixed things up; I get your point now. It works with `{print NR}` or `{print FNR}` but not with `{print FILENAME}`, which just displays '-'. Here is one of the file names from the list: '/user/test/part-00295-3753f202-946c-4a4b-8ae6-c270a2b5048b-c000', after I append ' | sed '1d;s/ */ /g' | cut -d\ -f8 ' – Yang Xu Jul 20 '21 at 11:21

1 Answer


It seems to me that you want to access files that are on a Hadoop file-system (HDFS). This is a virtual file-system, and locally you only have access to the meta-data of your files. If you want to operate on a file's contents, you first have to copy the file locally, which can be done using hadoop fs -get. After creating a local copy, you can start operating on the files. There is, however, an alternative way using hadoop fs -cat.
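
A minimal sketch of the copy-locally approach, assuming xxx is the question's placeholder path and ./hdfs_local is just an illustrative scratch directory:

# Copy the HDFS directory to a local scratch directory (name is illustrative),
# then run the original awk command on the local copies.
hadoop fs -get xxx ./hdfs_local
awk 'NR <= 1000 && FNR == 1{print FILENAME}' ./hdfs_local/*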

Normally I would say Never parse the output of ls, but with Hadoop you don't have a choice here. The output of hadoop fs -ls is not similar to the standard output of the Unix/Linux command ls; it is closely related to ls -l and returns the following fields:

permissions number_of_replicas userid groupid filesize modification_date modification_time filename
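
For illustration, a single line of hadoop fs -ls output looks roughly like this (the listing typically starts with a 'Found N items' header line; all values here are made up):

-rwxrwxrwx   3 hdfs hdfs   1048576 2021-07-20 09:58 /user/test/part-00000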

Using this and piping it to awk, we get the list of file names that are of use. So we can now just set up a while-loop:

c=0                                    # cumulative line count over all files
while read -r file; do
   # print the file name while we are still within the first 1000 lines
   [ "$c" -le 1000 ] && echo "${file}"
   # stream the remote file with -cat and count its lines
   nr=$(hadoop fs -cat "${file}" | wc -l)
   ((c+=nr))
done < <(hadoop fs -ls xxx/* | awk '!/^d/{print substr($0,index($0,$8))}')

Here substr($0,index($0,$8)) keeps everything from the position where the 8th field (the name) starts, so file names containing spaces survive, and !/^d/ skips directory entries. Note the argument order: index($0,$8) searches for the name inside the whole line, not the other way around.

Note: your initial error was due to the non-Unix-like output of hadoop fs -ls. Through xargs, awk received -rwxrwxrwx as a file name, while it is actually the permission string of one of the listed files.

kvantour
  • I don't know anything about hadoop file or directory (`xxx`) names, but can they contain white space so that `hadoop fs -ls xxx/* | awk '!/^d/{print $NF}'` would fail in those cases? Can the modification date/time format vary depending on how old the file is, like with `ls -l`? I see at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#stat that hadoop has a `fs -stat` command - that may be more robust and/or simpler to use than `fs -ls`. – Ed Morton Jul 20 '21 at 13:49
  • It is possible that there are spaces in the file names, but it seems to be tremendously difficult to create files with spaces. Hadoop does not come with a simple file-listing system or an advanced find program. One could of course create a construct based on the position of `$8`; that might work – kvantour Jul 20 '21 at 15:14
  • Wouldn't `hadoop fs -stat '%n'` work to just get the file name instead of `hadoop fs -ls`? Idk, just seemed from the man page like that'd work. – Ed Morton Jul 20 '21 at 15:16
  • That might actually work if we do something like `hadoop fs -stat "%n\0"`, which would give a robust way to handle all file names (see the sketch after these comments). – kvantour Jul 20 '21 at 15:19
  • So, if I want to use awk to operate on files that live on HDFS, that is, pass those files as arguments to awk, it is impossible: awk cannot read remote files. I've switched to another approach. – Yang Xu Jul 21 '21 at 02:13
  • As a supplement, here is another way, mentioned above, to make ls show only the file names: https://stackoverflow.com/questions/21569172/how-to-list-only-the-file-names-in-hdfs – Yang Xu Jul 21 '21 at 02:16
  • Are `$8` and `$0` reversed in the `index()` call? – Yang Xu Jul 21 '21 at 02:29
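
Following up on the fs -stat idea from the comments above, a minimal sketch; it assumes hadoop fs -stat honors the \0 escape as the comment suggests, and xxx is the question's placeholder path:

# Print one bare file name (%n) per entry, NUL-delimited so that
# names containing whitespace are handled safely.
hadoop fs -stat '%n\0' xxx/* | while IFS= read -r -d '' name; do
    printf '%s\n' "$name"
done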