
I know that from the terminal, one can use the find command to find files, such as:

find . -type d -name "*something*" -maxdepth 4 

But when I am in the Hadoop file system, I have not found a way to do this.

hadoop fs -find ....

throws an error.

How do people traverse files in Hadoop? I'm using Hadoop 2.6.0-cdh5.4.1.

makansij
  • It "throws an error"? What error? `find` is what I expect most people use. – Dave Newton Oct 01 '15 at 20:46
  • For future help-seekers: on `hadoop 2.6.0-cdh5.4.1`, it seems that this doesn't work: `hadoop fs -ls -R `, but a reasonable solution is: `hadoop fs -ls -R | egrep ` – makansij Oct 22 '15 at 22:18

4 Answers


hadoop fs -find was introduced in Apache Hadoop 2.7.0. Most likely you're using an older version, which is why you don't have it yet. See HADOOP-8989 for more information.
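
On 2.7.0 and later, usage looks like this (the example is the one given in the Hadoop docs):

hadoop fs -find / -name test -print

Note that the initial implementation only supports the -name/-iname tests plus -print, so the -type and -maxdepth options from the question are still unavailable there.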

In the meantime you can use

hdfs dfs -ls -R <pattern>

e.g.: hdfs dfs -ls -R /demo/order*.*

but that's not as powerful as find, of course, and lacks some basics. From what I understand, people have been writing scripts around it to get past this limitation; a sketch of that approach follows.
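
For example, here is a minimal sketch of such a wrapper that approximates the question's find on HDFS (the script name, argument order, and defaults are my own assumptions, not a standard tool):

#!/bin/bash
# hdfs-find-dirs.sh: rough HDFS stand-in for
#   find <root> -maxdepth <depth> -type d -name "*<pattern>*"
# Usage: ./hdfs-find-dirs.sh /demo something 4
root="${1%/}"        # starting path, trailing slash stripped
pattern=$2           # substring/regex to match against the directory name
maxdepth=${3:-999}   # default: effectively unlimited depth

hdfs dfs -ls -R "$root" | awk -v root="$root" -v pat="$pattern" -v maxd="$maxdepth" '
  BEGIN { rootn = split(root, tmp, "/") }  # components in the start path
  /^d/ {                                   # permissions start with "d" => directory
    path = $NF                             # the path is the last column of -ls output
    n = split(path, parts, "/")
    if (n - rootn <= maxd && parts[n] ~ pat)
      print path                           # caveat: breaks on paths containing spaces
  }'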

Legato
  • Thanks. Any idea how to use the `hadoop fs -find` "expression" option? The docs say the following operators are recognised: `expression -a expression`, `expression -and expression`, and `expression expression`, but I have no idea what this means. – user9074332 Sep 06 '19 at 18:33
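
All three operator forms express a logical AND between two expressions, so, for example (the /demo path and the pattern are illustrative), the following commands should behave identically:

hadoop fs -find /demo -name "order*" -a -print
hadoop fs -find /demo -name "order*" -print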

If you are using the Cloudera stack, try the find tool:

org.apache.solr.hadoop.HdfsFindTool

Set the command to a bash variable:

COMMAND='hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool'

Usage is as follows:

${COMMAND} -find . -name "something" -type d ...
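
For example, to mirror the question's search (the /some/path here is a placeholder):

${COMMAND} -find /some/path -type d -name "*something*"

HdfsFindTool mimics most of the options of the GNU find command, which is what makes it closer to a real find than -ls -R.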
Ricky Grewal

If you don't have the Cloudera parcels available, you can use awk.

hdfs dfs -ls -R /some_path | awk -F / '/^d/ && (NF <= 5) && /something/' 

that's almost equivalent to the find . -type d -name "*something*" -maxdepth 4 command from the question: /^d/ keeps only directories, NF <= 5 approximates the depth limit (each line is split on /, so NF grows with path depth), and /something/ filters on the name.

Emmanuel

Adding HdfsFindTool as an alias in .bash_profile makes it easy to use.

Add the lines below to your profile:

alias hdfsfind='hadoop jar /opt/cloudera/parcels/CDH/lib/solr/contrib/mr/search-mr-job.jar org.apache.solr.hadoop.HdfsFindTool'
alias hdfs='hadoop fs'

You can use them as follows now (here I'm using the find tool to get the file name and record count for each file, per HDFS source folder):

$> cnt=1; for ff in $(hdfsfind -find /dev/abc/*/2018/02/16/*.csv -type f); do pp=$(echo ${ff} | awk -F"/" '{print $7}'); fn=$(basename ${ff}); fcnt=$(hdfs -cat ${ff} | wc -l); echo "${cnt}=${pp}=${fn}=${fcnt}"; cnt=$(expr ${cnt} + 1); done

This prints one numbered line per CSV file in the form counter=folder=filename=count.

To simply get folder/file details:

$> hdfsfind -find /dev/abc/ -type f -name "*.csv"
$> hdfsfind -find /dev/abc/ -type d -name "toys"