21

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:

bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time

..which returns several entries like this:

-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab

How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.

AKIWEB

5 Answers

37

This is a Hadoop "filesystem", not a POSIX one, so try this:

# list the files, keep only the path column, then grep each file's contents
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
while read -r f
do
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done

This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"

Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is relevant in your configuration.
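
If you want a rough way to pick that number, here is a sketch (the values are hypothetical, and each run repeats the full search, so try it on a small sample directory first):

# time the search at increasing parallelism levels and watch for the knee
for p in 2 5 10 20; do
  echo "parallelism: $p"
  time (hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
    xargs -n 1 -I ^ -P "$p" bash -c \
    "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^")
done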

EDIT: Given that you're on SunOS (which is slightly brain-dead), try this:

# SunOS grep lacks -q, so redirect the output to /dev/null instead
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
while read f; do
  hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo "$f"
done
phs
  • The problem with this is that it's not a UNIX file system, it's a Hadoop file system. Whenever I try something like `bash-3.00$ cd /apps/hdmi-technology/b_dps/real-time` I get `bash: cd: /apps/hdmi-technology/b_dps/real-time: No such file or directory`. – arsenal Jul 28 '12 at 02:50
  • You're positive this directory exists? Can you mount it to a location, and then cd into it? – plast1K Jul 28 '12 at 02:53
  • I am not sure whether I can do this, as that folder has TBs of data inside. And how would I mount it to a location, by the way? – arsenal Jul 28 '12 at 02:56
  • Thanks phs for the solution; can I just copy-paste the above command into the bash prompt directly, or do I need to do something else? – arsenal Jul 28 '12 at 03:20
  • When I copy-pasted the first command you mentioned (the slow one, as you said), my screen keeps printing this continuously, one line after another: `grep: illegal option -- q Usage: grep -hblcnsviw pattern file . . . Usage: java FsShell [-cat ] grep: illegal option -- q Usage: grep -hblcnsviw pattern file . . . cat: Unable to write to output stream. grep: illegal option -- q` Any idea why, or is it working fine? – arsenal Jul 28 '12 at 03:22
  • Copy-paste should do it. Mind, I don't have your cluster in front of me to test. However based on http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html I feel this should work. – phs Jul 28 '12 at 03:23
  • And after trying the second command you gave me, I am getting `bash-3.00$ hadoop fs -ls /apps/hdmi-technology/b_apdpds/real-time | awk '{print $8}' | \ xargs -n 1 -I ^ -P 10 bash -c \ "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^" bash: : command not found`; command not found, I don't know why. – arsenal Jul 28 '12 at 03:25
  • Good heavens. What operating system are you using? – phs Jul 28 '12 at 03:26
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/14553/discussion-between-phs-and-rjchar) – phs Jul 28 '12 at 03:26
  • Short answer: he's using SunOS, there's nothing intrinsically wrong with the approach – phs Jul 28 '12 at 03:29
  • BTW, you can also pass -C to the ls command to avoid having to call awk: hadoop fs -ls -C /apps/hdmi-technology/b_dps/real-time – Alvaro Mendez Jul 28 '21 at 22:54
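
Putting that last comment to use, a simplified variant of the loop above (a sketch; -ls -C is only available on reasonably recent Hadoop releases):

hadoop fs -ls -C /apps/hdmi-technology/b_dps/real-time | while read -r f; do
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done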
2

You are looking to apply the grep command to an HDFS folder:

hdfs dfs -cat /user/coupons/input/201807160000/* | grep -c null

Here cat streams every file matched by the glob in the folder, and grep -c counts the lines containing the string.
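
If you want the count per file rather than one total, a sketch along the same lines (assuming the same folder; -ls -C needs a recent Hadoop):

# print each file's path followed by its match count
hdfs dfs -ls -C /user/coupons/input/201807160000 | while read -r f; do
  echo "$f: $(hdfs dfs -cat "$f" | grep -c null)"
done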

Pang
Mukesh Gupta
0

Using hadoop fs -cat (or the more generic hadoop fs -text) might be feasible if you just have two 1 GB files. For 100 files, though, I would use the streaming API, because it can be used for ad-hoc queries without resorting to a full-fledged MapReduce job. E.g. in your case, create a script get_filename_for_pattern.sh:

#!/bin/bash
# emit the name of the current input file if it contains the pattern
grep -q "$1" && echo "$mapreduce_map_input_file"
cat >/dev/null # consume the rest of the input to avoid "Stream closed" errors

Note that you have to read the whole input in order to avoid getting java.io.IOException: Stream closed exceptions.

Then issue the commands

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -Dstream.non.zero.exit.is.failure=false \
 -files get_filename_for_pattern.sh \
 -numReduceTasks 1 \
 -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc" \
 -reducer "uniq" \
 -input /apps/hdmi-technology/b_dps/real-time/* \
 -output /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc
hadoop fs -cat /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc/*

In newer distributions, mapred streaming should work instead of hadoop jar $HADOOP_HOME/hadoop-streaming.jar. In the latter case you have to set $HADOOP_HOME correctly for the jar to be found (or provide the full path directly).
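
With the newer entry point the same job would look roughly like this (a sketch, assuming your distribution ships the mapred streaming subcommand; the output path is hypothetical and must not already exist):

mapred streaming \
 -Dstream.non.zero.exit.is.failure=false \
 -files get_filename_for_pattern.sh \
 -numReduceTasks 1 \
 -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc" \
 -reducer "uniq" \
 -input /apps/hdmi-technology/b_dps/real-time/* \
 -output /tmp/files_matching_take2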

For simpler queries you don't even need a script; you can provide the command to the -mapper parameter directly. But for anything slightly complex it's preferable to use a script, because getting the escaping right can be a chore.

If you don't need a reduce phase, provide the symbolic NONE parameter to the -reducer option (or just use -numReduceTasks 0), as in the sketch below. But in your case it's useful to have a reduce phase so that the output is consolidated into a single file.
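
For example, an inline, map-only variant that just prints the matching lines (a sketch; the output directory name is hypothetical, and without a reduce phase it will hold one part file per mapper):

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -Dstream.non.zero.exit.is.failure=false \
 -numReduceTasks 0 \
 -mapper "grep bcd4bc3e1380a56108f486a4fffbc8dc" \
 -input /apps/hdmi-technology/b_dps/real-time/* \
 -output /tmp/matching_lines_map_only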

David Ongaro
0

To find all files with a given extension (e.g. .log) recursively inside an HDFS location:

hadoop fs -find hdfs_loc_path -name "*.log"
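
Combined with the content search from the accepted answer, a sketch that limits the grep to files whose names match a pattern (the pattern here is hypothetical; -iname is the case-insensitive variant of -name):

hadoop fs -find /apps/mdhi-technology/b_dps/real-time -iname "*consolidated*" | while read -r f; do
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done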
Laurenz Albe
Gourav Goutam
0
hadoop fs -find /apps/mdhi-technology/b_dps/real-time -name "*bcd4bc3e1380a56108f486a4fffbc8dc*"

hadoop fs -find /apps/mdhi-technology/b_dps/real-time -name "bcd4bc3e1380a56108f486a4fffbc8dc"

The first form matches any path whose name contains the string; the second matches the name exactly.
vikrant rana
D Xia
  • Just to add a clarification: This answer provides a solution for searching the string `bcd4bc3e1380a56108f486a4fffbc8dc` in the file path / file name, NOT in the contents of the file. Still useful though :). For the latter, refer to [phs' answer](https://stackoverflow.com/a/11697831/4528111) above. – gnsb Feb 11 '20 at 23:02