21

I am working with Hadoop and I need to find which of ~100 files in my Hadoop filesystem contain a certain string.

I can see the files I wish to search like this:

bash-3.00$ hadoop fs -ls /apps/mdhi-technology/b_dps/real-time

..which returns several entries like this:

-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_aa
-rw-r--r--   3 b_dps mdhi-technology 1073741824 2012-07-18 22:50 /apps/mdhi-technology/b_dps/HADOOP_consolidated_RT_v1x0_20120716_ab

How do I find which of these contains the string bcd4bc3e1380a56108f486a4fffbc8dc? Once I know, I can edit them manually.

AKIWEB

5 Answers

37

This is a Hadoop "filesystem", not a POSIX one, so try this:

# list the files, keep only the path column, then grep each file's contents
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
while read -r f
do
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done

This should work, but it is serial and so may be slow. If your cluster can take the heat, we can parallelize:

hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
  xargs -n 1 -I ^ -P 10 bash -c \
  "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^"

Notice the -P 10 option to xargs: this is how many files we will download and search in parallel. Start low and increase the number until you saturate disk I/O or network bandwidth, whichever is relevant in your configuration.
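
If you want a rough way to pick that number, here is a sketch (the values are hypothetical, and each run repeats the full search, so try it on a small sample directory first):

# time the search at increasing parallelism levels and watch for the knee
for p in 2 5 10 20; do
  echo "parallelism: $p"
  time (hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
    xargs -n 1 -I ^ -P "$p" bash -c \
    "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^")
done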

EDIT: Given that you're on SunOS (which is slightly brain-dead), try this:

# SunOS grep lacks -q, so redirect the output to /dev/null instead
hadoop fs -ls /apps/hdmi-technology/b_dps/real-time | awk '{print $8}' | \
while read f; do
  hadoop fs -cat "$f" | grep bcd4bc3e1380a56108f486a4fffbc8dc >/dev/null && echo "$f"
done
phs
  • The problem with this is that it's not a UNIX file system, it's a Hadoop file system. Whenever I try something like `bash-3.00$ cd /apps/hdmi-technology/b_dps/real-time` I get `bash: cd: /apps/hdmi-technology/b_dps/real-time: No such file or directory`. – arsenal Jul 28 '12 at 02:50
  • You're positive this directory exists? Can you mount it to a location, and then cd into it? – plast1K Jul 28 '12 at 02:53
  • I am not sure whether I can do this, as that folder has TBs of data inside. And how would I mount it to a location, by the way? – arsenal Jul 28 '12 at 02:56
  • Thanks phs for the solution; can I just copy-paste the above command into the bash prompt directly, or do I need to do something else? – arsenal Jul 28 '12 at 03:20
  • When I copy-pasted the first command you mentioned (the slow one, as you said), my screen keeps printing this continuously, one line after another: `grep: illegal option -- q Usage: grep -hblcnsviw pattern file . . . Usage: java FsShell [-cat ] grep: illegal option -- q Usage: grep -hblcnsviw pattern file . . . cat: Unable to write to output stream. grep: illegal option -- q` Any idea why, or is it working fine? – arsenal Jul 28 '12 at 03:22
  • Copy-paste should do it. Mind, I don't have your cluster in front of me to test. However based on http://hadoop.apache.org/common/docs/r0.18.3/hdfs_shell.html I feel this should work. – phs Jul 28 '12 at 03:23
  • And after trying the second command you gave me, I am getting `bash-3.00$ hadoop fs -ls /apps/hdmi-technology/b_apdpds/real-time | awk '{print $8}' | \ xargs -n 1 -I ^ -P 10 bash -c \ "hadoop fs -cat ^ | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo ^" bash: : command not found`; command not found, I don't know why. – arsenal Jul 28 '12 at 03:25
  • Good heavens. What operating system are you using? – phs Jul 28 '12 at 03:26
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/14553/discussion-between-phs-and-rjchar) – phs Jul 28 '12 at 03:26
  • Short answer: he's using SunOS, there's nothing intrinsically wrong with the approach – phs Jul 28 '12 at 03:29
  • BTW, you can also pass -C to the ls command to avoid having to call awk: hadoop fs -ls -C /apps/hdmi-technology/b_dps/real-time – Alvaro Mendez Jul 28 '21 at 22:54
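
Putting that last comment to use, a simplified variant of the loop above (a sketch; -ls -C is only available on reasonably recent Hadoop releases):

hadoop fs -ls -C /apps/hdmi-technology/b_dps/real-time | while read -r f; do
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done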
2

You are looking to apply the grep command to an HDFS folder:

hdfs dfs -cat /user/coupons/input/201807160000/* | grep -c null

Here cat streams every file matched by the glob in the folder, and grep -c counts the lines containing the string.
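
If you want the count per file rather than one total, a sketch along the same lines (assuming the same folder; -ls -C needs a recent Hadoop):

# print each file's path followed by its match count
hdfs dfs -ls -C /user/coupons/input/201807160000 | while read -r f; do
  echo "$f: $(hdfs dfs -cat "$f" | grep -c null)"
done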

Pang
Mukesh Gupta
0

Using hadoop fs -cat (or the more generic hadoop fs -text) might be feasible if you just have two 1 GB files. For 100 files, though, I would use the streaming API, because it can be used for ad-hoc queries without resorting to a full-fledged MapReduce job. E.g. in your case, create a script get_filename_for_pattern.sh:

#!/bin/bash
# emit the name of the current input file if it contains the pattern
grep -q "$1" && echo "$mapreduce_map_input_file"
cat >/dev/null # consume the rest of the input to avoid "Stream closed" errors

Note that you have to read the whole input in order to avoid getting java.io.IOException: Stream closed exceptions.

Then issue the commands

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -Dstream.non.zero.exit.is.failure=false \
 -files get_filename_for_pattern.sh \
 -numReduceTasks 1 \
 -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc" \
 -reducer "uniq" \
 -input /apps/hdmi-technology/b_dps/real-time/* \
 -output /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc
hadoop fs -cat /tmp/files_matching_bcd4bc3e1380a56108f486a4fffbc8dc/*

In newer distributions, mapred streaming should work instead of hadoop jar $HADOOP_HOME/hadoop-streaming.jar. In the latter case you have to set $HADOOP_HOME correctly for the jar to be found (or provide the full path directly).
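
With the newer entry point the same job would look roughly like this (a sketch, assuming your distribution ships the mapred streaming subcommand; the output path is hypothetical and must not already exist):

mapred streaming \
 -Dstream.non.zero.exit.is.failure=false \
 -files get_filename_for_pattern.sh \
 -numReduceTasks 1 \
 -mapper "get_filename_for_pattern.sh bcd4bc3e1380a56108f486a4fffbc8dc" \
 -reducer "uniq" \
 -input /apps/hdmi-technology/b_dps/real-time/* \
 -output /tmp/files_matching_take2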

For simpler queries you don't even need a script; you can provide the command to the -mapper parameter directly. But for anything slightly complex it's preferable to use a script, because getting the escaping right can be a chore.

If you don't need a reduce phase, provide the symbolic NONE parameter to the -reducer option (or just use -numReduceTasks 0), as in the sketch below. But in your case it's useful to have a reduce phase so that the output is consolidated into a single file.
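
For example, an inline, map-only variant that just prints the matching lines (a sketch; the output directory name is hypothetical, and without a reduce phase it will hold one part file per mapper):

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -Dstream.non.zero.exit.is.failure=false \
 -numReduceTasks 0 \
 -mapper "grep bcd4bc3e1380a56108f486a4fffbc8dc" \
 -input /apps/hdmi-technology/b_dps/real-time/* \
 -output /tmp/matching_lines_map_only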

David Ongaro
0

To find all files with a given extension (e.g. .log) recursively inside an HDFS location:

hadoop fs -find hdfs_loc_path -name "*.log"
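
Combined with the content search from the accepted answer, a sketch that limits the grep to files whose names match a pattern (the pattern here is hypothetical; -iname is the case-insensitive variant of -name):

hadoop fs -find /apps/mdhi-technology/b_dps/real-time -iname "*consolidated*" | while read -r f; do
  hadoop fs -cat "$f" | grep -q bcd4bc3e1380a56108f486a4fffbc8dc && echo "$f"
done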
Laurenz Albe
Gourav Goutam
0
hadoop fs -find /apps/mdhi-technology/b_dps/real-time -name "*bcd4bc3e1380a56108f486a4fffbc8dc*"

hadoop fs -find /apps/mdhi-technology/b_dps/real-time -name "bcd4bc3e1380a56108f486a4fffbc8dc"

The first form matches any path whose name contains the string; the second matches the name exactly.
vikrant rana
D Xia
  • Just to add a clarification: This answer provides a solution for searching the string `bcd4bc3e1380a56108f486a4fffbc8dc` in the file path / file name, NOT in the contents of the file. Still useful though :). For the latter, refer to [phs' answer](https://stackoverflow.com/a/11697831/4528111) above. – gnsb Feb 11 '20 at 23:02