
I want to search HDFS and list the files that contain exactly my search string. My second requirement: is there any way to search for a range of values in a file in HDFS?

Suppose the file below contains the following data:

/user/hadoop/test.txt

101,abc
102,def
103,ghi
104,aaa
105,bbb

Is there any way to search with the range [101-104] so that it returns the files that contain data in that range?

OneCricketeer
Rajesh
  • 1
    You've only listed one file here, but heard of MapReduce? That's about the only way to search your files – OneCricketeer May 26 '17 at 06:17
  • @cricket_007 Thanks for your quick response. I listed one sample file here only as an example, but there are a number of similar files in my HDFS. Do you mean that MapReduce is the only way to fulfill my requirement? Secondly, when I use hdfs dfs -ls -R / | grep [search_term] to search the files, it lists every file that contains a single character of the search term rather than the entire search string. – Rajesh May 26 '17 at 06:34
  • You can't use `hdfs dfs` to recursively search all files. I literally mean MapReduce programming – OneCricketeer May 26 '17 at 06:37
  • 1
    Possible duplicate : https://stackoverflow.com/questions/11697810/grep-across-multiple-files-in-hadoop-filesystem – philantrovert May 26 '17 at 06:41
  • 1
    Your grep command there only searches filenames, not the file content for your number data – OneCricketeer May 26 '17 at 06:43
  • Possible duplicate of [Grep across multiple files in Hadoop Filesystem](https://stackoverflow.com/questions/11697810/grep-across-multiple-files-in-hadoop-filesystem) – vefthym May 26 '17 at 07:07

1 Answer


To display the names of files that contain a pattern, loop through the files in an HDFS directory, for example /user/hadoop/:

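# List regular files only (lines starting with "-"); field 8 of the listing is the path.
# This skips the "Found N items" header and any subdirectories.
for file in $(hdfs dfs -ls /user/hadoop/ | awk '/^-/ {print $8}'); do
  # Count the distinct IDs in 101-104 that start a line and are followed by a comma.
  count=$(hdfs dfs -cat "$file" | grep -oE '^10[1-4],' | sort -u | wc -l)
  # Print the file name only if all four IDs are present.
  if [ "$count" -eq 4 ]; then
    echo "$file"
  fi
done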
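
Counting regex matches can mislead when an ID repeats or when the pattern also matches inside a longer number such as 1102. A minimal sketch of a stricter check follows, assuming comma-separated lines with the ID in the first column. `has_all_ids` is a hypothetical helper name, not part of Hadoop.

```shell
# Hypothetical helper: reads comma-separated lines on stdin and succeeds
# (exit status 0) only if every integer from $1 to $2 appears in column 1.
has_all_ids() {
  awk -F',' -v lo="$1" -v hi="$2" '
    # Record each first-column value that is a whole number inside the range.
    $1 ~ /^[0-9]+$/ && $1 + 0 >= lo + 0 && $1 + 0 <= hi + 0 { seen[$1 + 0] = 1 }
    END {
      # Fail as soon as one value of the range was never seen.
      for (i = lo + 0; i <= hi + 0; i++) if (!(i in seen)) exit 1
      exit 0
    }
  '
}

# Usage with HDFS (assumes the hdfs client is on PATH):
#   hdfs dfs -cat /user/hadoop/test.txt | has_all_ids 101 104 && echo /user/hadoop/test.txt
```

Because the helper only reads standard input, it can be tried on a local file with `cat` before being pointed at HDFS.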

For the second requirement, searching for a range of values in a file in HDFS, a shell pipeline is enough:

hdfs dfs -cat /user/hadoop/test.txt | grep -E '^10[1-4],'

Output:

101,abc
102,def
103,ghi
104,aaa

Or print only the first column when it falls in the range:

hdfs dfs -cat /user/hadoop/test.txt | cut -d, -f1 | grep -xE '10[1-4]'

Output:

101
102
103
104
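
A character class such as `10[1-4]` only works when the range differs in a single digit. For an arbitrary range, say 98 to 1204, a numeric comparison is easier to get right than a regular expression. A minimal sketch follows, assuming comma-separated lines with a numeric first column. `filter_range` is a hypothetical helper name.

```shell
# Hypothetical helper: prints the lines from stdin whose first
# comma-separated field is a whole number between $1 and $2 inclusive.
filter_range() {
  awk -F',' -v lo="$1" -v hi="$2" \
    '$1 ~ /^[0-9]+$/ && $1 + 0 >= lo + 0 && $1 + 0 <= hi + 0'
}

# Usage with HDFS (assumes the hdfs client is on PATH):
#   hdfs dfs -cat /user/hadoop/test.txt | filter_range 101 104
```

Adding `+ 0` forces awk to compare the values as numbers, so 1010 is correctly treated as larger than 104 instead of being compared as text.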
sumitya