Extracting rows containing a specific value using MapReduce and Hadoop

Question

I'm new to developing MapReduce functions. Suppose I have a CSV file containing four columns of data.

For example:

101,87,65,67  
102,43,45,40  
103,23,56,34  
104,65,55,40  
105,87,96,40

Now I want to extract, say,

40 102  
40 104  
40 105

since those rows contain 40 in the fourth column.

How do I write the map and reduce functions?

I have edited my answer to address your new requirements. Keep me updated if you need assistance. – Serhiy, May 05 '16 at 18:40

score 4 · Accepted Answer · edited May 23 '17 at 12:24

The WordCount example closely resembles what you are trying to achieve. Instead of emitting a count for each word, add a condition that checks whether the tokenized string has the required value, and write to the context only in that case. This works because the Mapper receives each line of the CSV separately.

The Reducer then receives the values already grouped by key. Instead of IntWritable as the output value type, you can use NullWritable, so that your job outputs only the keys. You also do not need the loop in the Reducer, since you only want to output the keys.

I am not providing any code in my answer, since you would learn nothing from that. Work your way from these recommendations.

EDIT: since you modified your question with a request for the Reducer, here are some tips on how to achieve what you want.

One way to achieve the desired result: in the Mapper, after splitting (or tokenizing) the line, write column 3 to the context as the key and column 0 as the value (columns are counted from 0, so column 3 is the fourth one). Since you do not need any kind of aggregation, your Reducer can simply write out the keys and values produced by the Mappers (yep, your Reducer code will end up with a single line of code). You can check one of my previous answers; the figure there explains quite well what the Map and Reduce phases are doing.
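As a rough illustration of the Mapper/Reducer split described in the EDIT, here is a minimal sketch in plain Java. It deliberately uses no Hadoop classes so that it runs on its own: the class name `ColumnFilterSketch`, the `TARGET` constant, and the `map`, `shuffle`, `reduce`, and `run` helpers are all hypothetical names, and the lists stand in for what `context.write` would receive in a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ColumnFilterSketch {
    // Hypothetical constant: the value to look for in the fourth column.
    static final String TARGET = "40";

    // Map step: called once per CSV line. Emits (column 3, column 0)
    // only when column 3 matches; in Hadoop this would be context.write.
    static void map(String line, List<String[]> emitted) {
        String[] columns = line.trim().split(",");
        if (columns.length > 3 && columns[3].trim().equals(TARGET)) {
            emitted.add(new String[] {columns[3].trim(), columns[0].trim()});
        }
    }

    // Stand-in for the grouping by key that the framework performs
    // between the map and reduce phases.
    static Map<String, List<String>> shuffle(List<String[]> emitted) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : emitted) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce step: no aggregation, every (key, value) pair is passed through.
    static void reduce(String key, List<String> values, List<String> output) {
        for (String value : values) {
            output.add(key + " " + value);
        }
    }

    // Runs the three steps in order over the given input lines.
    static List<String> run(List<String> lines) {
        List<String[]> emitted = new ArrayList<>();
        for (String line : lines) {
            map(line, emitted);
        }
        List<String> output = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : shuffle(emitted).entrySet()) {
            reduce(entry.getKey(), entry.getValue(), output);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "101,87,65,67", "102,43,45,40", "103,23,56,34",
            "104,65,55,40", "105,87,96,40");
        // Prints "40 102", "40 104" and "40 105", one per line.
        run(lines).forEach(System.out::println);
    }
}
```

In this sketch the values for a key keep their input order; a real Hadoop job makes no such promise about the order of values within a key.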

edited May 23 '17 at 12:24 by Community

answered May 04 '16 at 08:02 by Serhiy

pos = 0; while (tokens.hasMoreTokens()) { word.set(tokens.nextToken()); pos = pos + 1; if (pos == 2) sale = word; if (pos == 3 && word.equals(40)) context.write(word, sale); } Can it be done like this in the Mapper function, using a "pos" variable? (I'm using a string tokenizer.) – user6119874 May 05 '16 at 05:27

An upvote for encouraging the OP to work with you to produce a solution. – halfer May 05 '16 at 15:38

@user6119874 I would suggest using String[] array = inputValue.split(","), so you will end up with an array of smaller strings. You just need to read array[3] to get the value of the 4th column (you will not need to iterate explicitly). For the reducer, I will edit my answer. – Serhiy May 05 '16 at 16:16
@user6119874 You should post a new question for this, since the problem is completely different, with a detailed explanation of how you are submitting the job. But at first glance, your application JAR is missing from your Hadoop classpath. Check this http://stackoverflow.com/questions/26748811/setting-external-jars-to-hadoop-classpath you might find some hints there ;) – Serhiy May 06 '16 at 22:07
I was able to run the program now. But there is a small problem in the output ... Here is a link to that problem: http://stackoverflow.com/q/37093158/6119874 – user6119874 May 07 '16 at 20:07
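To make the split(",") suggestion from the comments concrete, here is a small plain-Java sketch of the fourth-column check. The class `FourthColumnCheck` and the method `fourthColumnIs` are hypothetical names used only for illustration. It also shows two details that are easy to miss in the tokenizer snippet above: a counter incremented after each token reaches 4, not 3, on the fourth column, and comparing text with the number 40 through equals does not match (shown here for String; Hadoop's Text is assumed to behave the same way).

```java
class FourthColumnCheck {
    // Returns true when the fourth CSV column (array index 3) equals the expected text.
    static boolean fourthColumnIs(String line, String expected) {
        String[] array = line.trim().split(",");
        return array.length > 3 && array[3].trim().equals(expected);
    }

    public static void main(String[] args) {
        System.out.println(fourthColumnIs("102,43,45,40", "40")); // prints true
        System.out.println(fourthColumnIs("101,87,65,67", "40")); // prints false

        // equals() between a String and an Integer is always false,
        // so the comparison has to be made against the text "40".
        System.out.println("40".equals(Integer.valueOf(40)));     // prints false
    }
}
```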
