Hadoop MapReduce example for string transformation

Question

I have a large number of strings in a text file and need to transform them with the following algorithm: convert each string to lowercase and remove all spaces.

Can you give me an example of a Hadoop MapReduce function that implements this algorithm?

Thank you.

I've found some examples that show how to aggregate values over keys, for example counting the words in an input text. I'm wondering whether MapReduce can be used to transform input strings instead of computing aggregate values. Is this normal practice, or is MapReduce not the best choice for this kind of task? I'm not asking anyone to do the job for me, but I would like a simple example and confirmation that I'm heading in the right direction — Alex Zhulin, Apr 24 '16 at 09:16

score 0 · Answer 1 · answered Apr 25 '16 at 18:03

I tried the code below, but it writes all of the output on a single line.

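import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class toUpper {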

public static class textMapper extends Mapper<LongWritable,Text,NullWritable,Text>
{
    Text outvalue=new Text();

    public void map(LongWritable key,Text values,Context context) throws IOException, InterruptedException
    {
        String token;
        StringBuffer br=new StringBuffer();
        StringTokenizer st=new StringTokenizer(values.toString());
        while(st.hasMoreTokens())
        {
            token=st.nextToken();
            br.append(token.toUpperCase()); 
        }
        st=null;
        outvalue.set(br.toString());
        context.write(NullWritable.get(), outvalue);
        br=null;

    }
}
public static class textReduce extends Reducer<NullWritable,Text,NullWritable,Text>
{
    Text outvale=new Text();
    public void reduce(NullWritable key,Iterable<Text> values,Context context) throws IOException, InterruptedException
    {
        StringBuffer br=new StringBuffer();
        for(Text st:values)
        {
            br.append(st.toString());
        }
        outvale.set(br.toString());
        context.write(NullWritable.get(), outvale);
    }
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
    Configuration conf=new Configuration();
    @SuppressWarnings("deprecation")
    Job job=new Job(conf,"touipprr");

    job.setJarByClass(toUpper.class);
    job.setMapperClass(textMapper.class);
    job.setReducerClass(textReduce.class);

    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // exit with 0 when the job succeeds and 1 when it fails
    System.exit(job.waitForCompletion(true)?0:1);
}

}
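
A likely reason for the single-line output is that every record is written under the same NullWritable key, so a single reduce call receives all of the values and concatenates them. The sketch below mimics that in plain Java, without Hadoop, next to a map-only variant that keeps one output line per input line. The class and method names are hypothetical, and the transform uses the lowercase-and-remove-spaces rule from the question rather than toUpperCase().

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class LineTransformSketch {
    // Per-record transformation from the question: lowercase, then drop every space.
    static String transform(String line) {
        return line.toLowerCase(Locale.ROOT).replace(" ", "");
    }

    // Mimics a map-only job: each input line yields exactly one output line.
    static List<String> mapOnly(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String line : lines) {
            out.add(transform(line));
        }
        return out;
    }

    // Mimics a reducer that sees all values under one key and appends them
    // into a single record, which is why everything ends up on one line.
    static String reduceUnderSingleKey(List<String> mapped) {
        StringBuilder sb = new StringBuilder();
        for (String value : mapped) {
            sb.append(value);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Hello World", "Map Reduce");
        System.out.println(mapOnly(input));                       // [helloworld, mapreduce]
        System.out.println(reduceUnderSingleKey(mapOnly(input))); // helloworldmapreduce
    }
}
```

In a real job, the map-only behaviour could presumably be obtained by dropping the reducer and calling job.setNumReduceTasks(0), so that the mapper output is written directly, one record per input line.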

score 0 · Answer 2 · answered Apr 26 '16 at 06:39

When I was first experimenting with MapReduce, I had a similar thought: there must be some technique for modifying every word in a record and doing all of the cleaning work.
To recap the MapReduce model: a map function processes each incoming record, typically splitting it into tokens on some delimiter (you probably know your delimiters better than I do). Let us approach your problem step by step.
This is what I would probably have done when I was new to MapReduce:

> I would probably write a map() method that splits the lines for me
> I would then run out of options, write a reduce function,
 and somehow achieve my objective

That approach is perfectly fine, but there is a better technique that lets you decide whether you need a reduce function at all. It gives you more options, so you can focus on achieving your objective and on optimizing your code.

For situations like yours, one class came to my rescue: ChainMapper. How does ChainMapper work? Here are a few points to consider:

-> The first mapper reads the file from HDFS, splits each line on the delimiter, and writes the tokens to the context.
-> The second mapper receives the output of the first mapper; here you can perform whatever string operations your business requires, such as encrypting the text or converting it to upper or lower case.
-> The transformed string produced by the second mapper is written to the context again.
-> If you then need a reducer for an aggregation task such as a word count, add one.
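
The data flow in the points above can be mimicked in plain Java, without Hadoop, purely as an illustration: one stage splits a line into tokens, a second stage lowercases each token, and an optional final step counts them. All names here are hypothetical, and ordinary collections stand in for the Hadoop context.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

class ChainSketch {
    // Stage 1 (stands in for the first mapper): split a line into whitespace-delimited tokens.
    static List<String> split(String line) {
        List<String> tokens = new ArrayList<>();
        StringTokenizer st = new StringTokenizer(line);
        while (st.hasMoreTokens()) {
            tokens.add(st.nextToken());
        }
        return tokens;
    }

    // Stage 2 (stands in for the second mapper): transform one token.
    static String lower(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Optional aggregation (stands in for the reducer): count each transformed token.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            for (String token : split(line)) {
                counts.merge(lower(token), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("Hadoop hadoop", "MAP reduce map");
        System.out.println(count(input)); // {hadoop=2, map=2, reduce=1}
    }
}
```

If only the transformed strings are needed, the counting step can simply be left out, which corresponds to running the chain without a reducer.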

I have a piece of code that may not be efficient (some may even find it horrible), but it should serve your purpose while you are experimenting with MapReduce.

SplitMapper.java

public class SplitMapper extends Mapper<LongWritable,Text,Text,IntWritable>{
    @Override
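    // The key type must match the Mapper's first type parameter (LongWritable), otherwise @Override does not compile
    public void map(LongWritable key,Text value,Context context)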
                                    throws IOException,InterruptedException{
        StringTokenizer xs=new StringTokenizer(value.toString());
        IntWritable dummyValue=new IntWritable(1);
        while(xs.hasMoreElements()){
            String content=(String)xs.nextElement();
            context.write(new Text(content),dummyValue);
        }
    }
}

LowerCaseMapper.java

public class LowerCaseMapper extends Mapper<Text,IntWritable,Text,IntWritable>{
    @Override
    public void map(Text key,IntWritable value,Context context) 
                                        throws IOException,InterruptedException{
        String val=key.toString().toLowerCase();
        Text newKey=new Text(val);
        // write through the context instance passed to map(), not the Context class
        context.write(newKey,value);
    }
}

Since I am performing a word count here, I need a reducer.

ChainMapReducer.java

public class ChainMapReducer extends Reducer<Text,IntWritable,Text,IntWritable>{
    @Override
    public void reduce(Text key,Iterable<IntWritable> value,Context context)
                                throws IOException,InterruptedException{
        int sum=0;
        for(IntWritable v:value){
            sum+=v.get();   // read the current element, not the Iterable
        }
        context.write(key,new IntWritable(sum));
    }
}

To implement the ChainMapper approach successfully, you must pay attention to every detail of the driver class.

DriverClass.java

public class DriverClass extends Configured implements Tool{
    @Override
    public int run(String args[]) throws IOException,InterruptedException,ClassNotFoundException{
        //use the configuration that ToolRunner passed in
        Configuration cf=getConf();
        Job j=Job.getInstance(cf);
        //configuration for the first mapper
        Configuration splitMapConfig=new Configuration(false);
        ChainMapper.addMapper(j,SplitMapper.class,LongWritable.class,Text.class,Text.class,IntWritable.class,splitMapConfig);
        //configuration for the second mapper
        Configuration lowerCaseConfig=new Configuration(false);
        ChainMapper.addMapper(j,LowerCaseMapper.class,Text.class,IntWritable.class,Text.class,IntWritable.class,lowerCaseConfig);

        j.setJarByClass(DriverClass.class);
        //summing counts is associative, so the reducer can also act as the combiner
        j.setCombinerClass(ChainMapReducer.class);
        j.setReducerClass(ChainMapReducer.class);
        j.setOutputKeyClass(Text.class);
        j.setOutputValueClass(IntWritable.class);

        Path outputPath=new Path(args[1]);
        FileInputFormat.addInputPath(j,new Path(args[0]));
        FileOutputFormat.setOutputPath(j,outputPath);
        //delete any previous output so the job does not fail on an existing directory
        outputPath.getFileSystem(cf).delete(outputPath,true);
        //submit the job and report success (0) or failure (1)
        return j.waitForCompletion(true)?0:1;
    }
    public static void main(String args[]) throws Exception{
        int res=ToolRunner.run(new Configuration(),new DriverClass(),args);
        System.exit(res);
    }
}

The driver class is fairly self-explanatory; the one thing to note is the signature ChainMapper.addMapper(<job-object>,<Map-ClassName>,<input key and value types>,<output key and value types>,<configuration-for-the-concerned-mapper>).

I hope this solution serves your purpose. Please let me know if any issues arise when you try to implement it.
Thank you!

You can remove unwanted leading and trailing spaces with the trim() method in the second mapper, LowerCaseMapper.java. — Aniruddha Sinha, Apr 26 '16 at 06:53
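
Since the question asks for all spaces to be removed, it may help to see how trim() differs from stripping whitespace everywhere in the string. The following is a small plain-Java sketch; the class and method names are hypothetical, and replaceAll with the regular expression \s+ is one possible way to drop interior whitespace, not necessarily what the commenter had in mind.

```java
class TrimSketch {
    // trim() only removes whitespace at the start and end of the string.
    static String trimmed(String s) {
        return s.trim();
    }

    // Removes every run of whitespace, including runs inside the string.
    static String withoutWhitespace(String s) {
        return s.replaceAll("\\s+", "");
    }

    public static void main(String[] args) {
        String raw = "  Hello World  ";
        System.out.println("[" + trimmed(raw) + "]");           // [Hello World]
        System.out.println("[" + withoutWhitespace(raw) + "]"); // [HelloWorld]
    }
}
```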
