We have some JSON data stored in HDFS, and we are trying to use the elasticsearch-hadoop MapReduce integration to ingest it into Elasticsearch.
The code we are using is very simple (below):
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class TestOneFileJob extends Configured implements Tool {

    // Identity mapper: passes every JSON line through unchanged.
    public static class Tokenizer extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output,
                Reporter reporter) throws IOException {
            output.collect(key, value);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), TestOneFileJob.class);
        job.setJobName("demo.mapreduce");
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(EsOutputFormat.class);
        job.setMapperClass(Tokenizer.class);
        job.setSpeculativeExecution(false);
        FileInputFormat.setInputPaths(job, new Path(args[1]));
        // The target index is resolved per document from its index_name field.
        job.set("es.resource.write", "{index_name}/live_tweets");
        job.set("es.nodes", "els-test.css.org");
        // The map output value is already JSON, so it is passed through as-is.
        job.set("es.input.json", "yes");
        job.setMapOutputValueClass(Text.class);
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TestOneFileJob(), args));
    }
}
This code works fine, but we have two issues with it.

The first issue is the value of the es.resource.write property. Currently the index name is resolved per document from the index_name property of the JSON.
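For reference, that resolution works because each document carries a top-level index_name field, roughly like this (the values here are only illustrative):

{
    "index_name" : "some_index",
    "text" : "..."
}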
If the JSON contains an array property such as

{
    "tags" : [{"tag" : "tag1"}, {"tag" : "tag2"}]
}

how can we configure es.resource.write to take the first tag value, for example?
We tried {tags.tag} and {tags[0].tag}, but neither worked.
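One mapper-side workaround we are wondering about is to copy the first tag into a top-level field and point es.resource.write at that field instead, e.g. {first_tag}/live_tweets. A minimal sketch, assuming Jackson 2 (com.fasterxml.jackson) is on the classpath; the first_tag field name is our own invention, not part of the original data:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Sketch only: rewrites each document so tags[0].tag is also available as a
// top-level "first_tag" field, which es.resource.write could then reference.
public static class FirstTagMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output,
            Reporter reporter) throws IOException {
        ObjectNode doc = (ObjectNode) mapper.readTree(value.toString());
        JsonNode tags = doc.path("tags");
        if (tags.isArray() && tags.size() > 0) {
            // "first_tag" is a field we add purely for routing.
            doc.put("first_tag", tags.get(0).path("tag").asText());
        }
        output.collect(key, new Text(mapper.writeValueAsString(doc)));
    }
}

With that in place the job would use job.set("es.resource.write", "{first_tag}/live_tweets") instead of the index_name pattern, but is there a way to do this directly in the es.resource.write pattern?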
The other issue: how can we make the job index the JSON document into the indices named by both values of the tags property, i.e. once per tag?
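Is duplicating the document in the mapper the right way to do this? A rough sketch of what we mean (same Jackson imports as the sketch above; target_index is again a hypothetical field we would add only for routing):

// Sketch only: emit one copy of the document per entry in "tags", each with a
// top-level "target_index" field naming the index that copy should go to.
public static class PerTagMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output,
            Reporter reporter) throws IOException {
        ObjectNode doc = (ObjectNode) mapper.readTree(value.toString());
        // path("tags") yields an empty iterator if "tags" is absent.
        for (JsonNode tag : doc.path("tags")) {
            ObjectNode copy = doc.deepCopy();
            copy.put("target_index", tag.path("tag").asText());
            output.collect(key, new Text(mapper.writeValueAsString(copy)));
        }
    }
}

The job would then set es.resource.write to "{target_index}/live_tweets" so each emitted copy is indexed under its own tag. Or is there a built-in way in elasticsearch-hadoop to do this?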