We have some JSON data stored in HDFS, and we are trying to use the elasticsearch-hadoop MapReduce integration to ingest it into Elasticsearch.
The code we are using is very simple (below):
import java.io.IOException;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.elasticsearch.hadoop.mr.EsOutputFormat;

public class TestOneFileJob extends Configured implements Tool {

    // Identity mapper: passes every JSON line through unchanged.
    public static class Tokenizer extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {

        @Override
        public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output,
                Reporter reporter) throws IOException {
            output.collect(key, value);
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        JobConf job = new JobConf(getConf(), TestOneFileJob.class);
        job.setJobName("demo.mapreduce");
        job.setInputFormat(TextInputFormat.class);
        job.setOutputFormat(EsOutputFormat.class);
        job.setMapperClass(Tokenizer.class);
        job.setSpeculativeExecution(false);
        FileInputFormat.setInputPaths(job, new Path(args[1]));
        // The target index is resolved per document from its index_name field.
        job.set("es.resource.write", "{index_name}/live_tweets");
        job.set("es.nodes", "els-test.css.org");
        // The map output value is already JSON, so it is passed through as-is.
        job.set("es.input.json", "yes");
        job.setMapOutputValueClass(Text.class);
        JobClient.runJob(job);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TestOneFileJob(), args));
    }
}
This code works fine, but we have two issues with it.

The first issue is the value of the es.resource.write property. Currently the index name is resolved per document from the index_name property of the JSON.
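For reference, that resolution works because each document carries a top-level index_name field, roughly like this (the values here are only illustrative):

{
    "index_name" : "some_index",
    "text" : "..."
}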
If the JSON contains an array property such as

{
    "tags" : [{"tag" : "tag1"}, {"tag" : "tag2"}]
}

how can we configure es.resource.write to take the first tag value, for example?
We tried {tags.tag} and {tags[0].tag}, but neither worked.
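One mapper-side workaround we are wondering about is to copy the first tag into a top-level field and point es.resource.write at that field instead, e.g. {first_tag}/live_tweets. A minimal sketch, assuming Jackson 2 (com.fasterxml.jackson) is on the classpath; the first_tag field name is our own invention, not part of the original data:

import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Sketch only: rewrites each document so tags[0].tag is also available as a
// top-level "first_tag" field, which es.resource.write could then reference.
public static class FirstTagMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output,
            Reporter reporter) throws IOException {
        ObjectNode doc = (ObjectNode) mapper.readTree(value.toString());
        JsonNode tags = doc.path("tags");
        if (tags.isArray() && tags.size() > 0) {
            // "first_tag" is a field we add purely for routing.
            doc.put("first_tag", tags.get(0).path("tag").asText());
        }
        output.collect(key, new Text(mapper.writeValueAsString(doc)));
    }
}

With that in place the job would use job.set("es.resource.write", "{first_tag}/live_tweets") instead of the index_name pattern, but is there a way to do this directly in the es.resource.write pattern?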
The other issue: how can we make the job index the JSON document into the indices named by both values of the tags property, i.e. once per tag?
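Is duplicating the document in the mapper the right way to do this? A rough sketch of what we mean (same Jackson imports as the sketch above; target_index is again a hypothetical field we would add only for routing):

// Sketch only: emit one copy of the document per entry in "tags", each with a
// top-level "target_index" field naming the index that copy should go to.
public static class PerTagMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    private final ObjectMapper mapper = new ObjectMapper();

    @Override
    public void map(LongWritable key, Text value, OutputCollector<LongWritable, Text> output,
            Reporter reporter) throws IOException {
        ObjectNode doc = (ObjectNode) mapper.readTree(value.toString());
        // path("tags") yields an empty iterator if "tags" is absent.
        for (JsonNode tag : doc.path("tags")) {
            ObjectNode copy = doc.deepCopy();
            copy.put("target_index", tag.path("tag").asText());
            output.collect(key, new Text(mapper.writeValueAsString(copy)));
        }
    }
}

The job would then set es.resource.write to "{target_index}/live_tweets" so each emitted copy is indexed under its own tag. Or is there a built-in way in elasticsearch-hadoop to do this?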