I have configured Apache Nutch 1.x for web crawling. There is a requirement to add some extra information to the Solr document for each indexed domain; the configuration is a JSON file. I have developed the following code, updated the index-basic plugin, and tested it successfully in local mode. The code snippet is as follows:
this.enable_extra_domain = conf.getBoolean("domain.extraInfo.enable", false);
if (this.enable_extra_domain) {
    String domainExtraInfo = conf.get("domain.extraInfo.file", "conf/domain-extra.json");
    readDomainFile(domainExtraInfo);
    LOG.info("domain.extraInfo.enable is enabled. Using " + domainExtraInfo + " for input.");
} else {
    LOG.info("domain.extraInfo.enable is disabled.");
}
And the function that reads the file is as follows:
private void readDomainFile(String domainExtraInfo) {
    // Map of domain -> extra info parsed from the JSON file
    website_records = new HashMap<String, List<Object>>();
    JSONParser jsonParser = new JSONParser();
    try (FileReader reader = new FileReader(domainExtraInfo)) {
        Object obj = jsonParser.parse(reader);
        JSONArray domainList = (JSONArray) obj;
        domainList.forEach(domain -> parseDomainObject((JSONObject) domain));
    } catch (IOException | ParseException e) {
        LOG.error("Failed to read domain extra-info file " + domainExtraInfo, e);
    }
}
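The JSON file is an array of per-domain objects. The body of parseDomainObject is not shown above, so the keys in this sample are illustrative assumptions about the shape, not the exact fields:

```json
[
  { "domain": "example.com", "extra": ["value1", "value2"] },
  { "domain": "example.org", "extra": ["value3"] }
]
```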
This code works in local mode, but when I run Nutch from the .job file on EMR (or any other Hadoop cluster) I get a java.io.FileNotFoundException. Where is the problem? In local mode my new configuration file sits in the conf folder, while in deploy mode it is packed into the .job file.
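As far as I can tell, the difference comes down to where new FileReader(path) resolves the path. A minimal standalone sketch of the two outcomes (the class name and temp-directory setup are mine, not Nutch code):

```java
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ResourceLookupDemo {

    // Attempt to open a path exactly as the plugin does (new FileReader(path))
    // and report the outcome instead of throwing.
    static String tryRead(String path) {
        try (FileReader reader = new FileReader(path)) {
            return "ok";
        } catch (FileNotFoundException e) {
            return "FileNotFoundException";
        } catch (IOException e) {
            return "IOException";
        }
    }

    public static void main(String[] args) throws IOException {
        // Local mode: the relative path resolves against the working
        // directory, where conf/domain-extra.json really exists on disk.
        Path conf = Files.createTempDirectory("conf");
        Path json = conf.resolve("domain-extra.json");
        Files.write(json, "[]".getBytes());
        System.out.println("local mode: " + tryRead(json.toString()));

        // Deploy mode: the same relative path does not exist on the task
        // node's local filesystem (the file only lives inside the .job jar),
        // so the FileReader constructor fails.
        System.out.println("deploy mode: " + tryRead("conf/domain-extra.json"));
    }
}
```

In other words, the filesystem lookup succeeds only where the file is physically present, which matches what I observe between local and deploy runs.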