I have a job that consists of 3 steps. My input is encrypted JSON objects (one per line) stored in Amazon S3 and accessed via the s3e:// scheme.
Job parameters:
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
Some other important parameters:
mapred.min.split.size 0
mapred.job.reuse.jvm.num.tasks -1
fs.s3.block.size 67108864
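In case the surrounding setup matters, here is a minimal sketch of a driver that applies these settings. Everything beyond the parameters already listed above (the class name, job name, and I/O paths) is a placeholder, not my actual code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Placeholder driver class; only the conf/job settings below are my real values.
public class StepTwoDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The three properties listed above, set programmatically.
        conf.set("mapred.min.split.size", "0");
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        conf.set("fs.s3.block.size", "67108864"); // 64 MB

        Job job = new Job(conf, "step-two");
        job.setJarByClass(StepTwoDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}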
I am facing an issue where the mapper of my second step almost always fails with an exception saying that my JSON is incorrectly terminated. On further investigation I established that the JSON in the input file itself is correct; it is the mapper that reads an incomplete value. The value handed to the mapper by TextInputFormat is truncated and therefore incorrectly terminated (see the diagnostic sketch after the split details below):
JsonException Value: {..."tag_action_code":"ndi","tag_value":"tribhutes
FATAL - JSON exception while handling exception
org.json.JSONException: Unterminated string at character 390
at org.json.JSONTokener.syntaxError(JSONTokener.java:410)
at org.json.JSONTokener.nextString(JSONTokener.java:244)
at org.json.JSONTokener.nextValue(JSONTokener.java:341)
at org.json.JSONObject.<init>(JSONObject.java:190)
at org.json.JSONObject.<init>(JSONObject.java:402)
at com.amazon.associates.lib.ExtractItemMapReduce.putItemProcessingStateToExtracItemText(ExtractItemMapReduce.java:92)
at com.amazon.associates.mapred.common.FatTireMapper$1.putItemProcessingState(FatTireMapper.java:51)
at com.amazon.associates.mapred.common.FatTireMapReduceExecutor.handleException(FatTireMapReduceExecutor.java:35)
at com.amazon.associates.mapred.common.FatTireMapperExecutor.execute(FatTireMapperExecutor.java:55)
at com.amazon.associates.mapred.common.FatTireMapper.map(FatTireMapper.java:63)
at com.amazon.associates.mapred.common.FatTireMapper.map(FatTireMapper.java:21)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
at org.apache.hadoop.mapred.Child.main(Child.java:249)
Filename: part-00004 Split Details: Start: 0 Length: 593575152
Key: 536936059 Value: {..."tag_action_code":"ndi","tag_value":"tribhutes
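For reference, split details like the ones above can be pulled from the mapper context. Below is a minimal sketch of that kind of instrumentation; the class and the error handling are hypothetical, but the Hadoop and org.json calls are standard:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.json.JSONException;
import org.json.JSONObject;

// Hypothetical diagnostic mapper: on a parse failure, log the split
// boundaries and the raw line so the truncation point is visible.
public class DiagnosticJsonMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            JSONObject json = new JSONObject(value.toString());
            // ... normal record processing would go here ...
        } catch (JSONException e) {
            FileSplit split = (FileSplit) context.getInputSplit();
            System.err.println("Filename: " + split.getPath().getName()
                + " Split Details: Start: " + split.getStart()
                + " Length: " + split.getLength());
            System.err.println("Key: " + key.get() + " Value: " + value);
            throw new IOException("Truncated JSON record", e);
        }
    }
}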
This is happening pretty consistently. The odd part is that sometimes the second step goes through, and the job then fails at the third step instead.
My test data is pretty huge. After successful completion of the first step (which always goes through), I get five intermediate checkpoint files of 550-600 MB each, which are the input to the second step.
In one of the tries, where the input to the second step was not encrypted, the step succeeded.
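Since the unencrypted run succeeded, my suspicion is that the framework places split boundaries at raw byte offsets that do not line up with line boundaries in the decrypted stream. One workaround I am considering (untested sketch below) is to make the input non-splittable, so each file is read end-to-end by a single mapper:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Disable splitting: one mapper per file, no mid-file split points.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;
    }
}

Step two would then use job.setInputFormatClass(NonSplittableTextInputFormat.class). Even if that works, though, I would still like to understand the root cause.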
I am pretty stuck. Any pointers or help would be highly appreciated.