
I have a Job that consists of 3 steps. My input is encrypted JSON objects (one per line) stored in Amazon S3 (s3e://).

Job parameters:

job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

Some other important parameters:

mapred.min.split.size           0
mapred.job.reuse.jvm.num.tasks  -1
fs.s3.block.size             67108864
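
For reference, the driver for the second step is wired up roughly like this (Step2Driver, the command-line paths, and the omitted mapper/reducer classes are placeholders for our internal setup, not the actual code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class Step2Driver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Same settings as listed above
        conf.set("mapred.min.split.size", "0");
        conf.set("mapred.job.reuse.jvm.num.tasks", "-1");
        conf.set("fs.s3.block.size", "67108864");

        Job job = new Job(conf, "step-2");
        job.setJarByClass(Step2Driver.class);
        // setMapperClass / setReducerClass omitted; our framework sets them
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. output of step 1 on S3
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}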

The mapper of my second step almost always fails with an exception saying my JSON is incorrectly terminated. Upon further investigation I established that the JSON itself is correct in the input file; it is the value handed to the mapper by TextInputFormat that is incomplete and incorrectly terminated.

JsonException Value: {..."tag_action_code":"ndi","tag_value":"tribhutes
FATAL - JSON exception while handling exception 
org.json.JSONException: Unterminated string at character 390
    at org.json.JSONTokener.syntaxError(JSONTokener.java:410)
    at org.json.JSONTokener.nextString(JSONTokener.java:244)
    at org.json.JSONTokener.nextValue(JSONTokener.java:341)
    at org.json.JSONObject.<init>(JSONObject.java:190)
    at org.json.JSONObject.<init>(JSONObject.java:402)
    at com.amazon.associates.lib.ExtractItemMapReduce.putItemProcessingStateToExtracItemText(ExtractItemMapReduce.java:92)
    at com.amazon.associates.mapred.common.FatTireMapper$1.putItemProcessingState(FatTireMapper.java:51)
    at com.amazon.associates.mapred.common.FatTireMapReduceExecutor.handleException(FatTireMapReduceExecutor.java:35)
    at com.amazon.associates.mapred.common.FatTireMapperExecutor.execute(FatTireMapperExecutor.java:55)
    at com.amazon.associates.mapred.common.FatTireMapper.map(FatTireMapper.java:63)
    at com.amazon.associates.mapred.common.FatTireMapper.map(FatTireMapper.java:21)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:771)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:375)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:249)
Filename: part-00004 Split Details: Start: 0 Length: 593575152
Key: 536936059 Value: {..."tag_action_code":"ndi","tag_value":"tribhutes
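
For context, the parse step that throws is essentially the following (JsonLineMapper and the emitted key are illustrative only; the real FatTire mapper wraps this in executor and exception-handling plumbing):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONException;
import org.json.JSONObject;

// Illustrative stand-in for the real mapper: every value is expected to be one
// complete JSON object per line, so a value cut off mid-record fails in the
// JSONObject constructor exactly as in the stack trace above.
public class JsonLineMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        try {
            JSONObject record = new JSONObject(value.toString());
            context.write(new Text(record.optString("tag_action_code")), value);
        } catch (JSONException e) {
            // "Unterminated string" surfaces here when TextInputFormat hands in
            // only part of a record.
            throw new IOException("Bad JSON at byte offset " + key.get(), e);
        }
    }
}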

This happens pretty consistently. Oddly, sometimes the second step goes through and the third step fails instead.

My test data is pretty large: after successful completion of the first step (which always goes through) I get five 550-600 MB intermediate checkpoint files, which are the input to the second step.

In one run where the input to the second step was not encrypted, it succeeded.

I am pretty stuck. Any kind of pointers or help would be highly appreciated.


1 Answer


Is it possible with your encryption scheme that the encrypted version of a record could contain a newline character? If so, TextInputFormat will split on that newline and Hadoop will treat the single JSON object as two separate records. That's my guess as to what is happening here. Be sure to very carefully escape or remove newline characters from your records when using TextInputFormat, for example by encoding each encrypted record as a single newline-free line (sketched below).
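
One simple way to guarantee that is to Base64-encode each encrypted record before writing it out, since Base64 output is pure ASCII and never contains a raw newline. A rough sketch using commons-codec (the class and method names are just placeholders around whatever your encryption step already is):

import org.apache.commons.codec.binary.Base64;

// Placeholder helpers wrapped around your existing encryption/decryption step.
public class NewlineSafeCodec {
    // Base64 output is ASCII only, so an encoded record can never contain '\n'.
    public static String toSafeLine(byte[] cipherBytes) {
        return new String(Base64.encodeBase64(cipherBytes),
                java.nio.charset.Charset.forName("UTF-8"));
    }

    // Reverse it in the mapper before decrypting and parsing the JSON.
    public static byte[] fromSafeLine(String line) {
        return Base64.decodeBase64(line);
    }
}

Write toSafeLine(encrypt(json)) as one record per line in the first step's output, and in the second step's mapper call decrypt(fromSafeLine(value.toString())) before constructing the JSONObject.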

Joe K
  • I have verified that the data does not contain newlines where they should not be present. Also, if something were wrong with the data, why would it go through on certain runs, as I mentioned? – Kamesh Rao Yeduvakula Feb 01 '13 at 03:12