
This is my code snippet:

@Override
    protected RecordWriter<String, String> getBaseRecordWriter(
            FileSystem fs, JobConf job, String name, Progressable arg3)
                    throws IOException {
        Path file2 = FileOutputFormat.getOutputPath(job);
        String path = file2.toUri().getPath() + File.separator + name;
        // Open the local file in append mode, wrap it in a 100 MB buffer,
        // and hand the stream to FSDataOutputStream (with null statistics).
        FSDataOutputStream fileOut = new FSDataOutputStream(
                new BufferedOutputStream(new FileOutputStream(path, true), 104857600), null);
        return new LineRecordWriter<String, String>(fileOut, "\t");
    }

I am using Spark 1.6.1. In my code I call the saveAsHadoopFile() method, for which I wrote an OutputFormat class derived from org.apache.hadoop.mapred.lib.MultipleTextOutputFormat, overriding the method above.
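
For reference, the driver-side call looks roughly like the sketch below; the RDD contents, output path, and the MyMultipleTextOutputFormat class name are simplified placeholders, not my actual code.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaPairRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    import scala.Tuple2;

    public class SaveExample {
        public static void main(String[] args) {
            JavaSparkContext sc = new JavaSparkContext(new SparkConf().setAppName("save-example"));
            JavaPairRDD<String, String> records = sc.parallelizePairs(
                    Arrays.asList(new Tuple2<>("k1", "v1"), new Tuple2<>("k2", "v2")));

            // MyMultipleTextOutputFormat is the custom OutputFormat described above (hypothetical name).
            records.saveAsHadoopFile(
                    "/data/output",      // output directory (placeholder)
                    String.class,        // key class
                    String.class,        // value class
                    MyMultipleTextOutputFormat.class);

            sc.stop();
        }
    }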

On the cluster it writes corrupt records to the output files. I think it is because of the BufferedOutputStream in

FSDataOutputStream fileOut = new FSDataOutputStream(
                 new BufferedOutputStream(new FileOutputStream(path, true), 104857600), null);

Is there an alternative to BufferedOutputStream, since it writes out its contents as soon as the buffer gets full?
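
One possible alternative, sketched here under the assumption that the FileSystem handle passed into getBaseRecordWriter points at the intended target, is to let fs.create() produce the FSDataOutputStream directly instead of wrapping a raw local FileOutputStream. Note that fs.create() overwrites an existing file rather than appending, whereas the original code opens the file in append mode:

    @Override
    protected RecordWriter<String, String> getBaseRecordWriter(
            FileSystem fs, JobConf job, String name, Progressable progress)
                    throws IOException {
        // Build the output file path under the job's output directory.
        Path file = new Path(FileOutputFormat.getOutputPath(job), name);
        // Let the Hadoop FileSystem create the stream; it already returns an
        // FSDataOutputStream, so no manual BufferedOutputStream wrapping is needed.
        FSDataOutputStream fileOut = fs.create(file, progress);
        return new LineRecordWriter<String, String>(fileOut, "\t");
    }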

Note: I have updated the code. Sorry for the inconvenience.

Harshal Zope
    There is no `BufferedOutputStream` in your code, let alone any evidence for your belief that it is causing data corruption. Unclear what you're asking, and probable XY problem. – user207421 Oct 20 '16 at 08:30
  • The only corruption a BufferedOutputStream can cause is a truncated file, but only if you fail to flush() or close() it. – Peter Lawrey Oct 20 '16 at 08:44
  • I updated the code. I was trying different combinations hence got the wrong one. – Harshal Zope Oct 20 '16 at 09:34
  • The fact that `BufferedOutputStream` 'writes as soon as the buffer gets full' has no bearing on data corruption. You still have not explained your reasoning. – user207421 Oct 20 '16 at 09:45
  • I found the issue: on the cluster each worker tries to write to the same (shared) file. The workers are on different machines, which means different JVMs, so a synchronized file write won't work here; that's why the records are corrupt. I am also using NFS, which is an important factor. I actually want to acquire a file-level write lock before writing to the file; any pointers on this would be helpful. – Harshal Zope Oct 20 '16 at 10:53

1 Answer


I found the issue: on the cluster each worker tries to write to the same (shared) file. The workers are on different machines, which means different JVMs, so a synchronized file write won't work here; that's why the records are corrupt. I was also using NFS, which is an important factor.
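
A minimal sketch of the file-level write lock mentioned in the comments, using java.nio advisory locking, could look like the code below; whether the lock is actually honored between machines depends on the NFS server and client supporting network locking, so giving each task its own output file is usually the safer design.

    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.nio.channels.FileChannel;
    import java.nio.channels.FileLock;
    import java.nio.charset.StandardCharsets;

    public class LockedAppend {
        // Appends one line to a shared file while holding an exclusive advisory lock.
        public static void appendWithLock(String path, String line) throws IOException {
            try (FileOutputStream out = new FileOutputStream(path, true);
                 FileChannel channel = out.getChannel();
                 FileLock lock = channel.lock()) {   // blocks until the exclusive lock is granted
                out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
                out.flush();
            }                                        // lock released when the channel is closed
        }
    }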

Harshal Zope