
I have developed code that reads very large files from FTP and writes them to the local machine using Java. The code that does this is shown below. It is part of next(Text key, Text value) inside the RecordReader of the CustomInputFormat.

    if(!processed)
    {
        System.out.println("in processed");
        in = fs.open(file);
        processed=true;
    }
    while(bytesRead <= fileSize) {

        byte buf[] = new byte[1024];

        try {
            in.read(buf);
            in.skip(1024);
            bytesRead+=1024;
            long diff = fileSize-bytesRead;
            if(diff<1024)
            {
                break;
            }
            value.set(buf, 0, 1024); // This is where the value of the record is set and it goes to the mapper.
        }
        catch(Exception e)
        {
            e.printStackTrace();
        }

    }
    if(diff<1024)
    {
        int difference= (int) (fileSize-bytesRead);

        byte buf[] = new byte[difference];
        in.read(buf);
        bytesRead+=difference;
    }

    System.out.println("closing stream");
    in.close();

After the write finishes, the transfer appears complete and the file at the destination is the same size as the source. However, I cannot open the file; the editor reports this error:

gedit has not been able to detect the character coding.
Please check that you are not trying to open a binary file.
Select a character coding from the menu and try again.

This question, "Java upload jpg using JakartaFtpWrapper - makes the file unreadable", is related to mine, I believe, but I couldn't make sense of it.

Any pointers?

RadAl

2 Answers


Your copying code is complete and utter 100% A grade nonsense. The canonical way to copy a stream in Java is as follows:

int count;
byte[] buffer = new byte[8192]; // or more if you like
while ((count = in.read(buffer)) > 0)
{
  out.write(buffer, 0, count);
}

Get rid of all the other fluff. It is just wasting time and space and clearly damaging your data in transit.
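To check that the loop above preserves data, here is a minimal self-contained sketch. The class name StreamCopyDemo and the in-memory streams are assumptions for illustration only; in the question, the input would be the FTP stream returned by fs.open(file).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.Arrays;

// Hypothetical demo class; not part of the original application.
final class StreamCopyDemo {
    // Copies all bytes from in to out and returns how many were copied.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int count;
        // read() reports how many bytes it actually delivered; write exactly that many.
        while ((count = in.read(buffer)) > 0) {
            out.write(buffer, 0, count);
            total += count;
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the remote file: 20,000 bytes of patterned data.
        byte[] source = new byte[20_000];
        for (int i = 0; i < source.length; i++) {
            source[i] = (byte) i;
        }
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long copied = copy(new ByteArrayInputStream(source), sink);
        System.out.println(copied + " bytes copied, identical: "
                + Arrays.equals(source, sink.toByteArray()));
    }
}
```

The same loop works unchanged if the sink is a FileOutputStream instead of a ByteArrayOutputStream, which is probably the better choice for very large files.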

user207421
  • Thanks for the answer. The code is part of an application that does more than copy data (though my problem lies in the copying part). The call in.read(buf); is where the copying takes place. The rest is required by the application, which needs pausing and resuming. The way I have put it probably makes it look nonsensical, but trust me, it's worth it. – RadAl Jan 02 '13 at 08:06
  • @RadAl Trust *me,* there are several pieces of nonsense in your code, starting with allocating a new buffer every time around the loop, calling skip(), a new code block to handle the final buffer load, ignoring the *result* returned by read(), ... I could go on. You may well have more to do than just copy bytes until EOS, but you need to study the correct and concise loop above to see why it works, and why your own code is fundamentally flawed. – user207421 Jan 02 '13 at 08:50
  • Also, I assume that out in your snippet is an output stream. I need to get the contents into a buffer. So how should I handle that? Write the contents from the input stream to a buffer in chunks? – RadAl Jan 02 '13 at 08:59
  • @RadAl You could use this code writing to a ByteArrayOutputStream, or you could just call value.set(buffer, 0, count), assuming that does the right things. But if the files are so large, I would spool them to disk first and process them later, rather than trying to fit them into memory. – user207421 Jan 02 '13 at 09:14

I see many problems with your code. It is a strange way to read a whole file. For example:

in.read(buf);
in.skip(1024);
bytesRead+=1024;

is wrong. in.read(buf) returns the number of bytes read and advances the stream position by that many bytes. So you don't need to skip; calling skip() here is an error, because read() has already positioned the stream.
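The effect of the extra skip() can be seen in a small sketch. The class SkipBugDemo and the in-memory input are assumptions for illustration; the loop only imitates the read-then-skip pattern from the question, not the full RecordReader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical demo class imitating the read-then-skip pattern.
final class SkipBugDemo {
    // Reads a block, keeps it, then skips a block, as the question's loop does.
    static byte[] readThenSkip(InputStream in, int blockSize) throws IOException {
        ByteArrayOutputStream kept = new ByteArrayOutputStream();
        byte[] buf = new byte[blockSize];
        int count;
        while ((count = in.read(buf)) > 0) {
            kept.write(buf, 0, count);
            in.skip(blockSize); // read() already advanced; this discards the next block
        }
        return kept.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] source = new byte[4096];
        for (int i = 0; i < source.length; i++) {
            source[i] = (byte) (i / 1024); // block 0 holds 0s, block 1 holds 1s, ...
        }
        byte[] kept = readThenSkip(new ByteArrayInputStream(source), 1024);
        // Only blocks 0 and 2 survive, so half the data is lost.
        System.out.println(kept.length + " of " + source.length + " bytes kept");
    }
}
```

With an in-memory stream every read fills the buffer, so exactly every second block is dropped. A network stream may return short reads, which makes the damage less regular but no less real.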

Verify the checksums of the two files (using MD5 or something similar) to be sure they are the same. I'm pretty sure that neither the checksums nor the file sizes match.
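One way to do that comparison from Java is sketched below, using the JDK's MessageDigest. The class ChecksumDemo and its method names are assumptions for illustration, and any paths passed to sameContent are placeholders for the real source and destination files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for comparing two files by MD5 digest.
final class ChecksumDemo {
    // Streams the input through MD5 and returns the digest as lowercase hex.
    static String md5Hex(InputStream in) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buffer = new byte[8192];
        int count;
        while ((count = in.read(buffer)) > 0) {
            md.update(buffer, 0, count);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    // Returns true when both files have the same MD5 digest.
    static boolean sameContent(Path a, Path b) throws IOException, NoSuchAlgorithmException {
        try (InputStream inA = Files.newInputStream(a);
             InputStream inB = Files.newInputStream(b)) {
            return md5Hex(inA).equals(md5Hex(inB));
        }
    }
}
```

Because the digest is computed while streaming, this works for files that do not fit in memory. The md5sum command-line tool gives the same answer if both files are reachable from a shell.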

You should use Apache Commons IO for file processing. Otherwise, look at the Oracle docs on file processing.
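As a dependency-free sketch of the second suggestion, the JDK's own java.nio.file.Files.copy can write a stream to a local file in one call. The class SpoolToDiskDemo and the target path are assumptions for illustration; Commons IO offers a similar one-line copy, but it is not used here so that the example needs only the JDK.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper that spools a stream to a local file.
final class SpoolToDiskDemo {
    // Writes the whole stream to target, replacing any existing file,
    // and returns the number of bytes written.
    static long spool(InputStream in, Path target) throws IOException {
        return Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The caller stays responsible for closing the input stream, for example with try-with-resources around the FTP stream.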

burna