
I have a use case where I want to upload big gzipped text data files (~60 GB) to HDFS.

My code below is taking about 2 hours to upload these files in chunks of 500 MB. Following is the pseudo code. I was checking if somebody could help me reduce this time:

    int fileFetchBuffer = 500000000;    // 500 MB per output chunk
    System.out.println("file fetch buffer is: " + fileFetchBuffer);
    int offset = 0;
    int bytesRead = -1;

    try {
        fileStream = new FileInputStream(file);
        if (fileName.endsWith(".gz")) {
            stream = new GZIPInputStream(fileStream);

            String[] fileN = fileName.split("\\.");
            System.out.println("fil 0 : " + fileN[0]);
            System.out.println("fil 1 : " + fileN[1]);
            //logger.info("First line is: " + streamBuff.readLine());

            byte[] buffer = new byte[fileFetchBuffer];

            FileSystem fs = FileSystem.get(conf);

            while (true) {
                int charsLeft = fileFetchBuffer;
                logger.info("charsLeft outside while: " + charsLeft);

                FSDataOutputStream dos = null;
                while (charsLeft != 0) {
                    bytesRead = stream.read(buffer, 0, charsLeft);
                    if (bytesRead < 0) {
                        break;    // end of input; the current chunk is closed below
                    }
                    offset = offset + bytesRead;
                    charsLeft = charsLeft - bytesRead;
                    logger.info("offset in record: " + offset);
                    logger.info("charsLeft: " + charsLeft);
                    logger.info("bytesRead in record: " + bytesRead);
                    //prettyPrintHex(buffer);

                    // Lazily create the output file for this chunk on the first successful read
                    if (dos == null) {
                        String outFileStr = Utils.getOutputFileName(
                                stagingDir,
                                fileN[0],
                                outFileNum);
                        Path outFile = new Path(outFileStr);
                        if (fs.exists(outFile)) {
                            fs.delete(outFile, false);
                        }
                        dos = fs.create(outFile);
                    }

                    dos.write(buffer, 0, bytesRead);
                }

                logger.info("done writing: " + outFileNum);
                if (dos != null) {
                    dos.flush();
                    dos.close();
                }

                if (bytesRead < 0) {
                    break;    // whole input consumed
                }

                outFileNum++;

            } // end of outer while

        } else {
            // Assume uncompressed file
            stream = fileStream;
        }

    } catch (FileNotFoundException e) {
        logger.error("File not found" + e);
    }
user656189

2 Answers


You should consider using the excellent Apache Commons IO package.

It has a method

IOUtils.copy( InputStream, OutputStream )

that would tremendously reduce the time needed to copy your files.
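As a rough illustration of that idea (not part of the original answer), here is a minimal sketch that streams the local gzipped file straight into HDFS without decompressing it. It assumes Commons IO and the Hadoop client are on the classpath; the class name and paths are made up for the example.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCopy {
    // Hypothetical usage: java HdfsCopy /local/data/big.gz /staging/big.gz
    public static void main(String[] args) throws IOException {
        FileSystem fs = FileSystem.get(new Configuration());
        InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
        OutputStream out = fs.create(new Path(args[1]));
        try {
            // copyLarge handles streams bigger than 2 GB; plain copy() also
            // streams everything but its int return value would overflow.
            IOUtils.copyLarge(in, out);
        } finally {
            IOUtils.closeQuietly(in);
            IOUtils.closeQuietly(out);
        }
    }
}

Note that this copies the file as one piece; splitting it into fixed-size chunks still has to be layered on top, as discussed in the comments below.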

Snicolas
  • @Snicolas - how do I split the InputStream? For example, 60 GB has to be uploaded in 1 GB chunks. How would this function know from where in the InputStream to copy? – user656189 Jun 22 '11 at 16:30
  • You could consider subclassing FilterInputStream to make a new class that reads your original input stream from a certain offset, and no further than 1 GB (a minimal sketch of that idea appears after this comment thread). – Snicolas Jun 22 '11 at 16:42
  • Another option could be to use FileChannels and the transferTo method; that would be quite efficient too. – Snicolas Jun 22 '11 at 16:42
  • @Snicolas - how can I create a FileChannel on a gzipped input stream? – user656189 Jun 22 '11 at 17:12
  • @user656189 do you need to upload your files unzipped? Or do you just want to put your zipped file of 60 GB in slices of 1 GB? It's not the same problem. – Snicolas Jun 22 '11 at 17:50
  • @Snicolas - unzipped is fine for me; zipped is even better. The bigger question for me is the time right now. With the above code it is taking 2.5 hours to upload the 60 GB zipped file into 770 uncompressed files of 500 MB each. – user656189 Jun 22 '11 at 18:07
  • @user656189 and why do you need to split it in chunks of 1 Gb ? – Snicolas Jun 22 '11 at 18:10
  • @user656189 and what is taking most of the time : network transfer or unzipping operations ? What's your maximum transfer rate ? – Snicolas Jun 22 '11 at 18:11
  • @Snicolas - it is a requirement I was given to split into 1 GB chunks, as the data needs to be processed in parallel in Hadoop. Most of the time is going in stream.read() in the while loop above, since it reads in some internal buffer size and I see it read a few KB at a time until it fills the byte buffer completely. Network transfer is not the bottleneck. – user656189 Jun 22 '11 at 18:21
  • @user656189 so you're spending most of your time uncompressing data from your input stream. If you don't use a GZIPInputStream, but a simple BufferedInputStream, you could transfer your gzipped file without uncompressing it. This would be much faster. – Snicolas Jun 22 '11 at 18:23
  • @user656189 what needs to be parallelized in your app on the side of the remote computer? File extraction? Unzipping? Treatment of data? – Snicolas Jun 22 '11 at 18:25
  • Treatment of data, after the gzipped file is split into multiple chunks of fixed size. – user656189 Jun 22 '11 at 18:28
  • @user656189 sorry, it's really not clear enough for me; it would take like 100 questions and answers to find a better answer from my side ... http://stackoverflow.com/questions/1533330/writing-data-to-hadoop – Snicolas Jun 22 '11 at 18:32
  • @Snicolas - thanks for sharing this. I am doing almost the same as what this link says, but my problem is in splitting the 60 GB gzipped file into chunks of 1 GB each (as I wrote in my original description, which is all Java code). So I thought of asking here. I was trying to find a better way than using in.read() to get data off the disk. I think writing to the output file is not the bottleneck. – user656189 Jun 22 '11 at 18:40
  • @user656189 no for sure, the bottleneck seems to be the unzipping of your 60 Gig file. So either you can slice it (without decompressing) and send slices on the network (it means that you will have to wait for the whole transfer before uncompressing) OR you unzip it, send individual files (or retarred file sets) and transfer it on the network so that every client in your cluster can do some work independently of the rest of the files. – Snicolas Jun 22 '11 at 23:14
  • @Snicolas - any pointers on how to slice it without decompressing ? – user656189 Jun 23 '11 at 00:09
  • @user656189 sure, but then , you will have to wait for all the slices to reach the server, then reassemble the slices together before being able to uncompress it. Is it ok for you ? – Snicolas Jun 23 '11 at 02:05
  • @Snicolas - yes, how can i do it ? – user656189 Jun 23 '11 at 05:29
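To make the FilterInputStream suggestion above concrete, here is a minimal, hypothetical sketch of a wrapper that stops reporting data after a fixed number of bytes (the class name is made up for illustration; Commons IO ships a similar BoundedInputStream). Skipping to a chunk's starting offset can be done with InputStream.skip() before wrapping, and each slice can then be streamed out independently.

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical wrapper: delegates to the underlying stream but reports
 *  end-of-stream once 'limit' bytes have been read. */
public class BoundedInputStream extends FilterInputStream {

    private long remaining;

    public BoundedInputStream(InputStream in, long limit) {
        super(in);
        this.remaining = limit;
    }

    @Override
    public int read() throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int b = super.read();
        if (b >= 0) {
            remaining--;
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (remaining <= 0) {
            return -1;
        }
        int n = super.read(buf, off, (int) Math.min(len, remaining));
        if (n > 0) {
            remaining -= n;
        }
        return n;
    }
}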

I tried with a buffered input stream and saw no real difference. I suppose a file channel implementation could be even more efficient (a rough sketch of that idea follows the code below). Tell me if it's not fast enough.

package toto;

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class Slicer {

    private static final int BUFFER_SIZE = 50000;

    public static void main(String[] args) {

        try 
        {
            slice( args[ 0 ], args[ 1 ], Long.parseLong( args[2]) );
        }//try
        catch (IOException e) 
        {
            e.printStackTrace();
        }//catch
        catch( Exception ex )
        {
            ex.printStackTrace();
            System.out.println( "Usage :  toto.Slicer <big file> <chunk name radix > <chunks size>" );
        }//catch
    }//met

    /**
     * Slices a huge files in chunks.
     * @param inputFileName the big file to slice.
     * @param outputFileRadix the base name of slices generated by the slicer. All slices will then be numbered outputFileRadix0,outputFileRadix1,outputFileRadix2...
     * @param chunkSize the size of chunks in bytes
     * @return the number of slices.
     */
    public static int slice( String inputFileName, String outputFileRadix, long chunkSize ) throws IOException
    {
        //I would add some code to pretty-print the output file names,
        //i.e. add a couple of leading 0s before chunkNumber in the output file name
        //so that they all have the same number of chars
        //(use java.io.File for that: estimate the number of chunks, take the power of 10, get the number of leading 0s)

        //just to get some stats
        long timeStart = System.currentTimeMillis();
        long timeStartSlice = timeStart;
        long timeEnd = 0;

        //io streams and chunk counter
        int chunkNumber = 0;
        FileInputStream fis = null;
        FileOutputStream fos = null;

        try 
        {
            //open files
            fis = new FileInputStream( inputFileName );
            fos = new FileOutputStream( outputFileRadix + chunkNumber );

            //declare state variables
            boolean finished = false;
            byte[] buffer = new byte[ BUFFER_SIZE ];
            int bytesRead = 0;
            long bytesInChunk = 0;


            while( !finished )
            {
                //System.out.println( "bytes to read " +(int)Math.min( BUFFER_SIZE, chunkSize - bytesInChunk ) );
                bytesRead = fis.read( buffer,0, (int)Math.min( BUFFER_SIZE, chunkSize - bytesInChunk ) );

                if( bytesRead == -1 )
                    finished = true;
                else
                {
                    fos.write( buffer, 0, bytesRead );
                    bytesInChunk += bytesRead;
                    if( bytesInChunk == chunkSize )
                    {
                        if( fos != null )
                        {
                            fos.close();
                            timeEnd = System.currentTimeMillis();
                            System.out.println( "Chunk "+chunkNumber + " has been generated in "+ (timeEnd - timeStartSlice) +" ms");
                            chunkNumber ++;
                            bytesInChunk = 0;
                            timeStartSlice = timeEnd;
                            System.out.println( "Creating slice number " + chunkNumber );
                            fos = new FileOutputStream( outputFileRadix + chunkNumber );
                        }//if
                    }//if
                }//else
            }//while
        }
        catch (Exception e) 
        {
            System.out.println( "A problem occured during slicing : " );
            e.printStackTrace();
        }//catch
        finally 
        {
            //whatever happens close all files
            System.out.println( "Closing all files.");
            if( fis != null )
                fis.close();
            if( fos != null )
                fos.close();
        }//fin

        timeEnd = System.currentTimeMillis();
        System.out.println( "Total slicing time : " + (timeEnd - timeStart) +" ms" );
        System.out.println( "Total number of slices "+ (chunkNumber +1) );

        return chunkNumber+1;
    }//met
}//class
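For completeness, here is a rough sketch of the file-channel variant mentioned at the top of this answer. It is not part of the original code: the class name is made up, it assumes Java 7's try-with-resources, and error handling is kept minimal. FileChannel.transferTo lets the operating system move the bytes, which avoids copying them through a Java byte[] buffer.

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public class ChannelSlicer {

    /** Slices inputFileName into files of at most chunkSize bytes and returns the slice count. */
    public static int slice( String inputFileName, String outputFileRadix, long chunkSize ) throws IOException
    {
        try (FileChannel in = new RandomAccessFile( inputFileName, "r" ).getChannel())
        {
            long size = in.size();
            long position = 0;
            int chunkNumber = 0;
            while( position < size )
            {
                long toTransfer = Math.min( chunkSize, size - position );
                try (FileChannel out = new FileOutputStream( outputFileRadix + chunkNumber ).getChannel())
                {
                    long transferred = 0;
                    //transferTo may move fewer bytes than asked for, so loop until the chunk is complete
                    while( transferred < toTransfer )
                    {
                        transferred += in.transferTo( position + transferred, toTransfer - transferred, out );
                    }//while
                }//try
                position += toTransfer;
                chunkNumber++;
            }//while
            return chunkNumber;
        }//try
    }//met
}//class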

Greetings, Stéphane

Snicolas
  • @Snicolas - Thanks a lot for spending time on this! Isn't this the same as my solution? Or would this be faster? – user656189 Jun 23 '11 at 16:15
  • Tremendously faster, I don't uncompress the 60 GB tar file... Please have a try. If my times are better, accept my answer ;) – Snicolas Jun 23 '11 at 16:43
  • I will try, but the thing is: will it split on line boundaries? – user656189 Jun 23 '11 at 16:57
  • Nope, it just slices the file. As I don't uncompress the file, there is no way it can be sliced on any boundaries; you just get slices of equal and predefined size. I thought that was what we agreed on yesterday. That's the best I can do. – Snicolas Jun 23 '11 at 17:30
  • @Snicolas - I agree. I will run it. I can feel it is going to be faster. So basically now we have compressed slices, right? – user656189 Jun 23 '11 at 18:16
  • yes, but there is no way to uncompress them before you join them on a server. – Snicolas Jun 23 '11 at 18:20