
I am trying to read a large CSV file from S3. The file is 100 MB in GZip format, which I need to unzip and then read as CSV data.

So I found the answer below, and the following code snippet does the trick.

        // Stream the object from S3 and decompress it on the fly with GZIPInputStream
        S3Object fileObj = client.getObject(bucketName, repoPath);
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(fileObj.getObjectContent())));
        BufferedWriter fileWriter = new BufferedWriter(new FileWriter(new File("output.json")));

        String line = null;
        while ((line = reader.readLine()) != null) {
            // convert csv data to json
            fileWriter.write(line + "\n");
        }
        fileWriter.flush();
        fileWriter.close();
        reader.close();

I have two queries about the above code:

  1. Where does extraction happen: in a local system temp directory/JVM, or on S3?
  2. How does it solve the memory issue?

When using Spark, it takes more time, and I am not sure how to process a gz file in Spark.

ManojP
  • Extraction happens in memory, as you read it chunk by chunk; your code doesn't have a memory issue. What is the "memory" issue? A single file of 100 MB gzipped is not large enough to cause memory issues. Also, your question is tagged with apache-spark, but there is no mention of it in the question itself. – khachik May 01 '18 at 02:39
  • I am trying to get it done using spark – ManojP May 01 '18 at 03:45
  • What "memory issue"? – David Conrad May 01 '18 at 05:05
  • The S3 file is 22 MB; when I download it through the browser it is approx. 650 MB, but when I use Java GZIPInputStream it goes out of memory after 3 GB. Any idea? – Aadam Nov 08 '19 at 14:25
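
To illustrate the first comment above, here is a minimal sketch (not part of the original question) of the streaming behaviour: GZIPInputStream inflates the object in small chunks as lines are read, so only a buffer's worth of data is in memory at any time, and nothing is extracted to a temp directory or on S3. The client, bucketName, and repoPath names are taken from the question; the try-with-resources form and the StreamingGunzip class name are assumptions made to keep the example self-contained.

    import java.io.BufferedReader;
    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.InputStreamReader;
    import java.util.zip.GZIPInputStream;

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.model.S3Object;

    public class StreamingGunzip {
        // Sketch only: client, bucketName and repoPath correspond to the names in the question.
        static void copyDecompressed(AmazonS3 client, String bucketName, String repoPath) throws Exception {
            S3Object fileObj = client.getObject(bucketName, repoPath);
            try (BufferedReader reader = new BufferedReader(
                         new InputStreamReader(new GZIPInputStream(fileObj.getObjectContent())));
                 BufferedWriter writer = new BufferedWriter(new FileWriter("output.json"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Only the current line (plus the streams' internal buffers) is held in memory.
                    writer.write(line);
                    writer.newLine();
                }
            }
        }
    }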

1 Answer


I think you should first unzip the GZipped files and then read each text file, or the unzipped directory, using the SparkContext. Since Apache Spark uses the Hadoop FS APIs to read your files on S3, you should unzip them to take advantage of distributed processing.

For MapReduce, if you need your compressed data to be splittable: BZip2 is splittable, LZO and Snappy can be made splittable (LZO with an index, Snappy inside a container format such as SequenceFile), but GZip is not.

Once your data is unzipped, you can use the SparkContext to read the files as below:

    sparkContext.textFile("s3n://yourAccessKey:yourSecretKey@yourBucket/path/")
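
A hedged elaboration on the splittability point, not part of the original answer: Spark can also read a .gz file directly through the same Hadoop input formats, but since GZip is not splittable the whole file ends up in a single partition, which is one reason a Spark job over it feels slow. Below is a minimal sketch using Spark's Java API; the bucket, paths, and partition count are placeholder assumptions.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadGzFromS3 {
        public static void main(String[] args) {
            SparkConf conf = new SparkConf().setAppName("ReadGzFromS3");
            try (JavaSparkContext jsc = new JavaSparkContext(conf)) {
                // textFile() decompresses .gz input transparently, but the file
                // becomes a single partition because GZip is not splittable.
                JavaRDD<String> lines = jsc.textFile("s3n://your-bucket/path/data.csv.gz");

                // Repartition after reading so later stages can run in parallel;
                // 8 is only an illustrative partition count.
                JavaRDD<String> parallel = lines.repartition(8);

                System.out.println("line count: " + parallel.count());
            }
        }
    }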
wandermonk
  • In this case, the number of IO operations will be comparatively high because we are extracting the file on S3. I am still not clear on the first question: will it process my file in memory or on disk? – ManojP May 01 '18 at 10:26
  • When you load a file using a DataFrame, it is in-memory processing. – wandermonk May 01 '18 at 12:11
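
To expand on the last comment (a hedged sketch, not part of the original answer): the DataFrame reader can load the gzipped CSV directly and write it back out as JSON, which also covers the CSV-to-JSON step from the question. It assumes a Spark 2.x setup, that the CSV has a header row, and placeholder bucket/paths.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class CsvGzToJson {
        public static void main(String[] args) {
            SparkSession spark = SparkSession.builder().appName("CsvGzToJson").getOrCreate();

            // spark.read().csv() handles .gz input transparently; each gz file
            // is still read as a single partition because GZip is not splittable.
            Dataset<Row> df = spark.read()
                    .option("header", "true")              // assumption: the CSV has a header row
                    .csv("s3n://your-bucket/path/data.csv.gz");

            // Writing as JSON performs the CSV-to-JSON conversion mentioned in the question.
            df.write().json("s3n://your-bucket/output-json/");

            spark.stop();
        }
    }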