I am trying to read a large CSV file from S3. The file is about 100 MB in GZip format, which I need to decompress and then read as CSV data.
I found the answer below for this, and the following code snippet does the trick:
import com.amazonaws.services.s3.model.S3Object;
import java.io.*;
import java.util.zip.GZIPInputStream;

// client, bucketName and repoPath are set up elsewhere
S3Object fileObj = client.getObject(bucketName, repoPath);
try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(new GZIPInputStream(fileObj.getObjectContent())));
     BufferedWriter fileWriter = new BufferedWriter(new FileWriter(new File("output.json")))) {
    String line;
    while ((line = reader.readLine()) != null) {
        // convert csv data to json
        fileWriter.write(line + "\n");
    }
    fileWriter.flush();
}
I have two queries about the above code:
- Where does the extraction happen: on the local system (temp directory / JVM memory) or on S3?
- How does it solve the memory issue?
While using Spark, it takes more time, and I am not sure how to process a gz file in Spark.
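For reference, this is roughly what I am attempting on the Spark side (a minimal sketch, assuming the bucket is reachable via the s3a:// connector; the bucket and path names are placeholders):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ReadGzCsv {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("read-gz-csv")
                .getOrCreate();

        // Spark picks the gzip codec from the .gz extension, so no manual unzip is needed.
        Dataset<Row> df = spark.read()
                .option("header", "true")
                .csv("s3a://my-bucket/path/to/file.csv.gz"); // hypothetical path

        // Write the data back out as JSON (placeholder output path).
        df.write().json("s3a://my-bucket/path/to/output-json");
        spark.stop();
    }
}

My understanding is that a single .gz file is not splittable, so Spark reads it with a single task, which may be part of why it is slower for me.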