3
// list the paths of all zip files under hdfsPath
List<String> list = jsc.wholeTextFiles(hdfsPath).keys().collect();
for (String string : list) {
    System.out.println(string);
}

Here I am getting all the zip file paths. From here I am unable to proceed: how do I extract each file and store it in an HDFS path under a folder with the same name as the zip?

Mithfindel
Praveen Mandadi
  • I would suggest you go for native Java code to do the unzip. Spark can help you read the file using wholeTextFiles – Indrajit Swain Dec 08 '17 at 09:51

3 Answers

3

You can use something like the code below. The only thing is that we need to collect with zipFilesRdd.collect().forEach before writing the contents into HDFS; map and flatMap give a "task not serializable" error at this point.

public void readWriteZipContents(String zipLoc, String hdfsBasePath) {
    JavaSparkContext jsc = new JavaSparkContext(new SparkContext(new SparkConf()));
    JavaPairRDD<String, PortableDataStream> zipFilesRdd = jsc.binaryFiles(zipLoc);
    zipFilesRdd.collect().forEach(file -> {
        ZipInputStream zipStream = new ZipInputStream(file._2.open());
        ZipEntry zipEntry = null;
        try {
            while ((zipEntry = zipStream.getNextEntry()) != null) {
                String entryName = zipEntry.getName();
                if (!zipEntry.isDirectory()) {
                    // create the target path in HDFS and write the entry's contents
                    Configuration configuration = new Configuration();
                    configuration.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
                    configuration.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName());
                    FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), configuration);
                    FSDataOutputStream hdfsfile = fs.create(new Path(hdfsBasePath + "/" + entryName));
                    // use a fresh Scanner per entry; do not close it here, that would close zipStream
                    Scanner sc = new Scanner(zipStream);
                    while (sc.hasNextLine()) {
                        hdfsfile.writeBytes(sc.nextLine() + "\n");
                    }
                    hdfsfile.flush();
                    hdfsfile.close();
                }
                zipStream.closeEntry();
            }
        } catch (IllegalArgumentException | IOException e) {
            e.printStackTrace();
        } finally {
            try {
                zipStream.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    });
}
Praveen Mandadi
Amit Kumar
  • java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:8020/Logs, expected: file:/// – Praveen Mandadi Dec 11 '17 at 07:26
  • Where are you getting this exception? It seems like you are running Spark in local mode. Give the complete stack trace and say at which line you get the exception. – Amit Kumar Dec 11 '17 at 11:00
  • We are using the Spark library via Maven. I am reading the zip file from the local machine and extracting it to HDFS. – Praveen Mandadi Dec 11 '17 at 11:06
  • OK, if you are reading the zip file from local, then you need to give the path with "file:///", as the exception says. E.g. jsc.binaryFiles("file:///localLocation") – Amit Kumar Dec 11 '17 at 11:14
  • The local file can be read, but while saving the file to HDFS I get the error at FSDataOutputStream hdfsfile = FileSystem.get(jsc.hadoopConfiguration()).create(new Path(hdfsBasePath+"/"+entryName)); – Praveen Mandadi Dec 11 '17 at 11:19
  • Follow this, it should solve it I guess: https://stackoverflow.com/questions/32078441/wrong-fs-expected-file-when-trying-to-read-file-from-hdfs-in-java – Amit Kumar Dec 11 '17 at 11:23
  • It is solved after adding this: Configuration configuration = new Configuration(); configuration.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName()); configuration.set("fs.file.impl", org.apache.hadoop.fs.LocalFileSystem.class.getName()); FileSystem fs = FileSystem.get(URI.create("hdfs://localhost:8020"), configuration); FSDataOutputStream hdfsfile = fs.create(new Path(hdfsBasePath + "/" + entryName)); – Praveen Mandadi Dec 11 '17 at 11:31
2

With gzip files, wholeTextFiles should gunzip everything automatically. With zip files, however, the only way I know of is to use binaryFiles and to unzip the data by hand.

sc
    .binaryFiles(hdfsDir)
    .mapValues(x => {
        val result = scala.collection.mutable.ArrayBuffer.empty[String]
        val zis = new ZipInputStream(x.open())
        var entry: ZipEntry = null
        while ({entry = zis.getNextEntry(); entry} != null) {
            // a fresh Scanner per entry; reading stops at the end of the current entry
            val scanner = new Scanner(zis)
            while (scanner.hasNextLine()) { result += scanner.nextLine() }
        }
        zis.close()
        result
    })

This gives you a (pair) RDD[String, ArrayBuffer[String]] where the key is the name of the file on HDFS and the value is the unzipped content of the zip file (one line per element of the ArrayBuffer). If a given zip file contains more than one file, everything is aggregated. You may adapt the code to fit your exact needs. For instance, flatMapValues instead of mapValues would flatten everything into an RDD[String, String] to take advantage of Spark's parallelism, as sketched below.
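For reference, here is a minimal sketch of that flatMapValues variant; it is not from the original answer, and it assumes the same sc, hdfsDir, and imports as the snippet above:

import java.util.Scanner
import java.util.zip.{ZipEntry, ZipInputStream}

sc
    .binaryFiles(hdfsDir)
    .flatMapValues(x => {
        val result = scala.collection.mutable.ArrayBuffer.empty[String]
        val zis = new ZipInputStream(x.open())
        var entry: ZipEntry = null
        while ({entry = zis.getNextEntry(); entry} != null) {
            val scanner = new Scanner(zis)
            while (scanner.hasNextLine()) { result += scanner.nextLine() }
        }
        zis.close()
        result // each line becomes its own (fileName, line) record in the resulting RDD
    })

Note that the unzipping itself still happens once per zip file inside a single task; the flattening only means that downstream transformations operate on individual lines rather than on whole ArrayBuffers.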

Note also that in the while condition, {entry = zis.getNextEntry(); entry} != null could be replaced by (entry = zis.getNextEntry()) != null in Java. In Scala, however, the result of an assignment is Unit, so this would yield an infinite loop.
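To make that concrete, here is a tiny self-contained Scala sketch of the loop idiom; the zip path is just a made-up placeholder:

import java.io.FileInputStream
import java.util.zip.{ZipEntry, ZipInputStream}

object ZipLoopDemo {
  def main(args: Array[String]): Unit = {
    // placeholder path, only for illustration
    val zis = new ZipInputStream(new FileInputStream("/tmp/example.zip"))
    var entry: ZipEntry = null

    // works: the block evaluates to the freshly assigned entry, so the null test is meaningful
    while ({entry = zis.getNextEntry(); entry} != null) {
      println(entry.getName)
    }

    // would NOT terminate in Scala: (entry = zis.getNextEntry()) has type Unit,
    // and Unit is never null, so the condition would always be true:
    // while ((entry = zis.getNextEntry()) != null) { println(entry.getName) }

    zis.close()
  }
}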

Rishu Shrivastava
Oli
0

I came up with this solution, written in Scala.

Tested with Spark 2 (version 2.3.0.cloudera2) and Scala (version 2.11.8).

def extractHdfsZipFile(source_zip : String, target_folder : String,
    sparksession : SparkSession) : Boolean = {

    val hdfs_config = sparksession.sparkContext.hadoopConfiguration
    val buffer = new Array[Byte](1024)

    /*
     .collect -> runs on the driver only; the hdfs Configuration is not serializable
    */
    sparksession.sparkContext.binaryFiles(source_zip).collect.
      foreach{ zip_file: (String, PortableDataStream) =>
        // iterate over the collected zip files
        val zip_stream : ZipInputStream = new ZipInputStream(zip_file._2.open)
        var zip_entry: ZipEntry = null

        try {
          // iterate over all ZipEntry from ZipInputStream
          while ({zip_entry = zip_stream.getNextEntry; zip_entry != null}) {
            // skip directory
            if (!zip_entry.isDirectory()) {
              println(s"Extract File: ${zip_entry.getName()}, with Size: ${zip_entry.getSize()}")
              // create new hdfs file
              val fs : FileSystem = FileSystem.get(hdfs_config)
              val hdfs_file : FSDataOutputStream = fs.create(new Path(target_folder + "/" + zip_entry.getName()))

              var len : Int = 0
              // copy the entry's bytes until read() signals the end of the entry
              while({len = zip_stream.read(buffer); len > 0}) {
                hdfs_file.write(buffer, 0, len)
              }
              // flush and close hdfs_file
              hdfs_file.flush()
              hdfs_file.close()
            }
            zip_stream.closeEntry()
          }
          zip_stream.close()
        } catch {
          case zip : ZipException => {
            zip.printStackTrace()
            println("Please verify that you do not use compression type 9.")
            // for DEBUG throw exception
            //false
            throw zip
          }
          case e : Exception => {
            e.printStackTrace()
            // for DEBUG throw exception
            //false
            throw e
          }
        }
    }
    true
  }
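For completeness, a hedged usage sketch of calling this function; the SparkSession setup, paths, and app name below are assumptions for illustration, not part of the original answer:

import org.apache.spark.sql.SparkSession

// hypothetical driver code
val spark = SparkSession.builder()
  .appName("unzip-to-hdfs")
  .getOrCreate()

val ok = extractHdfsZipFile(
  source_zip = "hdfs:///data/incoming/*.zip",  // assumed input location (may be a glob)
  target_folder = "hdfs:///data/extracted",    // assumed output folder
  sparksession = spark)

println(s"Extraction finished: $ok")
spark.stop()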
whati001