With gzip files, wholeTextFiles should gunzip everything automatically.
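For instance (the path here is hypothetical):

// .gz contents arrive already decompressed,
// one (fileName, fileContent) pair per file
val texts = sc.wholeTextFiles("hdfs:///data/*.gz")  // RDD[(String, String)]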
With zip files however, the only way I know is to use binaryFiles and to unzip the data by hand.
import java.util.Scanner
import java.util.zip.{ZipEntry, ZipInputStream}

sc
  .binaryFiles(hdfsDir)
  .mapValues(x => {
    // Accumulate every line of every entry in the archive
    val result = scala.collection.mutable.ArrayBuffer.empty[String]
    val zis = new ZipInputStream(x.open())
    var entry: ZipEntry = null
    while ({entry = zis.getNextEntry(); entry} != null) {
      val scanner = new Scanner(zis)
      while (scanner.hasNextLine()) { result += scanner.nextLine() }
    }
    zis.close()
    result
  })
This gives you a pair RDD[(String, ArrayBuffer[String])] where the key is the name of the file on HDFS and the value is the unzipped content of the zip file (one line per element of the ArrayBuffer). If a given zip file contains more than one file, everything is aggregated into the same buffer. You can adapt the code to your exact needs: for instance, flatMapValues instead of mapValues flattens everything into an RDD[(String, String)], which lets Spark parallelize over individual lines (see the sketch below).
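A minimal sketch of that flattened variant, with the unzipping factored into a helper (unzipLines is a name introduced here for illustration):

import java.util.Scanner
import java.util.zip.{ZipEntry, ZipInputStream}
import org.apache.spark.input.PortableDataStream

// Collects every line of every entry in one zip archive
def unzipLines(pds: PortableDataStream): Seq[String] = {
  val result = scala.collection.mutable.ArrayBuffer.empty[String]
  val zis = new ZipInputStream(pds.open())
  var entry: ZipEntry = null
  while ({entry = zis.getNextEntry(); entry} != null) {
    val scanner = new Scanner(zis)
    while (scanner.hasNextLine()) { result += scanner.nextLine() }
  }
  zis.close()
  result.toSeq
}

// One (fileName, line) pair per line of every zipped file
val lines = sc.binaryFiles(hdfsDir).flatMapValues(unzipLines)
// lines: RDD[(String, String)]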
Note also that in the while condition, {entry = zis.getNextEntry(); entry} could be replaced by (entry = zis.getNextEntry()) in Java, where an assignment evaluates to the assigned value. In Scala, however, the result of an assignment is Unit, and Unit != null is always true, so that would yield an infinite loop.
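To illustrate, a standalone sketch (archive.zip is a hypothetical local path):

import java.io.FileInputStream
import java.util.zip.{ZipEntry, ZipInputStream}

val zis = new ZipInputStream(new FileInputStream("archive.zip"))
var entry: ZipEntry = null
// (entry = zis.getNextEntry()) has type Unit in Scala, so comparing it to
// null would always be true and loop forever. The block form evaluates to
// its last expression, the freshly read entry, so the test terminates:
while ({entry = zis.getNextEntry(); entry} != null) {
  println(entry.getName)
}
zis.close()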