I am trying to read from multiple files using Scalding and write the output to a single file. This is my source definition:
def getFilesSource(paths: Seq[String]) = {
  new MultipleTextLineFiles(paths: _*) {
    override protected def createHdfsReadTap(hdfsMode: Hdfs): Tap[JobConf, _, _] = {
      val taps = goodHdfsPaths(hdfsMode).toList.map {
        path => CastHfsTap(new Hfs(hdfsScheme, path, sinkMode))
      }
      taps.size match {
        case 0 => {
          CastHfsTap(new Hfs(hdfsScheme, hdfsPaths.head, sinkMode))
        }
        case 1 => taps.head
        case _ => new ScaldingMultiSourceTap(taps)
      }
    }
  }
}
When I run this code, the output is split into many files, each containing very little data (just a few KB). Instead, I want all of the output aggregated into a single file.
My Scalding job is:
val source = getFilesSource(mapped) // mapped is a Seq[String] of valid HDFS paths
TypedPipe.from(source)
  .map(a => Try {
    val json = JSON.parseObject(a)
    (json.getInteger("prop1"), json.getInteger("prop2"), json.getBoolean("prop3"))
  }.toOption)
  .filter(a => a.nonEmpty)
  .map(a => a.get)
  .filter(a => !a._3)
  .map(that => MyScaldingType(that._1, that._2))
  .write(MyScaldingType.typedSink(typedArgs))
I suspect I have to override the "sourceConfInit" method of ScaldingMultiSourceTap, but I don't know what to put in it.