
I have run into an issue: I am trying to read from multiple files using Scalding and write the output to a single file. My code is this:

def getFilesSource(paths: Seq[String]) = {
  new MultipleTextLineFiles(paths: _*) {
    override protected def createHdfsReadTap(hdfsMode: Hdfs): Tap[JobConf, _, _] = {
      // Build one Hfs tap per existing HDFS path
      val taps = goodHdfsPaths(hdfsMode).toList.map { path =>
        CastHfsTap(new Hfs(hdfsScheme, path, sinkMode))
      }

      taps.size match {
        case 0 => CastHfsTap(new Hfs(hdfsScheme, hdfsPaths.head, sinkMode)) // no valid path: fall back to the first
        case 1 => taps.head
        case _ => new ScaldingMultiSourceTap(taps)
      }
    }
  }
}

But when I run this code, it splits my output into MANY files, each containing very little data: just a few KB. Instead, I want to aggregate all output into a single file.

My scalding code is:

val source = getFilesSource(mapped) // where mapped is a sequence of valid HDFS paths (Seq[String])

TypedPipe.from(source)
  .map(a => Try {
    val json = JSON.parseObject(a)
    (json.getInteger("prop1"), json.getInteger("prop2"), json.getBoolean("prop3"))
  }.toOption)
  .filter(a => a.nonEmpty)
  .map(a => a.get)
  .filter(a => !a._3)
  .map(that => MyScaldingType(that._1, that._2))
  .write(MyScaldingType.typedSink(typedArgs))
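As an aside, the `map(Try(...).toOption)` / `filter(_.nonEmpty)` / `map(_.get)` chain above can be collapsed into a single `flatMap`. A minimal, self-contained sketch of that pattern in plain Scala (the comma-split parse below is just a stub standing in for `JSON.parseObject`, and `FlatMapSketch` is a made-up name):

```scala
import scala.util.Try

object FlatMapSketch {
  // Stand-in parser producing (prop1, prop2, prop3); returns None on any failure,
  // exactly like the Try(...).toOption step in the original pipeline.
  def parse(line: String): Option[(Int, Int, Boolean)] = Try {
    val parts = line.split(",") // stub for JSON.parseObject
    (parts(0).toInt, parts(1).toInt, parts(2).toBoolean)
  }.toOption

  def run(lines: Seq[String]): Seq[(Int, Int)] =
    lines
      .flatMap(parse)    // replaces map(Try(...).toOption).filter(_.nonEmpty).map(_.get)
      .filter(!_._3)     // keep records where prop3 is false
      .map(t => (t._1, t._2))
}
```

`flatMap` over an `Option` drops the `None`s and unwraps the `Some`s in one step, which is the idiomatic Scala (and Scalding `TypedPipe`) way to express "parse, discarding failures".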

I guess I have to override the "sourceConfInit" method of ScaldingMultiSourceTap, but I don't know what to write inside it...

George Lica

1 Answer


You can use groupAll to send all the map outputs to a single reducer (the job is map-only), which is acceptable since the data is small, and then do the write. The output will be written to a single file.

...
.filter(a => !a._3)
.map(that => MyScaldingType(that._1, that._2))
.groupAll
.write(MyScaldingType.typedSink(typedArgs))
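To see why groupAll yields a single file, here is a plain-Scala analogy (this is not the Scalding API; `GroupAllSketch` and its `groupAll` are illustrative names): every record is keyed by the same unit key, so the whole dataset lands in one group, which on a cluster means one reducer and therefore one output part file.

```scala
object GroupAllSketch {
  // Analogy for Scalding's groupAll using standard collections:
  // keying every element by the same unit key produces exactly one group.
  def groupAll[T](pipe: Seq[T]): Map[Unit, Seq[T]] =
    pipe.groupBy(_ => ()) // one key => one group => one "reducer"

  def main(args: Array[String]): Unit = {
    val mapped = Seq((1, 2), (3, 4), (5, 6)) // stand-in for the filtered records
    val groups = groupAll(mapped)
    println(groups.size)       // a single group
    println(groups(()).length) // containing all three records
  }
}
```

The trade-off is that the single reducer must hold the whole dataset, so this is only appropriate when, as the answer notes, the data after filtering is small.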
karthikcru
  • Hi @karthikcru, thanks for answering; this sounds promising. I will try it in the morning. Since I perform a filter in the map phase (using the statement .filter(a => !a._3)), for my business case a lot of data will not pass that filter criteria, so only the remaining records will be sent to the single reducer. – George Lica Dec 28 '16 at 22:15