In my Scalding job, I have code like this:

import com.twitter.scalding.{Args, Job}
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat

class MyJob(args: Args) extends Job(args) {
  FileInputFormat.setInputPathFilter(???, classOf[MyFilter])
  // ... rest of job ...
}

class MyFilter extends PathFilter {
  override def accept(path: Path): Boolean = true
}

My problem is that the first argument of the FileInputFormat.setInputPathFilter method needs to be of type org.apache.hadoop.mapreduce.Job. How can I access the Hadoop job object in my Scalding job?

fblundun

1 Answer

Disclaimer

There is no way to get at the Hadoop Job object. You can (but really should not!) get hold of the JobConf instead. With it you can call FileInputFormat.setInputPathFilter from the old mapred API (org.apache.hadoop.mapred.FileInputFormat, which takes an org.apache.hadoop.mapred.JobConf), and that lets you achieve the filtering.

But I suggest you do not do this. Read the end of the answer.

How can you do this?

Override the stepStrategy method of scalding.Job to supply a FlowStepStrategy. For example, this implementation changes the name of each MapReduce job:

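import java.util
import cascading.flow.{Flow, FlowStep, FlowStepStrategy}
import org.apache.hadoop.mapred.JobConf

// Goes inside your scalding.Job subclass.
override def stepStrategy: Option[FlowStepStrategy[_]] = Some(new FlowStepStrategy[AnyRef] {
  override def apply(flow: Flow[AnyRef], predecessorSteps: util.List[FlowStep[AnyRef]], step: FlowStep[AnyRef]): Unit =
    step.getConfig match {
      case conf: JobConf =>
        // here you can modify the JobConf of each job.
        conf.setJobName(...)
      case _ => // not a Hadoop step config: leave it alone
    }
})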
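
Applied to the question, the same hook could set the path filter instead of the job name. This is only a sketch under assumptions: it uses the old-API org.apache.hadoop.mapred.FileInputFormat (whose setInputPathFilter takes a JobConf), the body of MyFilter is a made-up example, and whether the filter actually takes effect depends on the Source and input format that Cascading ends up using for each step.

```scala
import java.util.{List => JList}

import cascading.flow.{Flow, FlowStep, FlowStepStrategy}
import com.twitter.scalding.{Args, Job}
import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

// Hypothetical filter body: skips paths whose name starts with "_" or ".".
// Hadoop instantiates the class reflectively, so it needs a no-arg constructor.
class MyFilter extends PathFilter {
  override def accept(path: Path): Boolean = {
    val name = path.getName
    !name.startsWith("_") && !name.startsWith(".")
  }
}

class MyJob(args: Args) extends Job(args) {
  // ... rest of job ...

  override def stepStrategy: Option[FlowStepStrategy[_]] =
    Some(new FlowStepStrategy[AnyRef] {
      override def apply(
          flow: Flow[AnyRef],
          predecessorSteps: JList[FlowStep[AnyRef]],
          step: FlowStep[AnyRef]): Unit =
        step.getConfig match {
          case conf: JobConf =>
            // Old (mapred) API: registers the filter class in this step's JobConf.
            // Note that this runs for every step, not only the one reading your input.
            FileInputFormat.setInputPathFilter(conf, classOf[MyFilter])
          case _ => () // not a Hadoop step config: nothing to do
        }
    })
}
```

Because the strategy is applied to every step of the flow, the filter is registered on all of them, which is one of the reasons given below for avoiding this approach.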

Why should one not do this?

Accessing the JobConf to add a path filter will work only with certain Sources and will break with others. You will also be mixing different levels of abstraction. And that is before the question of how you are supposed to know which JobConf you actually need to modify (most Scalding jobs I have seen are multi-step).

How should one resolve this problem?

I suggest you look closely at the type of Source you are using. I am fairly sure it offers a way to apply path filtering during or before Pipe (or TypedPipe) construction.

Oleksandr Pryimak