Disclaimer
There is no way to extract the Hadoop Job class. But you can (though you should never do it!) extract the JobConf. After that you will be able to use FileInputFormat.setInputPathFilter from the mapred (v1) API (org.apache.hadoop.mapred.JobConf), which will let you achieve the filtering.
But I suggest you not do this. Read the end of the answer.
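For reference, a minimal sketch of what such a filter could look like, assuming a made-up SkipTmpFilter class and an illustrative "skip paths containing _tmp" rule:

import org.apache.hadoop.fs.{Path, PathFilter}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

// Hypothetical filter: the class name and the "_tmp" rule are assumptions for illustration.
class SkipTmpFilter extends PathFilter {
  override def accept(path: Path): Boolean = !path.getName.contains("_tmp")
}

// Registers the filter class on a JobConf you somehow got hold of.
def addInputPathFilter(conf: JobConf): Unit =
  FileInputFormat.setInputPathFilter(conf, classOf[SkipTmpFilter])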
How can you do this?
Override the stepStrategy method of scalding.Job to provide a FlowStepStrategy. For example, this implementation changes the name of the mapreduce job:
import java.util

import cascading.flow.{Flow, FlowStep, FlowStepStrategy}
import org.apache.hadoop.mapred.JobConf

override def stepStrategy: Option[FlowStepStrategy[_]] = Some(new FlowStepStrategy[AnyRef] {
  override def apply(flow: Flow[AnyRef], predecessorSteps: util.List[FlowStep[AnyRef]], step: FlowStep[AnyRef]): Unit =
    step.getConfig match {
      case conf: JobConf =>
        // here you can modify the JobConf of each job
        conf.setJobName(...)
      case _ =>
    }
})
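If one insisted on doing the path filtering this way (again, not recommended), the JobConf branch above would call FileInputFormat.setInputPathFilter(conf, classOf[SkipTmpFilter]) (from org.apache.hadoop.mapred) instead of setJobName, using, for instance, the hypothetical SkipTmpFilter sketched in the disclaimer.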
Why should one not do this?
Accessing the JobConf to add path filtering will only work if you are using specific Sources, and will break if you are using others. You will also be mixing different levels of abstraction. And that is before the question of how you are supposed to know which JobConf you actually need to modify (most Scalding jobs I have seen are multi-step).
How should one resolve this problem?
I suggest you look closely at the type of Source you are using. I am pretty sure it offers a way to apply path filtering during or before Pipe (or TypedPipe) construction.
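As one hedged illustration of that direction, you can resolve and filter the input paths yourself before handing them to a multi-path source such as MultipleTextLineFiles; the glob pattern, the "_tmp" rule and the job layout below are assumptions, not the only way to do it:

import com.twitter.scalding._
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

class FilteredInputJob(args: Args) extends Job(args) {
  // Resolve the candidate paths up front and drop the ones we do not want.
  // The glob and the "_tmp" rule are illustrative assumptions.
  private val fs = FileSystem.get(new Configuration())
  private val inputs: Seq[String] =
    fs.globStatus(new Path(args("input") + "/*"))
      .toSeq
      .map(_.getPath)
      .filterNot(_.getName.contains("_tmp"))
      .map(_.toString)

  // The pre-filtered path list goes straight into a multi-path source,
  // so no JobConf surgery is needed.
  TypedPipe.from(MultipleTextLineFiles(inputs: _*))
    .write(TypedTsv[String](args("output")))
}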