
I have a copy of Programming MapReduce with Scalding by Antonios Chalkiopoulos. In the book he discusses the External Operations design pattern for Scalding code, and you can see an example on his website. I have chosen to use the Type Safe API. Naturally, this introduces new challenges, but I prefer it over the Fields API, which is what the book and the website focus on.

I am wondering how people have implemented the external operations pattern with the Type Safe API. My initial implementation is as follows:

1. I create a class that extends com.twitter.scalding.Job, which serves as my Scalding job class where I 'manage arguments, define taps, and use external operations to construct data processing pipelines'.

2. I create an object where I define the functions to be used in the Type Safe pipes. Because the Type Safe pipe operations take functions as arguments, I can simply pass the functions defined in the object to the pipes.

This creates code that looks like this:

import com.twitter.scalding._
import org.apache.hadoop.io.{LongWritable, Text}

class MyJob(args: Args) extends Job(args) {

  import MyOperations._

  val input_path = args(MyJob.inputArgPath)
  val output_path = args(MyJob.outputArgPath)

  val eventInput: TypedPipe[(LongWritable, Text)] = this.mode match {
    case m: HadoopMode => TypedPipe.from(WritableSequenceFile[LongWritable, Text](input_path))
    case _ => TypedPipe.from(WritableSequenceFile[LongWritable, Text](input_path))
  }

  val eventOutput: FixedPathSource with TypedSink[(LongWritable, Text)] with TypedSource[(LongWritable, Text)] = this.mode match {
    case m: HadoopMode => WritableSequenceFile[LongWritable, Text](output_path)
    case _ => TypedTsv[(LongWritable, Text)](output_path)
  }

  val validatedEvents: TypedPipe[(LongWritable, Either[Text, Event])] = eventInput.map(convertTextToEither).fork
  validatedEvents.filter(isEvent).map(removeEitherWrapper).write(eventOutput)
}

object MyOperations {

  def convertTextToEither(v: (LongWritable, Text)): (LongWritable, Either[Text, Event]) = {
    ...
  }

  def isEvent(v: (LongWritable, Either[Text, Event])): Boolean = {
    ...
  }

  def removeEitherWrapper(v: (LongWritable, Either[Text, Event])): (LongWritable, Text) = {
    ...
  }
}

As you can see, the functions passed to the Scalding Type Safe operations are kept separate from the job itself. While this is not as 'clean' as the external operations pattern presented in the book, it is a quick way to write this kind of code. Additionally, I can use JUnitRunner for job-level integration tests and ScalaTest for function-level unit tests.
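Because the operations live in a plain object, each one can be unit-tested as an ordinary function with no Scalding runtime involved. A minimal sketch of what such function-level tests could look like, using hypothetical simplified signatures (plain Long/String instead of Hadoop's LongWritable/Text, and an invented "EVENT:" prefix convention) rather than the real job's types:

```scala
// Hypothetical simplified stand-ins for the job's operations: plain String
// replaces Hadoop's Text so the functions can be tested without a cluster.
object MyOperations {
  // Lines starting with "EVENT:" are valid events; everything else is Left.
  def convertTextToEither(v: (Long, String)): (Long, Either[String, String]) =
    if (v._2.startsWith("EVENT:")) (v._1, Right(v._2.drop("EVENT:".length)))
    else (v._1, Left(v._2))

  def isEvent(v: (Long, Either[String, String])): Boolean = v._2.isRight

  def removeEitherWrapper(v: (Long, Either[String, String])): (Long, String) =
    (v._1, v._2.fold(identity, identity))
}

// Function-level assertions, runnable from any test framework (or a main):
object MyOperationsSpec extends App {
  import MyOperations._
  assert(convertTextToEither((1L, "EVENT:login")) == ((1L, Right("login"))))
  assert(isEvent((1L, Right("login"))))
  assert(!isEvent((2L, Left("garbage"))))
  assert(removeEitherWrapper((1L, Right("login"))) == ((1L, "login")))
  println("all function-level checks passed")
}
```

The point is only that pure functions need nothing from the job to be asserted on.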

The main point of this post, though, is to ask how people are doing this sort of thing. Documentation on the internet for the Scalding Type Safe API is sparse. Are there more functional, Scala-friendly ways of doing this? Am I missing a key component of the design pattern? I am somewhat nervous about this because with the Fields API you can write unit tests on pipes with ScaldingTest; as far as I know, you can't do that with TypedPipes. Please let me know if there is a generally agreed-upon pattern for the Scalding Type Safe API, or how you create reusable, modular, and testable Type Safe API code. Thanks for the help!

Update 2 after Antonios' reply

Thank you for the reply; that was basically the answer I was looking for, and I wanted to continue the conversation. The main issue I see in your answer, as I commented, is that the implementation expects a specific type, but what if the types change throughout your job? I have explored this code and it seems to work, but it feels hacked on:

def self: TypedPipe[Any]

def testingPipe: TypedPipe[(LongWritable, Text)] = self.map(
    (firstVar: Any) => {
        val tester = firstVar.asInstanceOf[(LongWritable, Text)]
        (tester._1, tester._2)
    }
)

The upside is that I declare only one implementation of self; the downside is the ugly type casting. Additionally, I have not tested this in depth with a more complex pipeline. So, basically: what are your thoughts on how to handle types as they change, with only one self implementation for cleanliness/brevity?

PhillipAMann

2 Answers


Scala extension methods are implemented using implicit classes. You give the compiler the ability to convert a TypedPipe into a (wrapper) class that contains your external operations:

import com.twitter.scalding._
import cascading.flow.FlowDef

class MyJob(args: Args) extends Job(args) {

  implicit class MyOperationsWrapper(val self: TypedPipe[Double]) extends MyOperations with Serializable

  val pipe = TypedPipe.from(TypedTsv[Double](args("input")))

  val result = pipe
    .operation1
    .operation2(x => x*2)
    .write(TypedTsv[Double](args("output")))

}

trait MyOperations {

  def self: TypedPipe[Double]

  def operation1(implicit fd: FlowDef): TypedPipe[Double] =
    self.map { x =>
      println(s"Input: $x")
      x / 100
    }

  def operation2(datafn: Double => Double)(implicit fd: FlowDef): TypedPipe[Double] =
    self.map { x =>
      val result = datafn(x)
      println(s"Result: $result")
      result
    }

}

import org.apache.hadoop.util.ToolRunner
import org.apache.hadoop.conf.Configuration
import com.twitter.scalding.Tool

object MyRunner extends App {

  ToolRunner.run(new Configuration(), new Tool, (classOf[MyJob].getName :: "--local" ::
    "--input" :: "doubles.tsv" ::
    "--output":: "result.tsv" :: args.toList).toArray)

}
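The wiring above is plain Scala, so the mechanism can be seen without any Scalding dependency. A minimal sketch, letting List[Double] stand in for TypedPipe[Double] (an analogy for illustration, not the real pipe API):

```scala
// Operations trait: only requires a `self` to operate on.
trait MyOperations {
  def self: List[Double] // stand-in for TypedPipe[Double]

  def operation1: List[Double] = self.map(_ / 100)

  def operation2(datafn: Double => Double): List[Double] = self.map(datafn)
}

object MyOperations {
  // The implicit class teaches the compiler to wrap any List[Double]
  // in MyOperations, so the operations chain as if they were pipe methods.
  implicit class MyOperationsWrapper(val self: List[Double]) extends MyOperations
}

object Demo extends App {
  import MyOperations._
  // Each operation returns a bare List, which is re-wrapped implicitly
  // for the next call, exactly how the TypedPipe version chains.
  val result = List(10.0).operation1.operation2(_ * 2)
  println(result) // List(0.2)
}
```

The re-wrapping on every call is what lets the chain survive operations that return a plain pipe.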

Regarding how to manage types across the pipes, my recommendation would be to work out some basic types that make sense and use case classes. To use your example, I would rename the method convertTextToEither to extractEvents:

case class LogInput(l: Long, text: Text)
case class Event(data: String)

def self: TypedPipe[LogInput]

def extractEvents(implicit fd: FlowDef): TypedPipe[Event] =
  self.filter(isEvent)
      .map(line => getEvent(line.text))

Then you would have

  • LogInputOperations for LogInput types
  • EventOperations for Event types
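Sketching that suggestion with plain Scala collections standing in for TypedPipe (all names and the "EVENT:" convention here are illustrative): each element type gets its own operations trait, and the implicit wrappers let one pipeline chain across the type change.

```scala
case class LogInput(offset: Long, text: String)
case class Event(data: String)

trait LogInputOperations {
  def self: List[LogInput] // stand-in for TypedPipe[LogInput]
  // The element type changes here: a LogInput stage produces an Event stage.
  def extractEvents: List[Event] =
    self.filter(_.text.startsWith("EVENT:"))
        .map(line => Event(line.text.drop("EVENT:".length)))
}

trait EventOperations {
  def self: List[Event] // stand-in for TypedPipe[Event]
  def normalize: List[Event] = self.map(e => Event(e.data.toLowerCase))
}

object Wrappers {
  implicit class LogInputWrapper(val self: List[LogInput]) extends LogInputOperations
  implicit class EventWrapper(val self: List[Event]) extends EventOperations
}

object PipelineDemo extends App {
  import Wrappers._
  // LogInputOperations applies before the type change, EventOperations after.
  val events = List(LogInput(0L, "EVENT:Login"), LogInput(1L, "noise"))
    .extractEvents
    .normalize
  println(events) // List(Event(login))
}
```

The compiler picks the wrapper matching the current element type, so no single `self: TypedPipe[Any]` plus casting is needed.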
  • Hello Antonios, I am honored you gave me an answer. There is one issue with this kind of implementation. What if your types change as you run operations on data? In the case of my pipeline, the pipes all work with tuples but the tuples pass different data types as transformations occur. I will experiment a bit with what you posted though. Thanks again! – PhillipAMann Jan 09 '16 at 23:31
  • I used your code example and was receiving cascading.flow.planner.PlannerException: could not build flow from assembly: [Neither Java nor Kyro works for class: class com.twitter.scalding.typed.MapFn instance: export CHILL_EXTERNALIZER_DEBUG=true to see both stack traces] – PhillipAMann Jan 13 '16 at 01:02
  • 1
    I added to my answer a trivial runner. Tested with scalding 0.10 and getting correct results. i.e. For input: 10.0 the output is 0.2 – Antonios Chalkiopoulos Jan 13 '16 at 10:35
  • I wish I could upvote more. I'll try this as soon as I come into the office. Thanks for the suggestion on case classes and operations on certain case classes. I am a Scala neophyte as well which makes this doubly challenging but it's great to learn new things. I'll report back how things work out. – PhillipAMann Jan 13 '16 at 17:03
  • Hi Antonios, last question but why the implicit FlowDef? Where did you get the intuition to add this? To get counters to work with our Fields API pipes, we use (implicit uuid: Option[UniqueID]). I am trying to get counters to work now as well. I have not seen any reference to this with TypeSafe API before. – PhillipAMann Jan 13 '16 at 17:57
  • I have solved the issue with my exception. It relates to ObjectMapper and how I called it inside of my ExternalOperations trait and how Cascading/Scalding serializes it. The clue my colleague showed me was related to this line: at com.twitter.chill.Externalizer.maybeWriteJavaKryo(Externalizer.scala:182). We will compile our notes and observations on Scalding gotchas and write a blog post to benefit the community. – PhillipAMann Jan 13 '16 at 20:22

I am not sure what problem you see with the snippet you showed, or why you think it is "less clean". It looks fine to me.

As for the unit testing jobs using typed API question, take a look at JobTest, it seems to be just what you are looking for.
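For reference, a hedged sketch of what a JobTest for the typed job in the other answer might look like (untested here; details such as `sink` vs `typedSink` vary by Scalding version):

```scala
import com.twitter.scalding._

// Hypothetical JobTest for the MyJob class from the other answer.
JobTest(new MyJob(_))
  .arg("input", "doubles.tsv")
  .arg("output", "result.tsv")
  .source(TypedTsv[Double]("doubles.tsv"), List(10.0))
  .sink[Double](TypedTsv[Double]("result.tsv")) { buffer =>
    // operation1 divides by 100, operation2 doubles: 10.0 becomes 0.2
    assert(buffer.toList == List(0.2))
  }
  .run
  .finish
```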

Dima
  • Maybe I am really neurotic about code quality. Thanks for your feedback. I will look into JobTest in the next day when I write my unit tests. – PhillipAMann Jan 08 '16 at 22:29