
I have a simple Scalding program that transforms some data, which I execute using com.twitter.scalding.Tool in local mode.

val start = System.nanoTime

val inputPaths = args("input").split(",").toList
val pipe = Tsv(inputPaths(0))
  // standard pipe operations on my data like .filter('myField), etc.
  .write(Tsv(args("output")))

println("running time: " + (System.nanoTime - start) / 1e6 + " ms")

I would like to measure the running time of the program. I used the standard trick of recording the time at the beginning and end of the code; however, the result is ~100 ms, while the actual run takes closer to 60 s. What is the best way to do this? Thanks!

Yuri Brovman

2 Answers


One approach that has worked for me is to use micro-benchmarks.

Currently for Scala programs you can use http://scalameter.github.io/

It takes GC into account as well as warming up the JVM. I think it should work in local mode on a single JVM.
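If you'd rather not pull in a dependency, the core idea Scalameter applies (warm-up runs before timing, then averaging over several measured runs) can be sketched in plain Scala. The names below (TimingSketch, measureMs) are mine for illustration, not part of any library:

```scala
object TimingSketch {
  // Run the block `warmups` times untimed to let the JIT warm up,
  // then average the wall-clock time over `runs` timed executions.
  def measureMs[T](warmups: Int, runs: Int)(body: => T): Double = {
    (1 to warmups).foreach(_ => body)       // discard warm-up runs
    val start = System.nanoTime
    (1 to runs).foreach(_ => body)
    (System.nanoTime - start) / 1e6 / runs  // average milliseconds per run
  }

  def main(args: Array[String]): Unit = {
    val avgMs = measureMs(warmups = 3, runs = 5) {
      (1 to 100000).map(i => math.sqrt(i.toDouble)).sum
    }
    println(f"average running time: $avgMs%.2f ms")
  }
}
```

Scalameter does this more rigorously (it also accounts for GC pauses), so for anything serious the library is the better choice; the sketch just shows why a single cold measurement can mislead.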

Soumya Simanta
  • Thanks for your answer! I am wondering if there is something simpler that doesn't require another package? Why does my original solution not work? – Yuri Brovman Dec 02 '14 at 18:56
  • 1
    Your solution should work and give you a good idea of performance. Using Scalameter is really easy if you are using sbt as your build tool. Even Scalameter is not perfect. I believe it's better because they give some consideration to GC and JVM warming. Additionally you can execute multiple of these to get a good measure of your execution times. – Soumya Simanta Dec 02 '14 at 19:14

I found a simple answer: add the time keyword before the hadoop command when running a job. (The in-code timer didn't work because a Scalding job only builds the flow graph while the Job class is constructed; the flow actually executes afterwards, when Tool runs the job, so the nanoTime calls only measured graph construction.)

time hadoop jar myjob.jar ...
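For reference, time writes three figures to stderr when the command exits; real is the wall-clock number you want here. A sketch, with sleep standing in for the hadoop invocation so it runs without a Hadoop install:

```shell
# `time` reports when the command finishes:
#   real  wall-clock time (the job's actual running time)
#   user  CPU time spent in user-space code
#   sys   CPU time spent in the kernel
# The real invocation is `time hadoop jar myjob.jar ...`; sleep is a stand-in.
time sleep 1
```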
Yuri Brovman