
I work at a place where Scalding writes are augmented with a specific API to track dataset metadata. When converting from normal writes to these special writes, there are some intricacies with respect to Key/Value, TSV/CSV, Thrift ... datasets. I would like to verify that the binary file is identical before the conversion and after the conversion to the special API.
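For the comparison itself I expect to do a plain byte-for-byte check of the two output files, roughly like this (the two paths are placeholders, not real dataset locations):

import java.nio.file.{Files, Paths}
import java.util.Arrays

// Hypothetical helper: compares two output files byte-for-byte.
def sameBytes(before: String, after: String): Boolean =
  Arrays.equals(Files.readAllBytes(Paths.get(before)), Files.readAllBytes(Paths.get(after)))

assert(sameBytes("before.tsv", "after.tsv"))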

Since I cannot share the specific API for the metadata-inclusive writes, I will only ask: how can I write a unit test for the .write method on a TypedPipe?

import java.util.TimeZone

import cascading.flow.FlowDef
import com.twitter.scalding._

implicit val timeZone: TimeZone = DateOps.UTC
implicit val dateParser: DateParser = DateParser.default
implicit def flowDef: FlowDef = new FlowDef()
implicit def mode: Mode = Local(true)

val fileStrPath = root + "/test"

println("writing data to " + fileStrPath)

TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  // .map((x: Long) => { println(x.toString); System.out.flush(); x })
  .write(TypedTsv[Long](fileStrPath))
  .forceToDisk

The above doesn't seem to write anything to local (OSX) disk.
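My suspicion is that nothing ever runs the implicit FlowDef, so the write is only planned, never executed. A variant I am considering, assuming the Execution API available in recent Scalding versions, is:

import com.twitter.scalding._
import scala.util.{Failure, Success}

// Sketch: writeExecution builds an Execution that waitFor actually runs,
// so no external FlowDef/Mode plumbing is needed.
TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  .writeExecution(TypedTsv[Long](fileStrPath))
  .waitFor(Config.default, Local(true)) match {
    case Success(_)  => println(s"wrote $fileStrPath")
    case Failure(ex) => ex.printStackTrace()
  }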

So I wonder if I need to use a MiniDFSCluster, something like this:

import java.nio.file.Files

import com.hadoop.compression.lzo.LzopCodec
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hdfs.{DistributedFileSystem, MiniDFSCluster}
import org.junit.rules.TemporaryFolder

def setUpTempFolder: String = {
  val tempFolder = new TemporaryFolder
  tempFolder.create()
  tempFolder.getRoot.getAbsolutePath
}
val root: String = setUpTempFolder
println(s"root = $root")
val tempDir = Files.createTempDirectory(setUpTempFolder).toFile
val hdfsCluster: MiniDFSCluster = {
  val configuration = new Configuration()
  configuration.set(MiniDFSCluster.HDFS_MINIDFS_BASEDIR, tempDir.getAbsolutePath)
  configuration.set("io.compression.codecs", classOf[LzopCodec].getName)
  new MiniDFSCluster.Builder(configuration)
    .manageNameDfsDirs(true)
    .manageDataDfsDirs(true)
    .format(true)
    .build()
}
hdfsCluster.waitClusterUp()
val fs: DistributedFileSystem = hdfsCluster.getFileSystem
val rootPath = new Path(root)
fs.mkdirs(rootPath)

However, my attempts to get this MiniCluster working haven't panned out either; somehow I need to link the MiniCluster with the Scalding write.
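My current guess at the linking (untested; it assumes Scalding's Hdfs mode can be built from the mini cluster's Configuration and that the Execution API is available) is something like:

import com.twitter.scalding._

// Sketch: point Scalding's Hdfs mode at the MiniDFSCluster's configuration
// so the TypedPipe write lands on the in-process HDFS.
val hdfsMode: Mode = Hdfs(strict = true, fs.getConf)

TypedPipe
  .from(Seq[Long](1, 2, 3, 4, 5))
  .writeExecution(TypedTsv[Long](root + "/test"))
  .waitFor(Config.default, hdfsMode)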

Note: The Scalding JobTest framework for unit testing isn't going to work here, because the actual data written is sometimes wrapped in a bijection codec or set up with case class wrappers before the writes made by the metadata-inclusive write APIs.

Any ideas how I can write a local file (without using the Scalding REPL) with either Scalding alone or a MiniCluster? (If using the latter, I need a hint on how to read the file.)
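For the reading side, my current guess (untested; the part-file name is a guess) is to read back either through Scalding or straight through the FileSystem handle:

// Option 1: read back through Scalding against the same Hdfs mode.
val readBack: Iterable[Long] =
  TypedPipe
    .from(TypedTsv[Long](root + "/test"))
    .toIterableExecution
    .waitFor(Config.default, hdfsMode)
    .get

// Option 2: stream the raw bytes through the DistributedFileSystem handle.
import org.apache.hadoop.io.IOUtils
val in = fs.open(new Path(root + "/test/part-00000")) // part-file name is a guess
IOUtils.copyBytes(in, System.out, 4096, true)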

codeaperature

1 Answer


Answering my own question ... There is an example of how to use a mini cluster for exactly this purpose: reading and writing to HDFS. With it I will be able to cross-read my different writes and examine them. Here it is in the tests for Scalding's TypedParquet type.

HadoopPlatformJobTest is an extension of JobTest that uses a MiniCluster. With some hand-waving over the details in the link, the bulk of the code is this:

"TypedParquetTuple" should { "read and write correctly" in { import com.twitter.scalding.parquet.tuple.TestValues._ def toMap[T](i: Iterable[T]): Map[T, Int] = i.groupBy(identity).mapValues(_.size)

  HadoopPlatformJobTest(new WriteToTypedParquetTupleJob(_), cluster)
    .arg("output", "output1")
    .sink[SampleClassB](TypedParquet[SampleClassB](Seq("output1"))) {
      toMap(_) shouldBe toMap(values)
    }
    .run()

  HadoopPlatformJobTest(new ReadWithFilterPredicateJob(_), cluster)
    .arg("input", "output1")
    .arg("output", "output2")
    .sink[Boolean]("output2")(toMap(_) shouldBe toMap(values.filter(_.string == "B1").map(_.a.bool)))
    .run()
  }
}
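Adapting that pattern to the TypedTsv example from my question should look roughly like this (WriteLongsJob and the output path are made-up names, not part of the linked test):

import com.twitter.scalding._

// Hypothetical job: writes the Longs to whatever path --output points at.
class WriteLongsJob(args: Args) extends Job(args) {
  TypedPipe
    .from(Seq[Long](1, 2, 3, 4, 5))
    .write(TypedTsv[Long](args("output")))
}

HadoopPlatformJobTest(new WriteLongsJob(_), cluster)
  .arg("output", "outputLongs")
  .sink[Long](TypedTsv[Long]("outputLongs")) { _.toSet shouldBe Set(1L, 2L, 3L, 4L, 5L) }
  .run()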
codeaperature