I have a Scrunch Spark pipeline, and when I try to save its output to Avro format using:

data.write(to.avroFile(path))

I get the following Exception:

java.lang.ClassCastException: org.apache.crunch.types.writable.WritableType cannot be cast to org.apache.crunch.types.avro.AvroType
at org.apache.crunch.io.avro.AvroFileTarget.configureForMapReduce(AvroFileTarget.java:77) ~[crunch-core-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
at org.apache.crunch.impl.spark.SparkRuntime.monitorLoop(SparkRuntime.java:327) [crunch-spark-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
at org.apache.crunch.impl.spark.SparkRuntime.access$000(SparkRuntime.java:80) [crunch-spark-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]
at org.apache.crunch.impl.spark.SparkRuntime$2.run(SparkRuntime.java:139) [crunch-spark-0.14.0-SNAPSHOT.jar:0.14.0-SNAPSHOT]

while this instead works just fine:

data.write(to.textFile(path))

(in both cases, data is the same PCollection instance and path is a String)

I understand why that error happens: the PCollection I am trying to write belongs to the Writable type family instead of the Avro one. What is not clear to me is how Scrunch decides that my PCollection belongs to one family and not the other.
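For reference, the cast failure can be reproduced in miniature with plain Scala (toy classes of my own, not the real org.apache.crunch types): the two type families are sibling subtypes of a common PType, so casting one to the other fails at runtime, which is what AvroFileTarget.configureForMapReduce runs into:

```scala
// Toy model of the failure (hypothetical classes, not the real
// Crunch ones): WritableTypeModel and AvroTypeModel are sibling
// subtypes of PTypeModel, so casting between them fails at runtime.
sealed trait PTypeModel
final class WritableTypeModel extends PTypeModel
final class AvroTypeModel extends PTypeModel

object CastDemo {
  // Mimics an Avro target that unconditionally casts the incoming
  // type to the Avro flavour.
  def configure(pt: PTypeModel): AvroTypeModel =
    pt.asInstanceOf[AvroTypeModel]

  def main(args: Array[String]): Unit = {
    try {
      configure(new WritableTypeModel)
    } catch {
      // Same failure mode as the stack trace above.
      case _: ClassCastException => println("ClassCastException")
    }
  }
}
```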

This mechanism, however, seems clearer in Crunch itself. According to the official Crunch documentation:

Crunch supports two different type families, which each implement the PTypeFamily interface: one for Hadoop's Writable interface and another based on Apache Avro. There are also classes that contain static factory methods for each PTypeFamily to allow for easy import and usage: one for Writables and one for Avros.

And then:

For most of your pipelines, you will use one type family exclusively, and so you can cut down on some of the boilerplate in your classes by importing all of the methods from the Writables or Avros classes into your class

import static org.apache.crunch.types.avro.Avros.*;

In fact, the examples in the official Crunch repo make this explicit. See the following snippet from the WordCount example:

PCollection<String> lines = pipeline.readTextFile(args[0]);

PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      emitter.emit(word);
    }
  }
}, Writables.strings()); // Indicates the serialization format

PTable<String, Long> counts = words.count();

While the equivalent Scrunch version goes like this:

def countWords(file: String) = {
  read(from.textFile(file))
    .flatMap(_.split("\\W+").filter(!_.isEmpty()))
    .count
}

And no explicit (nor, as far as I can see, implicit) reference to the Writable type family is made.
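My guess is that Scrunch picks the PType through Scala implicit resolution rather than an explicit argument. As a toy model of how that could work (made-up names, nothing from the real Scrunch API), a default type family can live in a typeclass companion object and be overridden by an instance brought into local scope:

```scala
// Toy typeclass standing in for Scrunch's per-element-type PType
// lookup (hypothetical names, not the real Scrunch classes).
trait PTypeOf[T] { def family: String }

object PTypeOf {
  // Instances in the companion object are found automatically by
  // implicit search -- analogous to a baked-in Writable default.
  implicit val defaultStrings: PTypeOf[String] =
    new PTypeOf[String] { val family = "writable" }
}

object FamilyDemo {
  // A parallelDo-like method that picks up the PType implicitly,
  // so call sites never have to mention the family.
  def inferredFamily[T](implicit pt: PTypeOf[T]): String = pt.family

  def main(args: Array[String]): Unit = {
    // Nothing in local scope, so the companion default wins.
    println(inferredFamily[String]) // prints "writable"

    locally {
      // A locally visible instance takes precedence, the way
      // importing Avro-flavoured implicits might.
      implicit val avroStrings: PTypeOf[String] =
        new PTypeOf[String] { val family = "avro" }
      println(inferredFamily[String]) // prints "avro"
    }
  }
}
```

If Scrunch works this way, the answer to "where is the family chosen?" would be "in whatever implicit instances are in scope at the call site", but I have not found where that default is defined.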

So how does Scrunch decide which type family to use? Is it based on the default of the original input Source (e.g. reading from a text file implies Writable, reading from an Avro file implies Avro)? If so, how could I read from a source and write to a target belonging to a different type family in Scrunch?

asked by djsecilla