
edit 2

I indirectly solved the problem by repartitioning the RDD into 8 partitions. I then hit a roadblock with the Avro objects not being Java-serialisable, and found a snippet here to delegate Avro serialisation to Kryo. The original problem still remains.
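For reference, the general shape of what I ended up with is roughly this (a sketch, not the exact snippet I found; ForumKryoRegistrator is just an illustrative name, Topic and Post are my generated Avro classes, and ds is the RDD loaded in sparkNLPTransformation below):

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.serializer.KryoRegistrator

// Register the generated Avro classes with Kryo so Spark does not fall back
// to Java serialisation for them.
class ForumKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[Topic])
    kryo.register(classOf[Post])
  }
}

val conf = new SparkConf()
  .setMaster("local[8]")
  .setAppName("forumAddNlp")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "ForumKryoRegistrator") // fully-qualified name in real code
val sc = new SparkContext(conf)

// Spread the loaded data over 8 partitions.
val ds8 = ds.repartition(8)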

edit 1: Removed local variable reference in map function

I'm writing a driver to run a compute-heavy job on Spark, using Parquet and Avro for I/O and the schema. I can't seem to get Spark to use all my cores. What am I doing wrong? Is it because I have set the keys to null?

I am just getting my head around how Hadoop organises files. AFAIK, since my file holds a gigabyte of raw data, I should expect to see things parallelise with the default block and page sizes.
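My mental model (which may well be where I'm going wrong) is that each Parquet row group becomes one input split, so a file this size should produce several splits at the default 128 MB block size. If that is the knob, I assume it has to be turned at write time, roughly like this (a sketch only; package names depend on the parquet-mr version):

import org.apache.hadoop.fs.Path
import parquet.avro.AvroParquetWriter
import parquet.hadoop.metadata.CompressionCodecName

// Smaller row groups ("blocks") should mean more input splits, and therefore
// more tasks for Spark to run in parallel on read.
val blockSize = 64 * 1024 * 1024 // 64 MB row groups instead of the 128 MB default
val pageSize = 1 * 1024 * 1024   // 1 MB pages (the default)

val writer = new AvroParquetWriter[Topic](
  new Path("posts.parq"),
  Topic.getClassSchema,
  CompressionCodecName.SNAPPY,
  blockSize,
  pageSize)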

The function to ETL my input for processing looks as follows:

def genForum {
    class MyWriter extends AvroParquetWriter[Topic](new Path("posts.parq"), Topic.getClassSchema) {
      override def write(t: Topic) {
        synchronized {
          super.write(t)
        }
      }
    }

    def makeTopic(x: ForumTopic): Topic = {
      // Omitted to save space
    }

    val writer = new MyWriter

    val q =
      DBCrawler.db.withSession {
        Query(ForumTopics).filter(x => x.crawlState === TopicCrawlState.Done).list()
      }

    val sz = q.size
    val c = new AtomicInteger(0)

    q.par.foreach {
      x =>
        writer.write(makeTopic(x))
        val count = c.incrementAndGet()
        print(f"\r${count.toFloat * 100 / sz}%4.2f%%")
    }
    writer.close()
  }

And my transformation looks as follows:

def sparkNLPTransformation() {
    val sc = new SparkContext("local[8]", "forumAddNlp")

    // io configuration
    val job = new Job()
    ParquetInputFormat.setReadSupportClass(job, classOf[AvroReadSupport[Topic]])
    ParquetOutputFormat.setWriteSupportClass(job,classOf[AvroWriteSupport])
    AvroParquetOutputFormat.setSchema(job, Topic.getClassSchema)


    // configure annotator
    val props = new Properties()
    props.put("annotators", "tokenize,ssplit,pos,lemma,parse")
    val an = DAnnotator(props)


    // annotator function
    def annotatePosts(ann : DAnnotator, top : Topic) : Topic = {
      val new_p = top.getPosts.map{ x=>
        val at = new Annotation(x.getPostText.toString)
        ann.annotator.annotate(at)
        val t = at.get(classOf[SentencesAnnotation]).map(_.get(classOf[TreeAnnotation])).toList

        val r = SpecificData.get().deepCopy[Post](x.getSchema,x)
        if(t.nonEmpty) r.setTrees(t)
        r
      }
      val new_t = SpecificData.get().deepCopy[Topic](top.getSchema,top)
      new_t.setPosts(new_p)
      new_t
    }

    // transformation
    val ds = sc.newAPIHadoopFile("forum_dataset.parq", classOf[ParquetInputFormat[Topic]], classOf[Void], classOf[Topic], job.getConfiguration)
    val new_ds = ds.map(x=> ( null, annotatePosts(x._2) ) )

    new_ds.saveAsNewAPIHadoopFile("annotated_posts.parq",
      classOf[Void],
      classOf[Topic],
      classOf[ParquetOutputFormat[Topic]],
      job.getConfiguration
    )
  }
Hassan Syed
  • I've reached the conclusion that it's not currently possible to split Parquet files using Spark, and one must use a Hadoop job to split the files by setting the number of reducers (which can be quite fast, but is a horrible hack). I asked a similar question: http://stackoverflow.com/questions/27194333/how-to-split-parquet-files-into-many-partitions-in-spark – samthebest Dec 03 '14 at 17:06

1 Answer


Can you confirm that the data is indeed in multiple blocks in HDFS? What is the total block count on the forum_dataset.parq file?
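If the file is on HDFS, `hdfs fsck forum_dataset.parq -files -blocks` will list the blocks for it. You can also check, from the Spark side, how many partitions the RDD actually got, for example (assuming `ds` from your snippet):

// Number of input splits Spark created for the Parquet file.
println(ds.partitions.size)

If that number is 1 (or very small), the job cannot fan out no matter how many cores you give it.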

Andrew
  • Thanks for replying <3. Hmm, so the data has to be in an HDFS file system even when running locally? I am looking around for how best to confirm the block size for you. – Hassan Syed Feb 02 '14 at 17:40
  • Also, I started toying around with repartitioning, and from the console spew this seems to be fanning out the work (though I think I need to write custom serialisation code for Kryo for my Avro objects, since nothing is happening). My code is here: https://gist.github.com/hsyed/8771986 – Hassan Syed Feb 02 '14 at 17:45
  • From the console output I guess it is 4. `14/02/02 17:47:41 INFO rdd.NewHadoopRDD: Input split: ParquetInputSplit{part: file:///Users/hassan/code/scala/avro/forum_dataset.parq start: 0 length: 1023817737 hosts: [localhost] blocks: 4 requestedSchema: same as file fileSchema: message forumavroschema.Topic ` – Hassan Syed Feb 02 '14 at 17:50