
I'm trying to read an RDF/XML file into Apache Spark (Scala 2.11, Apache Spark 1.4.1) using Apache Jena. I wrote this Scala snippet:

val factory = new RdfXmlReaderFactory()
HadoopRdfIORegistry.addReaderFactory(factory)
val conf = new Configuration()
conf.set("rdf.io.input.ignore-bad-tuples", "false")
val data = sc.newAPIHadoopFile(path,
    classOf[RdfXmlInputFormat],
    classOf[LongWritable], //position
    classOf[TripleWritable],   //value
    conf)
data.take(10).foreach(println)
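
(Once the read works, the intent is just to unwrap the Jena `Triple` values from the resulting pair RDD; a minimal sketch using the Elephas `get` accessor on `TripleWritable`:)

// unwrap the Jena Triple wrapped by each TripleWritable
val triples = data.map { case (_, tw) => tw.get }
triples.take(10).foreach(t => println(t.getSubject))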

But it throws an error:

INFO readers.AbstractLineBasedNodeTupleReader: Got split with start 0 and length 21765995 for file with total length of 21765995
15/07/23 01:52:42 ERROR readers.AbstractLineBasedNodeTupleReader: Error parsing whole file, aborting further parsing
org.apache.jena.riot.RiotException: Producer failed to ever call start(), declaring producer dead
        at org.apache.jena.riot.lang.PipedRDFIterator.hasNext(PipedRDFIterator.java:272)
        at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:242)
        at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
   ...
ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Error parsing whole file at position 0, aborting further parsing
        at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:285)
        at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
        at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
        at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
        at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)

The file is fine, because I can parse it locally. What am I missing?
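
(By "parse it locally" I mean plain Jena without Hadoop, roughly like this; a minimal sketch against the legacy com.hp.hpl.jena API from my dependency list, with `localPath` as a placeholder:)

import java.io.FileInputStream
import com.hp.hpl.jena.rdf.model.ModelFactory

val model = ModelFactory.createDefaultModel()
model.read(new FileInputStream(localPath), null) // Model.read defaults to RDF/XML
println(model.size())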

EDIT: Some information to reproduce the behaviour

Imports:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.jena.hadoop.rdf.io.registry.HadoopRdfIORegistry
import org.apache.jena.hadoop.rdf.io.registry.readers.RdfXmlReaderFactory
import org.apache.jena.hadoop.rdf.types.TripleWritable
import org.apache.spark.SparkContext

scalaVersion := "2.11.7"

dependencies:

"org.apache.hadoop"             % "hadoop-common"      % "2.7.1",
"org.apache.hadoop"             % "hadoop-mapreduce-client-common" % "2.7.1",
"org.apache.hadoop"             % "hadoop-streaming"   % "2.7.1", 
"org.apache.spark"              % "spark-core_2.11"  % "1.4.1", 
"com.hp.hpl.jena"               % "jena"               % "2.6.4",
"org.apache.jena"               % "jena-elephas-io"    % "0.9.0",
"org.apache.jena"               % "jena-elephas-mapreduce" % "0.9.0"

I'm using the sample RDF from here. It's freely available information about John Peel sessions (more info about the dump).

Nikita
  • What do you mean, you can "parse it locally"? – Marius Soutier Jul 23 '15 at 07:25
  • The fact that you get that specific error implies that parsing was never started, this is most likely because the file was not accessible. You haven't told us the value of `path` but that would be the first thing to check – RobV Jul 23 '15 at 08:17
  • @MariusSoutier I can read the same file from an InputStream. – Nikita Jul 23 '15 at 08:42
  • @RobV If the file doesn't exist I get `org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist:`, but this is a different case. Here it starts to read the file: `Got split with start 0 and length 21765995` – Nikita Jul 23 '15 at 09:03
  • Well, without further information it is impossible to determine what the issue is: your code is incomplete (no import statements) and has no value for `path`, so it can't be run as-is, especially since we don't have your data. The same code (with appropriate import statements) works in my environment, so there is likely some issue with your environment (e.g. Spark version) or your data, but it's impossible to say as it stands – RobV Jul 23 '15 at 09:05
  • @RobV Thank you for your time, anyway. I added the library deps and a link to an RDF/XML example file. – Nikita Jul 23 '15 at 09:48
  • Can you try reducing the file to a single line, i.e. remove all `\n`? – Marius Soutier Jul 23 '15 at 10:16
  • @MariusSoutier Thanks for the good guess, but it didn't help :(. As I understand it, the problem is with the input stream at position 0. It looks like the underlying PipedRDFIterator can't proceed through the stream. – Nikita Jul 23 '15 at 10:53

2 Answers


So it appears your problem was down to manually managing your dependencies.

In my environment I was simply passing the following to my Spark shell:

--packages org.apache.jena:jena-elephas-io:0.9.0

This does all the dependency resolution for you.
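
For example, the full invocation is just something like:

spark-shell --packages org.apache.jena:jena-elephas-io:0.9.0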

If you are building an SBT project then it should be sufficient to do the following in your `build.sbt`:

libraryDependencies += "org.apache.jena" % "jena-elephas-io" % "0.9.0"
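
Put differently, a minimal build.sbt along these lines should be enough (a sketch: jena-elephas-io pulls in jena-core and the rest of its Jena dependencies transitively, and spark-core already drags in the Hadoop client classes):

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.4.1",
  "org.apache.jena"  %  "jena-elephas-io" % "0.9.0"
)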
RobV

Thanks all for the discussion in the comments. The problem was really tricky and not clear from the stack trace: the code needs one extra dependency to work, jena-core, and this dependency must be packaged first.

"org.apache.jena" % "jena-core" % "2.13.0"
"com.hp.hpl.jena" % "jena"      % "2.6.4"

I use this assembly strategy:

lazy val strategy = assemblyMergeStrategy in assembly <<= (assemblyMergeStrategy in assembly) { (old) => {
  // discard everything under META-INF (note both branches of the inner match discard)
  case PathList("META-INF", xs @ _*) =>
    (xs map {_.toLowerCase}) match {
      case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) => MergeStrategy.discard
      case _ => MergeStrategy.discard
    }
  // for everything else keep the first copy found on the classpath
  case x => MergeStrategy.first
}
}
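
(This assumes the sbt-assembly plugin is enabled in the build, e.g. in `project/plugins.sbt` with a 2015-era version:)

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.13.0")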
Nikita
  • Please don't mix Jena 2.6.4 and Jena 2.13.0: the former is from 2010 while the latter is from 2015, and mixing the two is likely to lead to nasty things happening – RobV Jul 23 '15 at 15:46