Long time lurker, first time poster. Please let me know if my question is not clear.
I have a rather strange XML file that needs to be parsed (the data goes into classes and is handled internally). I call it strange because elements that anyone would normally expect to be nested are not. Here is an example:
<root>
<item id="XX" rank="YY">
<top>
<description a="XX"><s>***</s>content<s>+++</s></description>
<mainterm t="XX">term</mainterm>
<description a="YY">more content</description>
</top>
<!-- All examples directly below a data correspond to that data.
Shouldn't they be nested? -->
<data level="10" x="4">data here </data>
<example f="45"> example 1</example>
<example f="12"> example 2</example>
<example f="44"> example 3</example>
<data level="11" x="1">data here </data>
<example f="33"> example 1</example>
<example f="6"> example 2</example>
<example f="18"> example 3</example>
<!-- More data tags with and without examples below -->
</item>
</root>
The file continues with tens of thousands of items. Some items contain no data tags at all, and some data tags are empty.
I was given complete freedom about how to parse this, and since I'm trying to master Scala, I chose it for the task. Since I have used StAX (Apache AXIOM) for pull parsing in the past, I looked for something similar in Scala and found Scales. So far, so good.
Using:
- Scala 2.10.2
- Scales 0.4.5
What I need is not only the contents and attributes of each tag, but also each tag's raw content. For instance, in the above XML I would have something like the following for the "top" tag:
case class Top(descriptions:List[Description], key: Key, rawContent: String)
where rawContent would be:
<description a="XX"><s>***</s>content<s>+++</s></description>
<mainterm t="XX">term</mainterm>
<description a="YY">more content</description>
The same applies to the "data" tag. But since the data is not nested, and since pull parsing gives you an XmlPull, which is just an Iterator[PullType], my idea was to parse a tag by traversing the nodes until I find its closing tag or, in the case of the "data" tag, until I find another "data" start tag or the "item" end tag. However, no matter how I approach the problem, I can't avoid saving state.
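To make the traversal idea concrete, here is a minimal sketch of it as a fold, where the "saved state" lives in the accumulator instead of in mutable variables. It deliberately does not use Scales: `Ev`, `Start`, `Text`, `End`, `Data`, `Example` and `GroupFlatData` are hypothetical, simplified stand-ins for the pull events and the target classes, so this only illustrates the grouping logic.

```scala
// Hypothetical, simplified stand-ins for pull events (NOT the Scales types).
sealed trait Ev
case class Start(name: String, attrs: Map[String, String]) extends Ev
case class Text(value: String) extends Ev
case class End(name: String) extends Ev

// Hypothetical target classes for one <data> and the <example>s that follow it.
case class Example(f: String, text: String)
case class Data(attrs: Map[String, String], text: String, examples: Vector[Example])

object GroupFlatData {
  // Fold state: finished groups, the group being built, the currently open element.
  private case class Acc(done: Vector[Data], current: Option[Data], open: Option[Start])

  def group(events: Seq[Ev]): Vector[Data] = {
    val end = events.foldLeft(Acc(Vector.empty, None, None)) {
      // A new <data> closes the previous group (if any) and starts a new one.
      case (acc, s @ Start("data", attrs)) =>
        Acc(acc.done ++ acc.current.toList, Some(Data(attrs, "", Vector.empty)), Some(s))
      // An <example> is attached to the group currently being built.
      case (acc, s @ Start("example", attrs)) =>
        val ex = Example(attrs.getOrElse("f", ""), "")
        acc.copy(current = acc.current.map(d => d.copy(examples = d.examples :+ ex)),
                 open = Some(s))
      // Text goes to whichever element is open; text between elements is ignored.
      case (acc, Text(t)) =>
        (acc.open, acc.current) match {
          case (Some(Start("data", _)), Some(d)) =>
            acc.copy(current = Some(d.copy(text = d.text + t)))
          case (Some(Start("example", _)), Some(d)) if d.examples.nonEmpty =>
            val last = d.examples.last
            val updated = d.examples.init :+ last.copy(text = last.text + t)
            acc.copy(current = Some(d.copy(examples = updated)))
          case _ => acc
        }
      case (acc, End("data") | End("example")) => acc.copy(open = None)
      // The end of the <item> closes the last group.
      case (acc, End("item")) => Acc(acc.done ++ acc.current.toList, None, None)
      case (acc, _) => acc
    }
    end.done ++ end.current.toList
  }
}
```

With real pull events, the same shape would presumably apply by pattern matching on Left(Elem), Left(Text) and Right(EndElem) instead of the stand-in types.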
I decided to try with Zippers.
First of all, since I need to traverse until I find a given tag while doing something with each element I find along the way, I'm trying findBy. Below is my current code. It tries to retrieve the given tag's attributes and the raw content of everything inside it.
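import java.io.FileReader
import scala.collection.mutable.ListBuffer
import scales.utils._
import ScalesUtils._
import scales.xml._
import ScalesXml._
import scalaz._
import Scalaz._
/* Some helpers. Ignore */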
class PullTypeValue(pt: PullType) {
private val NUM_DEL_CHARS = 2
// Tag names are returned like "{}tagname". Getting rid of the first 2 characters
private def stripHeadChars(s: String) = s.substring(NUM_DEL_CHARS)
// Get the tag name or the value of the given PullType
def getNameOrValue = pt match {
case Left(e:Elem) => stripHeadChars(e.name.toString)
case Left(i:XmlItem) => i.value
case Right(e) => stripHeadChars(e.name.toString)
}
}
class PullTypeZipper(z: Zipper[PullType]) {
implicit def toPullTypeValue(e: Elem) = new PullTypeValue(e)
def moveToTag(tag: String) = {
z.findNext(_ match {
case Left(e:Elem) => e.getNameOrValue == tag
case _ => false
})
}
}
implicit def toPulltTypeValue(pt: PullType) = new PullTypeValue(pt)
implicit def toPullTypeValue(e: Elem) = new PullTypeValue(e)
implicit def toPullTypeValue(i: XmlItem) = new PullTypeValue(i)
implicit def toPullTypeValue(e: EndElem) = new PullTypeValue(e)
implicit def toPullTypeZipper(z: Zipper[PullType]) = new PullTypeZipper(z)
/* End of helpers */
/************* Parsing function here *******************/
def parseTag(currentNode: Option[Zipper[PullType]], currentTagName: String) = {
var attrs: Map[String,String] = Map.empty
val ltags = ListBuffer[String]()
val getAttributes = (z: Zipper[PullType]) =>
z.focus match {
case Left(e:Elem) if e.getNameOrValue == currentTagName =>
attrs = e.attributes.map {a => (a.name.toString.substring(2), a.value)}.toMap
ltags += "<" + e.getNameOrValue + ">"
z.next
case Left(e:Elem) =>
ltags += "<" + e.getNameOrValue + ">"
z.next
case Left(t:Text) =>
ltags += t.value
z.next
case Left(i:XmlItem) =>
ltags += i.value
z.next
case Right(e) =>
ltags += "</" + e.getNameOrValue + ">"
(e.getNameOrValue == currentTagName) ? z.some | z.next
}
/* Traverse until finding the close tag for the given tag name
and extract raw contents from each found tag.
Return the zipper with focus on the next element (if any)
*/
val nextNode = currentNode >>= {_.findBy(getAttributes)(_ match {
case Right(e) => e.getNameOrValue == currentTagName
case _ => false
})} >>= {_.next}
(attrs,ltags.mkString(""),nextNode)
}
/************** End of parsing function ************************/
val zipper = pullXml(new FileReader("MyXmlFile.xml")).toStream.toZipper
val (attrs,rawContents,nextNode) = parseTag(zipper >>= {_.moveToTag("top")}, "top")
// Do something with the values...
The code works for the "top" tag, but if I try it with the "item" tag, I get a StackOverflowError:
Exception in thread "main" java.lang.StackOverflowError
at com.ctc.wstx.util.SymbolTable.size(SymbolTable.java:332)
at com.ctc.wstx.util.SymbolTable.mergeChild(SymbolTable.java:291)
at com.ctc.wstx.stax.WstxInputFactory.updateSymbolTable(WstxInputFactory.java:202)
at com.ctc.wstx.sr.BasicStreamReader.close(BasicStreamReader.java:1179)
at scales.xml.XmlPulls$$anon$1.close(XmlPull.scala:134)
at scales.xml.XmlPulls$$anon$1.internalClose(XmlPull.scala:130)
at scales.xml.XmlPull$class.pumpEvent(PullIterator.scala:201)
at scales.xml.XmlPulls$$anon$1.pumpEvent(XmlPull.scala:118)
at scales.xml.XmlPull$class.next(PullIterator.scala:149)
at scales.xml.XmlPulls$$anon$1.next(XmlPull.scala:118)
at scales.xml.XmlPulls$$anon$1.next(XmlPull.scala:118)
at scala.collection.Iterator$class.toStream(Iterator.scala:1143)
at scales.xml.XmlPulls$$anon$1.toStream(XmlPull.scala:118)
at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at scala.collection.Iterator$$anonfun$toStream$1.apply(Iterator.scala:1143)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1085)
at scala.collection.immutable.Stream$Cons.tail(Stream.scala:1077)
at scala.collection.immutable.Stream$$hash$colon$colon$.unapply(Stream.scala:1058)
at scalaz.Zipper$class.next(Zipper.scala:45)
at scalaz.Zippers$$anon$1.next(Zipper.scala:269)
at parser.XMLParser$$anonfun$6$$anonfun$apply$7.apply(XMLParser.scala:258)
at parser.XMLParser$$anonfun$6$$anonfun$apply$7.apply(XMLParser.scala:258)
at scalaz.BooleanW$$anon$1.$bar(BooleanW.scala:142)
at parser.XMLParser$$anonfun$6.apply(XMLParser.scala:258)
at parser.XMLParser$$anonfun$6.apply(XMLParser.scala:237)
at scalaz.Zipper$class.findBy(Zipper.scala:178)
at scalaz.Zippers$$anon$1.findBy(Zipper.scala:269)
at scalaz.Zipper$$anonfun$findBy$1.apply(Zipper.scala:178)
at scalaz.Zipper$$anonfun$findBy$1.apply(Zipper.scala:178)
at scala.Option.flatMap(Option.scala:170)
at scalaz.Bind$$anon$21.bind(Bind.scala:112)
at scalaz.Bind$$anon$21.bind(Bind.scala:111)
at scalaz.MA$class.$greater$greater$eq(MA.scala:73)
at scalaz.MAsLow$$anon$2.$greater$greater$eq(MAB.scala:50)
at scalaz.MASugar$class.$u2217(MA.scala:329)
at scalaz.MAsLow$$anon$2.$u2217(MAB.scala:50)
at scalaz.Zipper$class.findBy(Zipper.scala:178)
at scalaz.Zippers$$anon$1.findBy(Zipper.scala:269)
at scalaz.Zipper$$anonfun$findBy$1.apply(Zipper.scala:178)
at scalaz.Zipper$$anonfun$findBy$1.apply(Zipper.scala:178)
at scala.Option.flatMap(Option.scala:170)
at scalaz.Bind$$anon$21.bind(Bind.scala:112)
at scalaz.Bind$$anon$21.bind(Bind.scala:111)
at scalaz.MA$class.$greater$greater$eq(MA.scala:73)
at scalaz.MAsLow$$anon$2.$greater$greater$eq(MAB.scala:50)
at scalaz.MASugar$class.$u2217(MA.scala:329)
at scalaz.MAsLow$$anon$2.$u2217(MAB.scala:50)
at scalaz.Zipper$class.findBy(Zipper.scala:178)
at scalaz.Zippers$$anon$1.findBy(Zipper.scala:269)
at scalaz.Zipper$$anonfun$findBy$1.apply(Zipper.scala:178)
at scalaz.Zipper$$anonfun$findBy$1.apply(Zipper.scala:178)
... and so on
After a bit of research (I'm not sure it's relevant), I found that Scales uses Scalaz 6.0.4, where Zipper.findBy is not tail-recursive, while in Scalaz 7 it is (or at least the inner function it uses is). But if I change the dependency to 7.0.4, Scales fails with a lot of errors because of the Iteratee changes between Scalaz 6 and 7 (some references are no longer in the same place).
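To illustrate why tail recursion would matter here, assuming the overflow really comes from the recursion depth of findBy as the trace suggests, below is a minimal sketch of a search that advances one element at a time in constant stack space. `TailRecFind` and `findFrom` are hypothetical names, not part of Scalaz or Scales, and the sketch works on a plain Iterator rather than on a Zipper.

```scala
import scala.annotation.tailrec

object TailRecFind {
  // Hypothetical helper: advances the iterator until an element satisfies p.
  // @tailrec makes the compiler reject the method if the recursive call is
  // not in tail position, so the stack depth stays constant however many
  // elements are skipped.
  @tailrec
  def findFrom[A](it: Iterator[A])(p: A => Boolean): Option[A] =
    if (!it.hasNext) None
    else {
      val a = it.next()
      if (p(a)) Some(a) else findFrom(it)(p)
    }
}
```

Since an XmlPull is an Iterator[PullType], a helper of this shape could in principle be applied to it directly, at the cost of losing the zipper's ability to move backwards.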
My questions:
- Am I overcomplicating the whole process? Is there a much simpler approach I should be taking for this task?
- If I were to continue doing this the way described, is there something I should take into consideration? Is there a way to use Scales with Scalaz 7?
Remarks:
- Strong imperative programming background, especially in Java.
- I have worked with Scala before, but I often have to fall back to an imperative way of doing things because I get stuck (like this time) and it's time-consuming.
- I have not worked with Scalaz before. My knowledge of functional programming is basic, but I'm more than happy to learn new stuff, and I like functional programming.