2

I have XML that I'm trying to query with the Scala XML API, using XPath-style queries to retrieve data from the tags. I want to retrieve the <price> value from <market>, selecting on two attributes, _id and type. I want to combine the two conditions with && so that I get a unique value for each price tag, e.g. where market _id = 1 && type = "A".

For reference, the XML is below:

<publisher>
    <book _id = "0"> 
        <author _id="0">Dev</author>
        <publish_date>24 Feb 1995</publish_date>
        <description>Data Structure - C</description>
        <market _id="0" type="A">
            <price>45.95</price>            
        </market>
        <market _id="0" type="B">
            <price>55.95</price>
        </market>
    </book>
    <book _id="1"> 
        <author _id = "1">Ram</author>
        <publish_date>02 Jul 1999</publish_date>
        <description>Data Structure - Java</description>
        <market _id="1" type="A">
            <price>145.95</price>           
        </market>   
        <market _id="1" type="B">
            <price>155.95</price>           
        </market>
    </book>
</publisher>

The following code runs, but it filters only on _id and not on type:

import scala.xml._

object XMLtoCSV extends App {

  val xmlLoad = XML.loadFile("C:/Users/sharprao/Desktop/FirstTry.xml")  

  val price = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "0")}) \ "market" filter { _ \ "@_id" exists (_.text == "0")}) \ "price").text  // want 45.95, but get 45.9555.95
  val price1 = (((xmlLoad \ "book" filter { _ \ "@_id" exists (_.text == "1")}) \ "market" filter { _ \ "@_id" exists (_.text == "1")}) \ "price").text  // want 155.95, but get 145.95155.95

  println("price = " + price)
  println("price1 = " + price1)
} 

The output is:

price = 45.9555.95
price1 = 145.95155.95

My code above returns both values for each book because I can't work out how to combine the two conditions with &&.

  1. Please advise what Scala function I can use other than filter.
  2. Also let me know how to get all the attribute names.
  3. If possible, please let me know where I can read about all these APIs.

Thanks in advance.

ashawley
Pardeep Sharma
3 Answers

2

You could write a custom predicate to check multiple attributes:

def checkMarket(marketId: String, marketType: String)(node: Node): Boolean = {
  node.attribute("_id").exists(_.text == marketId) &&
  node.attribute("type").exists(_.text == marketType)
}

Then use it as a filter:

val price1 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "0"))) \ "market" filter checkMarket("0", "A")) \ "price").text
// 45.95

val price2 = (((xmlLoad \ "book" filter (_ \ "@_id" exists (_.text == "1"))) \ "market" filter checkMarket("1", "B")) \ "price").text
// 155.95
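
If a separate named function per tag is undesirable, the same idea can be sketched either with an inline && predicate or with one generic helper that takes the required attributes as a Map. This is only a sketch under assumptions: `AttrFilterSketch`, `hasAttrs`, and the inlined `doc` literal (a trimmed copy of the question's XML) are hypothetical names introduced for illustration. It relies on scala-xml's `\@` and `attributes.asAttrMap`, the latter being one way to list the attribute names of an element.

```scala
import scala.xml._

object AttrFilterSketch {
  // Trimmed copy of the question's XML, inlined so the sketch is self-contained.
  val doc: Elem =
    <publisher>
      <book _id="0">
        <market _id="0" type="A"><price>45.95</price></market>
        <market _id="0" type="B"><price>55.95</price></market>
      </book>
      <book _id="1">
        <market _id="1" type="A"><price>145.95</price></market>
        <market _id="1" type="B"><price>155.95</price></market>
      </book>
    </publisher>

  // Hypothetical helper: true when the node carries every required attribute/value pair.
  // Note that `\@` returns "" for a missing attribute.
  def hasAttrs(required: Map[String, String])(node: Node): Boolean =
    required.forall { case (name, value) => (node \@ name) == value }

  val markets: NodeSeq = doc \ "book" \ "market"

  // Inline predicate with &&, no helper function needed.
  val inlinePrice: String =
    (markets.filter(m => (m \@ "_id") == "1" && (m \@ "type") == "A") \ "price").text

  // The same selection through the generic helper.
  val helperPrice: String =
    (markets.filter(hasAttrs(Map("_id" -> "1", "type" -> "A"))) \ "price").text

  // Attribute names of the first <market> element.
  val attrNames: Set[String] = markets.head.attributes.asAttrMap.keySet

  def main(args: Array[String]): Unit = {
    println(inlinePrice)
    println(helperPrice)
    println(attrNames)
  }
}
```

Because the helper takes a Map, one function can cover tags with anywhere from one to six attributes; only the Map passed in changes.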
Jeffrey Chung
  • I appreciate your solution, but can we do it without writing a function? Is there any Scala function that fits this scenario? – Pardeep Sharma Aug 18 '17 at 06:01
  • One more thing: I have shared a sample XML with you, but my real XML is very big, with almost 200 tags. That would mean writing 200 functions, because the attributes differ from tag to tag, with anywhere from one to six attributes. I think I would have to write 6 functions and change the parameters. – Pardeep Sharma Aug 18 '17 at 06:17
  • @PardeepSharma Ask another question with a sample of some of the tags. – ashawley Aug 25 '17 at 21:02
1

This would be the way to write it if you are interested in getting a CSV file of your data:

(xmlLoad \ "book").flatMap { bk =>
  (bk \ "market").flatMap { mkt =>
    (mkt \ "price").map { p =>
      Seq(
        bk \@ "_id",
        mkt \@ "_id",
        mkt \@ "type",
        p.text.toFloat
      )
    }
  }
}.map { cols =>
  cols.mkString("\t")
}.foreach { 
  println
}

It will output the following:

0       0       A       45.95
0       0       B       55.95
1       1       A       145.95
1       1       B       155.95

A common pattern to recognize when writing Scala is that most flatMap ... flatMap ... map chains can be rewritten as for-comprehensions:

for {
    book <- xmlLoad \ "book"
    market <- book \ "market"
    price <- market \ "price"
} yield {
  val cols = Seq(
    book \@ "_id",
    market \@ "_id",
    market \@ "type",
    price.text.toFloat
  )
  println(cols.mkString("\t"))
}
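
A for-comprehension also accepts `if` guards, which is one way to express the && condition from the question. The following is a minimal sketch, assuming the same XML as in the question; `GuardSketch` and the inlined `doc` literal are hypothetical names introduced for illustration.

```scala
import scala.xml._

object GuardSketch {
  // Trimmed copy of the question's XML, inlined so the sketch is self-contained.
  val doc: Elem =
    <publisher>
      <book _id="0">
        <market _id="0" type="A"><price>45.95</price></market>
        <market _id="0" type="B"><price>55.95</price></market>
      </book>
      <book _id="1">
        <market _id="1" type="A"><price>145.95</price></market>
        <market _id="1" type="B"><price>155.95</price></market>
      </book>
    </publisher>

  // The guard keeps only markets whose _id and type both match.
  val prices: Seq[String] = for {
    book       <- doc \ "book"
    market     <- book \ "market"
    marketId    = market \@ "_id"
    marketType  = market \@ "type"
    if marketId == "1" && marketType == "A"
    price      <- market \ "price"
  } yield price.text

  def main(args: Array[String]): Unit =
    println(prices.mkString(",")) // prints 145.95
}
```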
ashawley
-1

I used Spark with a HiveContext, which let me evaluate the XPath expression directly:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object xPathReader extends App{

    System.setProperty("hadoop.home.dir","D:\\IBM\\DB\\Hadoop\\winutils")   // Path for my winutils.exe

    val sparkConf = new SparkConf().setAppName("XMLParcing").setMaster("local[2]")
    val sc = new SparkContext(sparkConf)
    val hiveContext = new HiveContext(sc)
    val myXmlPath = "D:\\IBM\\DB\\xml"
    val xmlRDDList = XmlFileUtil.withCharset(sc, myXmlPath, "UTF-8", "publisher") // XmlFileUtil is a Java wrapper I wrote, because XmlFile.withCharset is private and could not be called from Scala.

    import hiveContext.implicits._

    val xmlDf = xmlRDDList.toDF("tempXMLTable")
    xmlDf.registerTempTable("tempTable")

    hiveContext.sql("select xpath_string(tempXMLTable,\"/book/@_id\") as BookId, xpath_float(tempXMLTable,\"/book/market[@_id='1' and @type='B']/price\") as Price from tempTable").show()      

    /*  Output
        +------+------+
        |BookId| Price|
        +------+------+
        |     0| 55.95|
        |     1|155.95|
        +------+------+
    */
}
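
For comparison, the same kind of XPath predicate with `and` can be evaluated without Spark through the JDK's `javax.xml.xpath` API, which is callable from Scala. This is a sketch under assumptions, not part of the approach above: `XPathSketch`, `priceFor`, and the inlined `xml` string (a trimmed copy of the question's XML) are hypothetical names introduced for illustration, and interpolating values into the expression is only reasonable for trusted input.

```scala
import java.io.StringReader
import javax.xml.xpath.XPathFactory
import org.xml.sax.InputSource

object XPathSketch {
  // Trimmed copy of the question's XML, inlined so the sketch is self-contained.
  val xml: String =
    """<publisher>
      |  <book _id="0">
      |    <market _id="0" type="A"><price>45.95</price></market>
      |    <market _id="0" type="B"><price>55.95</price></market>
      |  </book>
      |  <book _id="1">
      |    <market _id="1" type="A"><price>145.95</price></market>
      |    <market _id="1" type="B"><price>155.95</price></market>
      |  </book>
      |</publisher>""".stripMargin

  // Hypothetical helper: string value of the first <price> matching both attributes.
  def priceFor(bookId: String, marketType: String): String = {
    val xpath = XPathFactory.newInstance().newXPath()
    val expr =
      s"/publisher/book[@_id='$bookId']/market[@_id='$bookId' and @type='$marketType']/price"
    xpath.evaluate(expr, new InputSource(new StringReader(xml)))
  }

  def main(args: Array[String]): Unit =
    println(priceFor("1", "B")) // prints 155.95
}
```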
Bhargav Rao
Pardeep Sharma
  • This had nothing to do with the original question which was about parsing the XML with scala-xml, not XPath in Spark. – ashawley Aug 25 '17 at 21:01
  • I have provided an alternative, I didn't say this is an answer for my solution. – Pardeep Sharma Aug 26 '17 at 12:15
  • Because XmlFile.withCharset belongs to a private object, we were not able to call it directly, so I implemented XmlFileUtil: public class XmlFileUtil { public static RDD withCharset(SparkContext context, String location, String charset, String rowTag) { return XmlFile.withCharset(context, location, charset, rowTag); } } – Pardeep Sharma Aug 28 '17 at 14:09
  • Interesting, you should ask a new question about that. – ashawley Aug 28 '17 at 16:15
  • Thank you @ashawley, I just wanted to share another approach. Sure, I'll ask another question and put these comments there. – Pardeep Sharma Aug 29 '17 at 07:09