
I have a table in Hive which contains a column with an XML string:

<?xml version="1.0"?>
<Employees>
    <Employee index="1">
        <name>Jane Doe</name>
        <age>30</age>
        <telephone>111 333 444</telephone>
        <education>University</education>
    </Employee>
    <Employee index="2">
        <name>John Doe</name>
        <age>32</age>
        <telephone>222 333 444</telephone>
        <education>High School</education>
    </Employee>
</Employees>

I can transfer this data into a new table with this query in Hive without any problems:

SELECT
    index,
    XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="', Index, '"]/education')) AS education,
    XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="', Index, '"]/age')) AS age
FROM DB.XML_TABLE
LATERAL VIEW OUTER EXPLODE(XPATH(xml, '/Employees/Employee/@index')) lv AS index

But when I want to create a DataFrame in PySpark with spark.sql(...) and the same query, execution ends with this error:

: org.apache.spark.sql.AnalysisException: cannot resolve 'XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education'))' due to data type mismatch: path should be a string literal;

I tried casting index as a string and using a CTE, but nothing works. Is there a way to do it through spark.sql with a Hive query, or with some other workaround?
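
For reference, this is roughly how the query is submitted from PySpark (a minimal sketch; it assumes a Hive-enabled SparkSession and the same DB.XML_TABLE and xml column as above):

from pyspark.sql import SparkSession

# Hive-enabled session so that DB.XML_TABLE is visible from Spark SQL
spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# The same query that works in Hive; this call raises the
# "path should be a string literal" AnalysisException
df = spark.sql("""
    SELECT
        index,
        XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="', Index, '"]/education')) AS education,
        XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="', Index, '"]/age')) AS age
    FROM DB.XML_TABLE
    LATERAL VIEW OUTER EXPLODE(XPATH(xml, '/Employees/Employee/@index')) lv AS index
""")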

  • Perhaps it means that the XPath cannot be dynamically constructed and should be `XPATH_STRING(xml, '/Employees/Employee[0]/education')`; I've seen that before on some databases. – LMC Jun 08 '22 at 16:01
  • I came to the same conclusion, that the xpath cannot be dynamically constructed... Hopefully someone else will have ideas. – ZygD Jun 08 '22 at 17:10
  • Unfortunately, that is what "a string literal" means I guess. The `path` is checked for being foldable here https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/xml/xpath.scala#L43. With "foldable" defined as _Returns true when an expression is a candidate for static evaluation before the query is executed..._ (https://github.com/apache/spark/blob/branch-3.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Expression.scala#L89) – mazaneicha Jun 08 '22 at 22:31
  • Is there a way to get the same result without the error with the help of selectExpr()? – Martin Jun 09 '22 at 10:52
  • I don't think `selectExpr()` will help :( The only sure way I can see is to collect the indices upfront, then loop through them building an `xpath` for each and running a query, then union the results. `indices = spark.sql("select xml, explode(XPATH(xml, '/Employees/Employee/@index')) ii from xml_table").collect()` But maybe someone has better ideas? – mazaneicha Jun 09 '22 at 14:12
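
Expanding on that last comment, a rough sketch of the collect-and-loop workaround could look like the following (untested; it assumes the same Hive-enabled session and DB.XML_TABLE as above, and that rows which lack a given index simply produce empty/NULL values that can be filtered afterwards):

from functools import reduce
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

# 1. Collect the distinct Employee indices up front
indices = [
    r['idx']
    for r in spark.sql(
        "SELECT explode(XPATH(xml, '/Employees/Employee/@index')) AS idx FROM DB.XML_TABLE"
    ).distinct().collect()
]

# 2. Run one query per index, with the xpath baked in as a string literal
#    so Spark's "foldable" check on the path is satisfied
dfs = []
for i in indices:
    dfs.append(spark.sql(f"""
        SELECT
            '{i}' AS index,
            XPATH_STRING(xml, '/Employees/Employee[@index="{i}"]/education') AS education,
            XPATH_NUMBER(xml, '/Employees/Employee[@index="{i}"]/age') AS age
        FROM DB.XML_TABLE
    """))

# 3. Union the per-index results into one DataFrame
result = reduce(lambda a, b: a.unionByName(b), dfs)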

0 Answers