I have a table in Hive that contains a column with an XML string:
<?xml version="1.0"?>
<Employees>
  <Employee index="1">
    <name>Jane Doe</name>
    <age>30</age>
    <telephone>111 333 444</telephone>
    <education>University</education>
  </Employee>
  <Employee index="2">
    <name>John Doe</name>
    <age>32</age>
    <telephone>222 333 444</telephone>
    <education>High School</education>
  </Employee>
</Employees>
I can transfer this data into a new table with the following query in Hive without any problems:
SELECT
  index,
  XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="', index, '"]/education')) AS education,
  XPATH_NUMBER(xml, CONCAT('/Employees/Employee[@index="', index, '"]/age')) AS age
FROM DB.XML_TABLE
LATERAL VIEW OUTER EXPLODE(XPATH(xml, '/Employees/Employee/@index')) lv AS index
But when I try to create a DataFrame in PySpark by passing this same query to spark.sql(...), execution fails with this error:
: org.apache.spark.sql.AnalysisException: cannot resolve 'XPATH_STRING(xml, CONCAT('/Employees/Employee[@index="',Index,'"]/education'))' due to data type mismatch: path should be a string literal;
I have tried casting index to a string and rewriting the query as a CTE, but nothing works. Is there a way to run this Hive query through spark.sql, or some other workaround?
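For context, the per-row extraction I am after is easy to express in plain Python with xml.etree.ElementTree, so one direction I considered is wrapping something like this in a UDF (the function name and return shape below are my own sketch, not working Spark code):

```python
import xml.etree.ElementTree as ET

def extract_employees(xml_string):
    """Parse one XML column value and return a dict per Employee element."""
    root = ET.fromstring(xml_string)
    rows = []
    for emp in root.findall("Employee"):
        rows.append({
            "index": emp.get("index"),               # the index attribute
            "education": emp.findtext("education"),  # child element text
            "age": int(emp.findtext("age")),
        })
    return rows
```

Registered as a UDF and combined with explode, this would sidestep the "path should be a string literal" restriction entirely, but I would prefer a pure spark.sql solution if one exists.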