I am working with an XML file that has the structure shown below.

I am trying to access tag 2.1.1 and its child attributes, so I set the root tag to tag2 and the row tag to tag 2.1.1. The code below returns null. If I apply the same logic to tag1, it works fine. What am I missing here?

   <root>
    <tag1>
     <tag 1.1>a</tag 1.1>
     <tag 1.2>b</tag 1.2>
    </tag1>
    <tag2>
     <tag 2.1>
      <tag 2.1.1>
        <---Multiple tags--->
      </tag 2.1.1>         
     </tag 2.1>
     <tag 2.2>
        <---multiple tags---->
     </tag 2.2> 
    </tag2>
   </root>

df = sqlContext.read.format('com.databricks.spark.xml')\
.options(rootTag='tag2',rowTag='tag 2.1.1') \
.load('s3://xmlpath')
sakthi srinivas

1 Answer

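tag1 works because its child tags (tag 1.1, tag 1.2) sit directly inside it, whereas tag2 has an extra level of nesting (tag 2.1 wraps tag 2.1.1), so tag1 and tag2 are not structured the same way.

Try the following: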

df = sqlContext.read.format('com.databricks.spark.xml')\
.options(rootTag='tag2',rowTag='tag 2.1') \
.load('s3://xmlpath')

Do your XML tag names actually contain the period character? A period in a tag name can cause trouble when the tags are nested and you want to refer to a field as parenttag.childtag.
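One quick way to check whether the tag names themselves are the problem is to run a small sample through a plain XML parser before involving Spark. The sketch below uses Python's standard xml.etree.ElementTree. The tag names are hypothetical, modelled on the placeholder structure in the question. It suggests that a period inside a name is accepted by the parser, while a space inside a name (as written in the question's sample) is not well-formed XML at all.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Hypothetical samples: same nesting as the question, different naming.
with_period = "<tag2><tag2.1><tag2.1.1>x</tag2.1.1></tag2.1></tag2>"
with_space = "<tag2><tag 2.1>x</tag 2.1></tag2>"

period_ok = is_well_formed(with_period)  # period is a legal name character
space_ok = is_well_formed(with_space)    # a space ends the element name

print(period_ok, space_ok)
```

If the real file parses cleanly here, the tag names are probably not what is producing the nulls, and it is worth looking at the rowTag value and the schema next.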

Thanks, Naveen

NNK
  • I was able to read the contents with rowTag alone. The actual problem was with the schema. We used one master XML schema and applied it to other XML files. It was an int-to-string conversion issue (totally unrelated to the question I asked). Thanks :) – sakthi srinivas Dec 20 '18 at 13:29