I'm using hivexmlserde-1.0.5.3 to parse XML data into Hive tables. I'm having trouble parsing elements whose text contains line breaks, like this:

    <item>
        <itemid>1</itemid>
        <contents subscript = "n">
            <name>Item1</name>
            <details>Line 1 with a line break.
            Line 2 here, which is not being read.</details>
        </contents>
    </item>

Only the first line of the details text is read when I parse it with the following table definition:

    DROP TABLE IF EXISTS db.tbl;
    CREATE EXTERNAL TABLE db.tbl  (
      ID STRING COMMENT '',
      CONTENTS ARRAY<STRUCT<
      subscript:STRING,
      contents:struct<Name:STRING,Details:STRING>>> COMMENT '') COMMENT ''
        ROW FORMAT SERDE 'com.ibm.spss.hive.serde2.xml.XmlSerDe'
        WITH SERDEPROPERTIES (
        "column.xpath.id"="/item/itemid/text()",
        "column.xpath.contents"= "/item/contents")
        STORED AS
        INPUTFORMAT 'com.ibm.spss.hive.serde2.xml.XmlInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
        LOCATION '${stagingFolderPath}'
        TBLPROPERTIES ("xmlinput.start"="<item>","xmlinput.end"="</item>");

Is there something I'm doing wrong or is there a better way to do this? Any help will be appreciated.

TIA

E_net4
kndarp

1 Answer

I couldn't find a way to make the SerDe parse values that contain line breaks. What worked was removing the line breaks from the data before loading it (you could also replace them with a marker of your own). After that, the data parsed just as I had expected. Hope this helps. Cheers.
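For anyone wanting to script that cleanup, here is a minimal Python sketch of one way to do it. It is not part of hivexmlserde; the function name `flatten_line_breaks` and the `marker` parameter are made up for illustration. It assumes the files are small enough to read into memory and contain no CDATA sections, comments, or `>` characters inside attribute values.

```python
import re

# Text between a closing '>' and the next '<', i.e. element text content.
_TEXT_NODE = re.compile(r">([^<>]*)<")
# A line break together with the spaces or tabs around it.
_LINE_BREAK = re.compile(r"[ \t]*\r?\n[ \t]*")


def flatten_line_breaks(xml_text: str, marker: str = " ") -> str:
    """Replace line breaks inside element text with `marker`.

    Whitespace-only text (the indentation between tags) is left untouched.
    Hypothetical helper: assumes no CDATA sections, comments, or '>'
    characters inside attribute values.
    """

    def _fix(match):
        text = match.group(1)
        if not text.strip():
            return match.group(0)  # pure indentation between tags, keep as-is
        return ">" + _LINE_BREAK.sub(marker, text) + "<"

    return _TEXT_NODE.sub(_fix, xml_text)
```

You would run this over each source file before copying it to the table's LOCATION. If the line breaks matter downstream, pass a distinctive `marker` instead of a space so they can be restored after the data is in Hive.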

kndarp