0

HDP-2.5.0.0 using Ambari 2.4.0.1

The Hive table ReportSetting is as follows :

id int

serializedreportsetting String

The column 'serializedreportsetting' is an XML data type in the source SQL Server db but is converted to String during Sqoop import, this is how it looks in SQL Server :

<ReportSettings4 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Trigger>
  <Manual>true</Manual>
  </Trigger>
<StartTime>
    <Year>8</Year>
    <Month>1</Month>
    <Day>1</Day>
    <Hour>0</Hour>
    <Minute>0</Minute>
  </StartTime>
  <ReportPeriod>
    <Month>0</Month>
    <Day>0</Day>
    <Hour>0</Hour>
    <Minute>5</Minute>
  </ReportPeriod>
  <Theft>
    <DigitalInput>true</DigitalInput>
    <Can>false</Can>
  </Theft>
  <SequenceNo>0</SequenceNo>
</ReportSettings4>

In Hive table :

<ReportSettings4 xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Trigger><Manual>true</Manual></Trigger><StartTime><Year>8</Year><Month>12</Month><Day>31</Day><Hour>23</Hour><Minute>34</Minute></StartTime><ReportPeriod><Month>0</Month><Day>0</Day><Hour>4</Hour><Minute>0</Minute></ReportPeriod><Theft><DigitalInput>false</DigitalInput><Can>false</Can></Theft><SequenceNo>3</SequenceNo></ReportSettings4>

The query that works fine on the SQL Server :

SELECT
r.VehicleId
,rs.value('(Trigger/Manual)[1]', 'bit') AS RS_Trigger_Manual, ,CAST(CONCAT(CASE WHEN rs.value('(StartTime/Year)[1]', 'int') < 10 THEN CONCAT('200',rs.value('(StartTime/Year)[1]', 'int')) ELSE CONCAT('20',rs.value('(StartTime/Year)[1]', 'int')) END,'-',rs.value('(StartTime/Month)[1]', 'int'),'-',rs.value('(StartTime/Day)[1]', 'int'),' ',rs.value('(StartTime/Hour)[1]', 'int'),':',rs.value('(StartTime/Minute)[1]', 'int'),':','00.000') AS datetime) AS RS_StartTime
,rs.value('(ReportPeriod/Month)[1]', 'int') AS RS_ReportPeriod_Month
,rs.value('(ReportPeriod/Day)[1]', 'int') AS RS_ReportPeriod_Day
,rs.value('(ReportPeriod/Hour)[1]', 'int') AS RS_ReportPeriod_Hour
,rs.value('(ReportPeriod/Minute)[1]', 'int') AS RS_ReportPeriod_Minute
,rs.value('(Theft/DigitalInput)[1]', 'bit') AS RS_Theft_DigitalInput
,rs.value('(Theft/Can)[1]', 'bit') AS RS_Theft_Can,rs.value('(SequenceNo)[1]', 'int') 

AS RS_SequenceNo FROM ReportSetting r
  CROSS APPLY SerializedReportSetting.nodes('/*') AS ReportSettings(rs)

I could think of/do the following :

  1. To use the CROSS APPLY, I guess lateral view needs to be used, here I don't have the serializedreportsetting as an array, so explode() won't work. Can someone validate if I'm thinking in the right direction
  2. I simply tried to get the data in serializedreportsetting as columns using the built-in xpath udf, however, I don't get any records, few trials are as follows :

    select xpath(SerializedReportSetting,'/*') from ReportSetting limit 1;

    select xpath(SerializedReportSetting,'/ReportSettings4') from ReportSetting limit 1;

    select xpath(SerializedReportSetting,'/Trigger/Manual') from ReportSetting limit 1;

**********UPDATE-1**********

I used the regexp_replace to handle the above challenge :

SELECT id,
  xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/Manual')        AS RS_Trigger_Manual,
  xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/DriveChange')   AS RS_Trigger_DriveChange
FROM
  (SELECT id,
    regexp_replace(SerializedReportSetting, 'ReportSettings+\\d','ReportSettings') AS SerializedReportSetting
  FROM reportsetting
  WHERE id IN (1701548,3185,1700231,1700232)
  ) reportsetting_regex;
Kaliyug Antagonist
  • 3,512
  • 9
  • 51
  • 103

1 Answers1

1

In xpath they explicitly say:

The xpath() function always returns a hive array of strings. If the expression results in a non-text value (e.g., another xml node) the function will return an empty array

So you can either use: xpath(SerializedReportSetting,'/ReportSettings4/Trigg‌​er/Manual/text()') from ReportSetting limit 1;

Or an even better option is to use xpath_boolean/xpath_int:

xpath_boolean - Returns true if the XPath expression evaluates to true, or if a matching node is found.

xpath_boolean(SerializedReportSetting,'/ReportSettings4/Trigg‌​er/Manual') from ReportSetting limit 1;

xpath_short, xpath_int, xpath_long These functions return an integer numeric value, or the value zero if no match is found, or a match is found but the value is non-numeric. Mathematical operations are supported. In cases where the value overflows the return type, then the maximum value for the type is returned.

xpath_int(SerializedReportSetting,'/ReportSettings4/ReportPeriod/Month') from ReportSetting limit 1;

Alex Libov
  • 1,481
  • 2
  • 11
  • 20
  • Is it possible to write a generic xpath expression that can handle 'regular expressions' because SerializedReportSetting can have ReportSettings, ReportSettings1, ReportSettings2, ReportSettings3 and ReportSettings4 ? – Kaliyug Antagonist Nov 03 '16 at 11:59
  • It's unclear from the xpath documentation whether something like `xpath_int(SerializedReportSetting,'/ReportSettings*/ReportPeriod/Month')` would work – Alex Libov Nov 03 '16 at 12:05
  • 1
    I thought of a workaround. You could do: `xpath_int(regexp_replace(SerializedReportSetting, 'ReportSettings[^ ]*','ReportSettings'),'/ReportSettings/ReportPeriod/Month')`.that would remove the number from the xml before calling `xpath_int` – Alex Libov Nov 03 '16 at 12:06
  • /ReportSettings* doesn't work but I used the regexp_replace to get it working – Kaliyug Antagonist Nov 04 '16 at 09:50
  • /ReportSettings* doesn't work but I used the regexp_replace to get it working, updated my question(see 'Update-1'). Can you check if , logically, it's a correct way ? – Kaliyug Antagonist Nov 04 '16 at 09:56
  • @KaliyugAntagonist I think it should be 'ReportSettings\\d+'.. otherwise looks okay – Alex Libov Nov 04 '16 at 10:47