HDP-2.5.0.0 using Ambari 2.4.0.1
The Hive table ReportSetting is as follows :
id int
serializedreportsetting String
The column 'serializedreportsetting' is an XML data type in the source SQL Server db but is converted to String during Sqoop import, this is how it looks in SQL Server :
<ReportSettings4 xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<Trigger>
<Manual>true</Manual>
</Trigger>
<StartTime>
<Year>8</Year>
<Month>1</Month>
<Day>1</Day>
<Hour>0</Hour>
<Minute>0</Minute>
</StartTime>
<ReportPeriod>
<Month>0</Month>
<Day>0</Day>
<Hour>0</Hour>
<Minute>5</Minute>
</ReportPeriod>
<Theft>
<DigitalInput>true</DigitalInput>
<Can>false</Can>
</Theft>
<SequenceNo>0</SequenceNo>
</ReportSettings4>
In Hive table :
<ReportSettings4 xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Trigger><Manual>true</Manual></Trigger><StartTime><Year>8</Year><Month>12</Month><Day>31</Day><Hour>23</Hour><Minute>34</Minute></StartTime><ReportPeriod><Month>0</Month><Day>0</Day><Hour>4</Hour><Minute>0</Minute></ReportPeriod><Theft><DigitalInput>false</DigitalInput><Can>false</Can></Theft><SequenceNo>3</SequenceNo></ReportSettings4>
The query that works fine on the SQL Server :
SELECT
r.VehicleId
,rs.value('(Trigger/Manual)[1]', 'bit') AS RS_Trigger_Manual, ,CAST(CONCAT(CASE WHEN rs.value('(StartTime/Year)[1]', 'int') < 10 THEN CONCAT('200',rs.value('(StartTime/Year)[1]', 'int')) ELSE CONCAT('20',rs.value('(StartTime/Year)[1]', 'int')) END,'-',rs.value('(StartTime/Month)[1]', 'int'),'-',rs.value('(StartTime/Day)[1]', 'int'),' ',rs.value('(StartTime/Hour)[1]', 'int'),':',rs.value('(StartTime/Minute)[1]', 'int'),':','00.000') AS datetime) AS RS_StartTime
,rs.value('(ReportPeriod/Month)[1]', 'int') AS RS_ReportPeriod_Month
,rs.value('(ReportPeriod/Day)[1]', 'int') AS RS_ReportPeriod_Day
,rs.value('(ReportPeriod/Hour)[1]', 'int') AS RS_ReportPeriod_Hour
,rs.value('(ReportPeriod/Minute)[1]', 'int') AS RS_ReportPeriod_Minute
,rs.value('(Theft/DigitalInput)[1]', 'bit') AS RS_Theft_DigitalInput
,rs.value('(Theft/Can)[1]', 'bit') AS RS_Theft_Can,rs.value('(SequenceNo)[1]', 'int')
AS RS_SequenceNo FROM ReportSetting r
CROSS APPLY SerializedReportSetting.nodes('/*') AS ReportSettings(rs)
I could think of/do the following :
- To use the CROSS APPLY, I guess lateral view needs to be used, here I don't have the serializedreportsetting as an array, so explode() won't work. Can someone validate if I'm thinking in the right direction
I simply tried to get the data in serializedreportsetting as columns using the built-in xpath udf, however, I don't get any records, few trials are as follows :
select xpath(SerializedReportSetting,'/*') from ReportSetting limit 1;
select xpath(SerializedReportSetting,'/ReportSettings4') from ReportSetting limit 1;
select xpath(SerializedReportSetting,'/Trigger/Manual') from ReportSetting limit 1;
**********UPDATE-1**********
I used the regexp_replace to handle the above challenge :
SELECT id,
xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/Manual') AS RS_Trigger_Manual,
xpath_string(SerializedReportSetting,'/ReportSettings/Trigger/DriveChange') AS RS_Trigger_DriveChange
FROM
(SELECT id,
regexp_replace(SerializedReportSetting, 'ReportSettings+\\d','ReportSettings') AS SerializedReportSetting
FROM reportsetting
WHERE id IN (1701548,3185,1700231,1700232)
) reportsetting_regex;