
We have a SQL Server 2016 database with a varbinary column that contains compressed XML. We want to decompress this data and load it into a CDP Hive (Hive 3.1.3000) table.

Initially we used a Java utility to inflate the data, but we are now looking for an alternative approach such as PySpark.

We were using the following Java code to inflate the data:

if (colType == java.sql.Types.VARBINARY) {
    msg = "Processing VARBINARY " + colLabel;
    // logger.info("Checking VARBINARY column: " + colLabel);
    if (inflateColumnList.contains(colLabel)) {
        ByteArrayInputStream bais = new ByteArrayInputStream(rs.getBytes(colIndex));
        // nowrap = true: the input is treated as a raw DEFLATE stream (no zlib header)
        Inflater inflater = new Inflater(true);
        InflaterInputStream iis = new InflaterInputStream(bais, inflater);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        while (iis.available() != 0) {
            buffer.write(iis.read());
        }
        iis.close();
        result = new String(buffer.toByteArray(), "UTF-8");
    } else {
        logger.info(" VARBINARY column: " + colLabel + " is NOT in the unzip list");
        result = Base64.getEncoder().encodeToString(rs.getBytes(colIndex));
    }
}

I am at the point where I can fetch the bytearray from the DataFrame, as shown below:

bytearrayobj = df.select(F.collect_list('itemdetailsdata')).first()[0][0]
print(zlib.decompress(bytes.decode(bytearrayobj,'utf-8')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'decode' requires a 'str' object but received a 'bytearray'

Please guide me on how to proceed in order to generate the decompressed XML from this bytearray.
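
For reference, a minimal Python sketch of what the Java code above appears to do. It rests on two assumptions taken from that code: `new Inflater(true)` selects raw DEFLATE (no zlib header), which corresponds to a negative `wbits` value in Python's `zlib`, and the inflated bytes are UTF-8. The names `inflate_to_text` and `raw_deflate` are hypothetical; `raw_deflate` exists only to build sample input.

```python
import zlib


def inflate_to_text(raw, encoding="utf-8"):
    """Inflate a raw-DEFLATE payload and decode it to text.

    `raw` may be bytes or bytearray (a BinaryType value fetched from a
    DataFrame arrives as bytearray). None is passed through unchanged.
    """
    if raw is None:
        return None
    # bytes(...) copies the bytearray into an immutable object that
    # zlib.decompress accepts on both Python 2 and Python 3.
    # Negative wbits means "no zlib header/trailer" (assumed to match
    # Java's Inflater(true)).
    inflated = zlib.decompress(bytes(raw), -zlib.MAX_WBITS)
    # Decompress first, then decode the resulting bytes.
    return inflated.decode(encoding)


def raw_deflate(text, encoding="utf-8"):
    """Hypothetical helper: produce raw-DEFLATE sample data for testing."""
    compressor = zlib.compressobj(
        zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS
    )
    return compressor.compress(text.encode(encoding)) + compressor.flush()


sample_xml = u"<item id=\"1\"><name>widget</name></item>"
sample_column_value = bytearray(raw_deflate(sample_xml))
print(inflate_to_text(sample_column_value))
```

If the stored data turns out to carry a zlib header after all, the negative `wbits` argument would have to be dropped; that depends on how the column was originally written.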

  • Question the first: what is it compressed *with*? SQL Server offers compressed storage, but that's transparent to clients. For everything else, you will have applied some particular algorithm for the compression. – Jeroen Mostert Dec 09 '21 at 10:35
  • @JeroenMostert We have updated the question with the code we use to inflate the varbinary column. HTH – Anshuman Srivastava Dec 09 '21 at 10:57
  • Right, that's a zlib-compressed, UTF-8 encoded string, so [`zlib.decompress`](https://docs.python.org/3/library/zlib.html) and [`bytes.decode`](https://docs.python.org/3/library/stdtypes.html#bytes.decode) should see you off. – Jeroen Mostert Dec 09 '21 at 11:01
  • @JeroenMostert tried both but getting error "TypeError: descriptor 'decode' requires a 'str' object but received a 'bytearray'" – Anshuman Srivastava Dec 09 '21 at 13:18
  • Switch the operations. First you decompress, then you decode. – Jeroen Mostert Dec 09 '21 at 13:26
  • ```print(bytes.decode(zlib.decompress(bytearrayobj),'utf-8')) Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: must be string or read-only buffer, not bytearray``` – Anshuman Srivastava Dec 10 '21 at 05:42
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/240002/discussion-between-anshuman-srivastava-and-jeroen-mostert). – Anshuman Srivastava Dec 10 '21 at 08:05
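
Following the decompress-then-decode order suggested in the comments, the conversion could also be applied per row instead of on a single collected value. The sketch below assumes PySpark is installed, that the column is `itemdetailsdata` as in the question, and that the payload is raw DEFLATE as implied by `new Inflater(true)`. The names `inflate_or_none`, `add_inflated_column`, and the output column `itemdetails_xml` are hypothetical.

```python
import zlib


def inflate_or_none(raw):
    """Decompress first, then decode; NULL cells stay None."""
    if raw is None:
        return None
    # bytes(...) avoids the "not bytearray" TypeError seen on Python 2.
    # Negative wbits = raw DEFLATE (assumption: matches Inflater(true)).
    return zlib.decompress(bytes(raw), -zlib.MAX_WBITS).decode("utf-8")


def add_inflated_column(df, source_col="itemdetailsdata",
                        target_col="itemdetails_xml"):
    """Return a new DataFrame with a string column of inflated XML."""
    # Imported lazily so inflate_or_none can be tried without Spark.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    inflate_udf = F.udf(inflate_or_none, StringType())
    return df.withColumn(target_col, inflate_udf(F.col(source_col)))
```

Calling `add_inflated_column(df)` returns a new DataFrame, so nothing has to be collected to the driver before the data is written out.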

0 Answers