Why is DataStage writing NULL string values as empty strings, while other data types correctly have NULL values

Question

I have a DataStage parallel job that writes to Hive as the final stage in a long job. I can view the data that is about to be written and there are many NULL strings that I want to see in the Hive table.

However, when I view the table that is created, there are no NULL strings, they all get converted into empty strings '' instead. I can see other datatypes, like DECIMAL(5,0) have NULL values and I can select these, e.g.

SELECT * FROM mytable WHERE decimal_column IS NULL;

The process for writing to Hive is to store the data in a staging table in a delimited text format. This is then pushed through a generic CDC process and results in data being written to a new partition in an ORC format table.

The only option I can see for handling NULL values is "Null Value" in the HDFS File Connector Stage. If I leave this blank then I get empty strings and if I type in 'NULL' then 'NULL' is what I get, i.e. not a NULL, but the string 'NULL'.

I can't change the process as it's in place for literally thousands of jobs already. Is there any way to get my string values to be NULL or am I stuck with empty strings?

score 0 · Answer 1 · answered Jul 20 '22 at 10:33

0

According to the IBM documentation, an empty String in double-quotation "" should help.

Null value

Specify the character or string that represents null values in the data. For a source stage, input data that has the value that you specify is set to null on the output link. For a target stage, in the output file that is written to the file system, null values are represented by the value that is specified for this property. To specify that an empty string represents a null value, specify "" (two double quotation marks).

Source: https://www.ibm.com/docs/en/iis/11.7?topic=reference-properties-file-connector

answered Jul 20 '22 at 10:33

Justus Kenklies

440
3
10

I tested this and it doesn't appear to do anything. I think the default behaviour is to store a NULL string as an empty string, i.e. "" and all this does is to confirm that behaviour. I want a NULL string to be stored as a NULL value, like it already does for dates, decimals, etc. – Richard Hansell Jul 20 '22 at 13:45
Did you try putting something like `If Len(Trim(mylink.mystring1)) = 0 Then SetNull() Else mylink.mystring1` in a transformer before the output stage, so the stage gets not a Null String but a NUll value? – Justus Kenklies Jul 21 '22 at 21:16
Yes, it didn't work. If I write the final output to a Data Set then I see NULLs for strings. It's only when I write it to HDFS that it changes NULLs to empty strings. I can "fix" it with an HQL script, but I wondered why the default behaviour was to do this and if it was possible to change it – Richard Hansell Jul 22 '22 at 08:06
Hmm... then I suggest to open a support case at IBM. Sorry I couldn't help. – Justus Kenklies Jul 23 '22 at 17:19

Why is DataStage writing NULL string values as empty strings, while other data types correctly have NULL values

1 Answers1