
This is an extension of a question I posed yesterday: How to handle potential data loss when performing comparisons across data types in different groups

In Hive, is it possible to perform comparisons between two columns that are in different data type groups inline within the SELECT clause? I need to first determine the incoming metadata for each column and then provide logic that picks which CAST to use.

CASE 
    WHEN  Column1 <=> Column2 THEN 0 -- Error occurs here if data types are in different data type groups (i.e., BIGINT <=> STRING)
    ELSE 1
END

If this is not possible, is there a workaround to retrieve the metadata and perform the CAST based on certain rules? For example, if:

  1. Column1 is BIGINT
  2. Column2 is STRING

Then CAST Column2 as BIGINT and then perform the comparison. The simplest possible solution is desired; I'm hoping to avoid substantial code changes to the underlying system that generates the SELECT statement.
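For reference, Hive itself can report each column's type via DESCRIBE, though whether the back-end can run this before building the SELECT is a separate question. A sketch (the table name is hypothetical):

```sql
-- Ask Hive for the column metadata of a table (my_table is a placeholder):
DESCRIBE my_table;
-- Typical output columns: col_name, data_type, comment, e.g.
--   column1    bigint
--   column2    string
```

The types returned here could drive the rule "if one side is BIGINT and the other is STRING, cast the STRING side".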

Update 2: There is a back-end system that generates the SQOOP queries, and in that system a function modifies the SELECT clause to compare two columns. The problem we are running into is that the columns used to be implicitly converted on the earlier version of Hive/Cloudera we were running. Now that we are upgrading to a version that does not support implicit data type conversion, we need the simplest way to keep supporting the conversion. Unfortunately, no metadata is available within the back-end, so I am trying to hack it out in the SELECT clause.

J Weezy

1 Answer


Casting explicitly is a good idea:

CASE 
    WHEN  Column1 <=> cast(Column2 as bigint) THEN 0 
    ELSE 1
END

If the cast is not possible, its result is NULL.
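A quick illustration of that NULL behaviour under the null-safe equality operator `<=>`, including one caveat (the values are made up):

```sql
SELECT 5 <=> cast('5'   as bigint);  -- true:  the string converts cleanly
SELECT 5 <=> cast('abc' as bigint);  -- false: the failed cast yields NULL
-- Caveat: if Column1 is NULL and Column2 fails to cast, both sides are
-- NULL, and NULL <=> NULL is true -- which may count as a spurious match.
```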

leftjoin
  • Well, yeah. But I need to be able to check Column2's incoming metadata before determining which data type to `CAST` to. I have updated my question to provide more details. – J Weezy Oct 04 '19 at 17:53
  • @JWeezy Of course. If you know the column names (and you need to know them), then what is the problem with their types? How did you learn the column names without learning their types? – leftjoin Oct 04 '19 at 17:57
  • We use a back-end to generate the SQOOP queries. There is a function that modifies certain rows in the `SELECT` clause, and it is this function I would like to change. I have updated my question. – J Weezy Oct 04 '19 at 17:59
  • @JWeezy What if you always cast both columns to STRING? That could be a solution to your problem. – leftjoin Oct 04 '19 at 18:02
  • That was what my original question (see the link at the top of this question) was asking. But the answer I received pointed out that there could be loss in the conversion process, which would break the comparison. – J Weezy Oct 04 '19 at 18:04
  • @JWeezy Yep, agree completely. It will not work with floating-point types. – leftjoin Oct 04 '19 at 18:07
  • If there are only non-floating-point types (e.g., STRING, DATE, INT, BIGINT), would conversion to STRING work? – J Weezy Oct 04 '19 at 18:09
  • @JWeezy Except for string-represented timestamps. For example, the two strings '2019-01-01 12:10:01.0' and '2019-01-01 12:10:01.0000', when cast to TIMESTAMP, will result in the same timestamp even though the strings are different. TIMESTAMP has a similar precision issue. – leftjoin Oct 04 '19 at 18:19
  • That was helpful. Thank you. – J Weezy Oct 04 '19 at 19:05