I have a column named ViolationTime in my Hive table. It contains the time in 24-hour HHmm format, for example 1424. The table contains 10 million rows. I want to divide the values into 6 discrete groups so I can run operations on each group.

I tried using ntile, but it splits the rows into groups by their position in ascending or descending order rather than by value. I'd like this column to be divided into fixed, discrete time intervals instead.

mazaneicha

1 Answer

In Hive 3.0 and newer, the width_bucket() function does that:

Return type: INT
Name (signature): width_bucket(NUMERIC expr, NUMERIC min_value, NUMERIC max_value, INT num_buckets)
Description: Returns an integer between 0 and num_buckets+1 by mapping expr into the ith equally sized bucket. Buckets are made by dividing [min_value, max_value] into equally sized regions. If expr < min_value, return 0; if expr > max_value, return num_buckets+1.

Note that HHmm values are not evenly spaced as numbers (1259 is followed by 1300, not 1260), so to get truly equal-width time intervals you may need to convert them to INTs first, e.g. the number of minutes or seconds since midnight.
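
As a minimal sketch of that conversion, assuming ViolationTime is stored as a 4-character STRING such as '1424' and that the table is called violations (a hypothetical name, not given in the question), something along these lines should split the day into six 4-hour buckets:

```sql
-- Sketch only: assumes ViolationTime is a 4-character HHmm STRING
-- and that the table is named violations (hypothetical).
SELECT
  ViolationTime,
  width_bucket(
    -- minutes since midnight: HH * 60 + mm
    CAST(substr(ViolationTime, 1, 2) AS INT) * 60
      + CAST(substr(ViolationTime, 3, 2) AS INT),
    0,      -- min_value: 00:00
    1440,   -- max_value: 24 * 60 minutes
    6       -- num_buckets: six intervals of 240 minutes each
  ) AS time_bucket
FROM violations;
```

With these bounds, 1424 becomes 864 minutes since midnight, which lies in the fourth 240-minute interval (12:00 to 15:59). If the column is already numeric, the HH and mm parts would have to be extracted arithmetically rather than with substr.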

mazaneicha