
Is there any way to have a global variable in a Hive UDF?

I am trying to solve the problem below. The scenario is as follows: I have three types of file.

  1. A file with 4 columns (let's assume the column names are A, B, C, and D)
  2. A file with 2 columns (B, D)
  3. A file with 2 columns (B, C)

I will convert all three files into a standard format (the File 1 format: an output with 4 columns). To convert a file into the standard format, I need to refer to the header record in the first line of that file. So if my input file is 256MB and multiple mappers get invoked, is there any way for each mapper to refer to a global variable (the header information)?
In short, is there a way to have a common variable for all the mappers that get invoked by my Hive UDF?

Note: The UDF will run on a single-column table, thereby reading the complete row and then writing it to the next table's HDFS location.

Garfield

1 Answer


Yes, there is a way to do this, and I've done it myself.

The best way is to read the header information from the file BEFORE you start your map-reduce job, then set it as a configuration value that the Mappers and Reducers can read.

So for example you'd do something like this (pseudo-scala) before launching your job in your main method:

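// assume c is the job's Hadoop Configuration (org.apache.hadoop.conf.Configuration)
// getHeaderInformation is your own helper that reads the header line of the file

val headerInformationJson = getHeaderInformation(filePath1)
c.set("headerInfo", headerInformationJson)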

Then in the setup method of your mappers (setup(context) in the org.apache.hadoop.mapreduce API) you can read this back out:

val conf = context.getConfiguration()
val headerInfo = conf.get("headerInfo")
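
Once headerInfo is available in the mapper, turning each row into the standard A, B, C, D layout could look roughly like the sketch below. It makes several assumptions that are not in the question: the header is stored as a plain comma-separated string rather than JSON, rows are comma-delimited, and columns missing from the input are written as empty strings. The names HeaderNormalizer, parseHeader and normalize are made up for illustration, and the code deliberately uses no Hadoop or Hive classes so it can be tried on its own.

```scala
object HeaderNormalizer {
  // Target layout from the question: four columns A, B, C, D.
  val standardColumns: Seq[String] = Seq("A", "B", "C", "D")

  // Turn the header string (e.g. the value stored under "headerInfo")
  // into a column-name -> position lookup. Do this once per mapper,
  // not once per row.
  def parseHeader(headerInfo: String): Map[String, Int] =
    headerInfo.split(",", -1).map(_.trim).zipWithIndex.toMap

  // Reorder one input row into the standard layout. Columns the input
  // file does not have are emitted as empty strings (an assumption).
  def normalize(positions: Map[String, Int], row: String): String = {
    val fields = row.split(",", -1)
    standardColumns
      .map { name =>
        positions.get(name)
          .filter(i => i < fields.length) // guard against short rows
          .map(i => fields(i))
          .getOrElse("")
      }
      .mkString(",")
  }
}
```

For a file of type 2 (header B,D), HeaderNormalizer.normalize(HeaderNormalizer.parseHeader("B,D"), "x,y") gives ",x,,y": the B and D values land in their standard positions and A and C stay empty.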
Matthew Rathbone