I have a CSV file and its metadata. Columns in this CSV are delimited by the pipe symbol (|). Sample data is as follows:

name|address|age|salary|doj
xyz | abcdef|29 |567,34|12/02/2001

Here the salary column is of type decimal, but a comma (,) is used as the decimal separator instead of a period (.).

I created a Hive external table as below, and for this data Hive shows NULL for the salary column.

create external table employee (
      name string,
      address string,
      age int,
      salary decimal(7,3),
      doj string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://bucket/folder_having_many_csv_files/';

If I change the data type of the salary column to string then, as expected, Hive works fine.

I would like to know how to tell Hive that this particular column is of type DECIMAL and that its decimal separator is a comma (,) rather than a period (.).

Shekhar
  • Any reason you can't just replace the comma separators in the file with periods? – Andrew Dec 20 '16 at 16:01
  • These files are huge and are dumped by another team. I could write a MapReduce or Spark job to replace the comma separators, but before doing that I wanted to check whether there is something in Hive I can use. – Shekhar Dec 20 '16 at 16:14

1 Answer

You could easily build the table with salary as a string and replace the comma in a view on top. This is probably the easiest approach, since the data is big and someone else likely owns it.
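For reference, the string-typed base table this relies on might look like the sketch below. It reuses the column names and the S3 location from the question; the only intended change is that salary is declared as string, so Hive reads the raw text (for example `567,34`) instead of failing to parse it as a decimal.

```sql
-- Sketch of the base table, assuming the layout given in the question.
-- salary is kept as STRING so the comma-separated value is read as-is.
CREATE EXTERNAL TABLE employee (
      name    string,
      address string,
      age     int,
      salary  string,   -- raw text such as '567,34'; converted later in the view
      doj     string
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LOCATION 's3://bucket/folder_having_many_csv_files/';
```

Because the table is external, defining it this way does not modify or copy the underlying files. The view on top then does the conversion: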

create view employee_decimal as
 select name
  , address
  , age
  , cast(regexp_replace(salary, ',', '.') as decimal(7,3)) as salary
  , doj
 from employee;
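As a quick sanity check, the conversion expression can be tried on the sample value from the question before querying the view. This is only a sketch: it assumes a Hive version that allows SELECT without a FROM clause, and the alias `salary_check` is a name chosen here for illustration.

```sql
-- Convert the sample value from the question; the result should be a
-- non-NULL decimal rather than the NULL seen with the original table.
SELECT cast(regexp_replace('567,34', ',', '.') AS decimal(7,3)) AS salary_check;

-- Query the view like any table; salary is now a decimal column,
-- so numeric comparisons and aggregates behave as expected.
SELECT name, salary
FROM employee_decimal
WHERE salary > 500;
```

Note that the second argument of `regexp_replace` is a regular expression. A comma has no special meaning there, so it can be passed without escaping.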
invoketheshell