The data I want to insert into a Hive table contains Latin words and is UTF-8 encoded, but Hive still does not display it properly.

Actual data: [image: Actual Data]

Data inserted in Hive: [image: Hive Data]

I changed the encoding of the table to UTF-8 as well, but the issue remains. The Hive DDL and commands are below:

CREATE TABLE IF NOT EXISTS test6
(
CONTACT_RECORD_ID    string,
ACCOUNT    string,
CUST    string,
NUMBER    string,
NUMBER1    string,
NUMBER2    string,
NUMBER3    string,
NUMBER4    string,
NUMBER5    string,
NUMBER6    string,
NUMBER7    string,
LIST    string
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY '|';
ALTER TABLE test6 SET serdeproperties ('serialization.encoding'='UTF-8');

Does Hive support only the first 128 characters of UTF-8? Please advise.

Chetan Pulate
  • _"hive does not display it properly"_ -- did you make sure it's not a *display* artifact? Did you enforce `export LANG=en_US.UTF-8` and check that your terminal app expects UTF-8 (e.g. with PuTTY, _Window > Translation > Remote charset = UTF-8_)? – Samson Scharfrichter Apr 04 '17 at 18:27
  • Also, did you download one of the HDFS files and run `file` command on it, just to make sure it actually detects UTF-8 content? – Samson Scharfrichter Apr 04 '17 at 18:30
  • Did you find a solution, @Chetan Pulate? If so, please mention it here. Thanks. – Nauman Khan Apr 19 '18 at 12:52
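
The `file` check suggested in the comments can be approximated in a few lines of Python once one of the HDFS files has been downloaded locally. This is only a sketch: `guess_encoding` is a hypothetical helper name, and a successful strict decode shows only that the bytes are well-formed UTF-8, not that they were intended as UTF-8.

```python
def guess_encoding(data: bytes) -> str:
    """Return a rough label for the encoding of the given bytes."""
    # Pure ASCII is byte-for-byte identical in UTF-8, ISO-8859-1 and windows-1252,
    # so it tells us nothing about which of those the producer used.
    if all(b < 0x80 for b in data):
        return "ascii"
    try:
        # Strict decoding raises on any malformed UTF-8 sequence.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "not utf-8"


# Small in-memory samples standing in for file contents.
print(guess_encoding(b"plain|ascii|row"))
print(guess_encoding("caf\u00e9".encode("utf-8")))
print(guess_encoding("caf\u00e9".encode("iso-8859-1")))
```

To apply it to a real file, read the local copy in binary mode and pass the bytes in. A result of "not utf-8" would suggest the data was exported in a single-byte encoding such as ISO-8859-1 or windows-1252.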

2 Answers


This may not be an ideal solution, but it works. Hive does not seem to treat the data as UTF-8. Try creating the table with the following parameters:

CREATE TABLE testjoins.yt_sample_mapping_1(
   `col1` string,
   `col2` string,
   `col3` string)
   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe"
   WITH SERDEPROPERTIES ( "separatorChar" = ",", 
    "quoteChar" = "\"", 
    "escapeChar" = "\\", 
    "serialization.encoding"='ISO-8859-1') 
    TBLPROPERTIES ( 'store.charset'='ISO-8859-1', 
    'retrieve.charset'='ISO-8859-1');
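
One possible explanation for why this works, which is an assumption and not something confirmed in the question, is that the source file is actually ISO-8859-1 (or windows-1252) rather than UTF-8. The sketch below is plain Python, independent of Hive, and shows what happens to such bytes when a reader assumes UTF-8. The sample string and the helper name `read_as` are made up for illustration.

```python
def read_as(data: bytes, charset: str) -> str:
    """Decode bytes with the given charset, turning undecodable bytes into U+FFFD."""
    return data.decode(charset, errors="replace")


original = "Se\u00f1or Jos\u00e9"             # "Senor Jose" with n-tilde and e-acute
latin1_bytes = original.encode("iso-8859-1")  # one byte per character

wrong = read_as(latin1_bytes, "utf-8")        # reader assumes UTF-8
right = read_as(latin1_bytes, "iso-8859-1")   # reader uses the real charset

print(repr(wrong))
print(repr(right))
```

ASCII letters survive either way because both encodings use the same single byte for the first 128 characters. That would match the symptom in the question, where only the non-ASCII characters are affected.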
BalaramRaju

For me, adding the following line worked:

TBLPROPERTIES('serialization.encoding'='windows-1252')

Example code:

CREATE EXTERNAL TABLE IF NOT EXISTS test.tbl
(
    name string,
    gender string,
    age string,
    address string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n' STORED AS TEXTFILE
LOCATION 'adl://<Data-Lake-Store>.azuredatalakestore.net/<Folder-Name>/'
TBLPROPERTIES('serialization.encoding'='windows-1252');
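
A different route, not taken in this answer and shown here only as a sketch, would be to convert the file to UTF-8 before placing it in the table location, so that no encoding property is needed. `to_utf8` is a hypothetical helper, and windows-1252 as the source encoding is an assumption carried over from the property above.

```python
def to_utf8(data: bytes, src_encoding: str = "windows-1252") -> bytes:
    """Re-encode bytes from src_encoding to UTF-8.

    Raises UnicodeDecodeError if a byte has no mapping in src_encoding
    (Python's windows-1252 codec leaves a few byte values undefined).
    """
    return data.decode(src_encoding).encode("utf-8")


# A made-up pipe-delimited row, encoded the way the source system might emit it.
sample = b"Jos\xe9|M|42|Calle 5\n"
converted = to_utf8(sample)
print(converted)
```

The pipe delimiters and the newline are ASCII, so they come through unchanged and the row layout expected by `FIELDS TERMINATED BY '|'` is preserved.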
Tokci