
I'm currently importing data into a hive table. When we created the table we used

CREATE EXTERNAL TABLE Customers
(
  Code        string,
  Company     string,
  FirstName   string,
  LastName    string,
  DateOfBirth string,
  PhoneNo     string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';

as there are commas in our data. However, we've now discovered that commas are still being treated as field delimiters, in addition to the `|` we're using to separate the fields. Is there any way to work around this? Do we have to escape every single comma in our data, or is there an easier way to set it up?

Example data

1|2|3|4
a|b|c|d
John|Joe|Bob, Jr|Alex

Which when put in the table appears as

1 2 3 4
a b c d
John Joe Bob Jr

With Jr occupying its own column and bumping Alex from the table.
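For reference, `FIELDS TERMINATED BY '|'` is supposed to split each line only on the pipe character, leaving commas untouched. A quick local sketch of that expected behavior (plain Python, not Hive itself):

```python
# Simulate splitting the sample rows on '|' only, the behavior the
# DDL above requests. Commas should survive inside a field.
rows = ["1|2|3|4", "a|b|c|d", "John|Joe|Bob, Jr|Alex"]
parsed = [line.split("|") for line in rows]
for fields in parsed:
    print(fields)
# Every row should yield 4 fields, with "Bob, Jr" kept as one field.
```

If the table instead shows `Jr` in its own column, something other than this single-delimiter split is being applied when the data is viewed.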

Robert Wertz
  • Can you show the full `CREATE TABLE` statement? – Jeremy Beard Mar 27 '15 at 01:00
  • @Jeremy Updated with the full statement. Could it be because we're using `ROW FORMAT DELIMITED`? I've seen that in all the example code so I kept it figuring it just meant this is where to look for that property, but it occurs to me that maybe it could trigger the comma behavior somehow? IBM's documentation is not the best unfortunately. – Robert Wertz Mar 27 '15 at 13:17
  • The delimiter part of your DDL looks correct to me. Perhaps it's something specific to your version of BigInsights. – Jeremy Beard Mar 27 '15 at 13:52
  • BigInsights 2.1.2, Hive 0.12 – Robert Wertz Mar 27 '15 at 13:52
  • I haven't used BigInsights so I couldn't help with it sorry. – Jeremy Beard Mar 27 '15 at 13:53

1 Answer


It works fine for me with your data; my Hive version is 0.13.

hive> create external table foo(
    > first string,
    > second string,
    > third string,
    > forth string)
    > row format delimited fields terminated by '|' lines terminated by '\n';
OK
Time taken: 3.222 seconds
hive> load data inpath '/user/xilan/data.txt' overwrite into table foo;

hive> select third from foo;
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1422157058628_0001, Tracking URL =    http://host:8088/proxy/application_1422157058628_0001/
Kill Command = /scratch/xilan/hadoop/bin/hadoop job  -kill job_1422157058628_0001
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-03-27 07:05:41,901 Stage-1 map = 0%,  reduce = 0%
2015-03-27 07:05:50,190 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 1.24 sec
MapReduce Total cumulative CPU time: 1 seconds 240 msec
Ended Job = job_1422157058628_0001
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 1.24 sec   HDFS Read: 245 HDFS Write: 12     SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 240 msec
OK
3
c
Bob, Jr
Time taken: 18.853 seconds, Fetched: 3 row(s)
hive>
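The `select third from foo` output above can be reproduced locally by taking the third `|`-separated field of each sample line (a plain-Python sanity check, not Hive):

```python
# Third '|'-delimited field of each sample row, mirroring the query result.
data = ["1|2|3|4", "a|b|c|d", "John|Joe|Bob, Jr|Alex"]
third = [line.split("|")[2] for line in data]
print(third)  # ['3', 'c', 'Bob, Jr'] — matching the Hive 0.13 output
```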
user2018791
  • Can you look at this in BigInsights for me to test something? Go to files and then the Catalog Table and click on the table and tell me if it works the same? We didn't check using Hive because it's millions of records, we only happened to notice when an early record showed that behavior when we viewed it through the catalog table. Wondering if it's different in BI vs Hive or if it's Hive .12 vs .13 – Robert Wertz Mar 27 '15 at 14:23
  • Sorry, I have not even heard of BigInsights and have no idea here. – user2018791 Mar 27 '15 at 15:02
  • BigInsights is IBM's version of Hadoop. No worries though, I'll see if I can find a workaround. Good to know upgrading will solve it at least. – Robert Wertz Mar 27 '15 at 15:06