0

I am new to Hive so please be gentle if my question is noobies :-)

I use the following hive statement to create and load data into a table.

CREATE TABLE entities_extract (doc_id STRING, name STRING, type STRING, len STRING, offset    STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/research/45924/hive/entities_extract';

LOAD DATA LOCAL INPATH '/home/researcher/hadoop-runnables/files/entitie_extract_by_doc.txt' OVERWRITE INTO TABLE entities_extract;

Oke so far so good, there are no errors when I execute this script. The weird thing is that when I do a select * from on the table my result shows 4 extra columns with null values

The data that goes in looks like below:

USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Chanko   PERSON   6   41086

The data that returns form the select looks like this:

USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Chanko   PERSON   6   41086   NULL    NULL    NULL    NULL

EDIT: Below a small subset of "entitie_extract_by_doc.txt"

USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Berkowitz   PERSON   9   385
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Marotolli   PERSON   939420
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Corzatt   PERSON   7   39772
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Berkowitz   PERSON   9  40314
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Corzatt   PERSON   7   40584
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Berkowitz   PERSON   9  40840
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Rich   PERSON   4   41038
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Lea   PERSON   3   41044
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Anthony   PERSON   7   41049
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Bill   PERSON   4   41062
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Nelson   PERSON   6   41067
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Barbara   PERSON   7   41078
USER.A-GovDocs-f83c6ca3-9585-4c66-b9b0-f4c3bd57ccf4   Chanko   PERSON   6   41086

I already look at my source data to see if there were 4 extra tabs, but that was not the case..

Anyone here has any idea where these 4 extra columns come from?

Kind regards,

Martijn

Tinuz
  • 1
  • 1
  • 3
  • Can you give a small sample of entitie_extract_by_doc.txt ? – Charles Menguy Jan 30 '13 at 16:31
  • Provided a small subset of entitie_extract_by_doc.txt. It seems to me that during the load of the txt file the entire line get put into doc_id and not divided across the five columns. – Tinuz Jan 30 '13 at 20:52
  • I got it working, seems that os x writes tabs differently then Debian does..dump! – Tinuz Jan 30 '13 at 22:14

2 Answers2

0

You should replace any \n that could be inside your string data that is part of the same column. Those extra \n could be creating extra columns.

select regexp_replace(mensaje,"\n"," ") from partido where id = "1e2af";
perror
  • 7,071
  • 16
  • 58
  • 85
Jipo
  • 111
  • 2
  • 5
0

Here you not creating external table , so not need to mention location. Remove location from the query , then you get correct values.