I have created an external table in Hive using the following:

create external table hpd_txt(
WbanNum INT,
YearMonthDay INT , 
Time INT, 
HourlyPrecip INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/user/hive/external';

The table is now created at the location */hive/external.

Step-1: I loaded data into this table using:

load data inpath '/input/hpd.txt' into table hpd_txt;

The data is successfully loaded into the specified path (*/external/hpd_txt).

Step-2: I deleted the table from the */hive/external path using the following:

hadoop fs -rmr /user/hive/external/hpd_txt

Questions:

  1. Why is the file deleted from its original path? (*/input/hpd.txt is deleted from HDFS, but the table is created in the */external path.)
  2. After I delete the table from HDFS as in Step-2 and run show tables; again, it still lists the table hpd_txt in the external path. So where is this coming from?

Thanks in advance.

Anoop Mamgain

4 Answers


Hive doesn't know that you deleted the files; it still expects to find them in the location you specified. You can do whatever you want in HDFS, but none of it gets communicated to Hive. You have to tell Hive when things change.

hadoop fs -rmr /user/hive/external/hpd_txt

For instance, the above command doesn't delete the table; it just removes the files. The table still exists in the Hive metastore. If you want to delete the table, then use:

drop table if exists tablename;

Since you created the table as an external table, this will drop the table from Hive; the files will remain if you haven't already removed them. If you want to delete an external table and the files the table is reading from, you can do one of the following:

  1. Drop the table and then remove the files
  2. Change the table to managed and then drop it (see the sketch after this list)
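
A minimal sketch of option 2, using the hpd_txt table from the question; the TBLPROPERTIES flag is standard Hive, but note that the 'TRUE'/'FALSE' value is case-sensitive in some versions:

-- flip the table from external to managed
ALTER TABLE hpd_txt SET TBLPROPERTIES('EXTERNAL'='FALSE');
-- dropping a managed table removes both the metastore entry and the data files
DROP TABLE IF EXISTS hpd_txt;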

Finally, the default Hive warehouse location (where managed table data is stored) is /user/hive/warehouse.
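
If you want to confirm what your installation actually uses, you can print the property from the Hive shell; the value shown below is the stock default, not something taken from the question:

SET hive.metastore.warehouse.dir;
-- typically prints: hive.metastore.warehouse.dir=/user/hive/warehouse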

invoketheshell
  • Thanks invoketheshell for clarifying! One more thing I want to know: why is the file in HDFS that is being loaded into the table removed from the path where it was initially stored using the -copyFromLocal command? – Anoop Mamgain Sep 05 '15 at 18:35
  • -copyFromLocal is the same as a -put, I believe. It should copy the file and place it in the specified destination inside HDFS. – invoketheshell Sep 08 '15 at 15:31
  • Actually I observed one thing. When I create a table at a location, say /dir......, and then load the data into the table (load data inpath....), the file is no longer in the earlier path (the one it was copied to) but is now present in the /dir path. Is it because the data is still in HDFS and only its path has changed? – Anoop Mamgain Sep 08 '15 at 16:31
  • Sorry, it is not clear to me what you are asking here. – invoketheshell Sep 08 '15 at 16:39
  • Ideally the source file should not be removed. I cannot replicate your scenario; I see the file in both the normal and HDFS file systems. – Ravindra babu Sep 09 '15 at 04:19
  • @sunrise76 -- sorry if I wasn't clear. Let me try once again. Step 1: I loaded a text file into HDFS using -copyFromLocal (let's say txt_file.txt) to the path /input/txt_file.txt. Step 2: I created an external table "ext_tble" using Hive at location /user/hive/external. Step 3: I used the query "load data inpath '/input/txt_file.txt' into table ext_tble;" to load the table with the data. Now if I go to the /input/ path, the txt_file is not there, and when I navigate to the /user/hive/external/ path, the txt_file is present there. Not sure what I am doing wrong. Please help. – Anoop Mamgain Sep 09 '15 at 04:53
  • Since you already have the file in HDFS, do not use load data inpath. Directly provide the HDFS file location in create table. – Ravindra babu Sep 09 '15 at 05:43
  • Have a look at: http://stackoverflow.com/questions/7567605/how-to-load-data-to-hive-from-hdfs-without-removing-the-source-file – Ravindra babu Sep 09 '15 at 05:52

The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for it. This comes in handy if you already have the data generated; otherwise, the data has to be loaded conventionally, or by creating a file in the directory the Hive table points to. When you drop an EXTERNAL table, the data in the table is NOT deleted from the file system. An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.

Source: Hive docs

So, in your Step-2, removing the file /user/hive/external/hpd_txt removes the data the table points to, but the table itself still exists and continues to point to hdfs://localhost:9000/user/hive/external, as it was created.
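
You can see this for yourself from the Hive shell (a small sketch reusing the question's table, assuming the removed file was the only one in that location):

show tables;            -- hpd_txt is still listed, because it lives in the metastore
select * from hpd_txt;  -- returns no rows: the location no longer holds any data files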

@Anoop: Not sure if this answers your question. Let me know if you have any further questions.

Anil
  • Hi Anil, thanks for replying. Yes, it helped me clear one doubt, but the main doubt is still pending (I have mentioned it as a comment on the answer above). Please help :) – Anoop Mamgain Sep 09 '15 at 04:55
  • @Anoop May I know what exactly your concern is... or maybe you can just include it as part of your question. – Anil Sep 09 '15 at 14:07
  1. Do not use the load data inpath command. The LOAD operation MOVEs (it does not COPY) the data into the corresponding Hive table's location. Use put or copyFromLocal to copy a file from the local file system into HDFS, then simply provide that HDFS file location in the create table statement (see the sketch after this list).

  2. Dropping an external table does not remove the HDFS files from disk; that is the advantage of an external table. An external table stores only the metadata needed to access the data files, not the data itself, so if you drop the table, the data files are untouched in their HDFS location. In the case of internal (managed) tables, however, both the metadata and the data are removed when you drop the table.
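
A minimal sketch of the put-then-create workflow from point 1; the local file name and HDFS directory here are illustrative, not taken from the question:

# copy the local file into a dedicated HDFS directory; the source stays on the local disk
hadoop fs -mkdir /input/hpd
hadoop fs -put /home/user/hpd.txt /input/hpd/

The external table can then be created with location '/input/hpd/', exactly as the last answer below shows, and no files are moved afterwards.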

Ravindra babu

After going through your helpful comments and other posts, I have found the answer to my question. If I use the LOAD DATA INPATH command, it "moves" the source file to the location where the external table was created. Although the file won't be affected when the table is dropped, having its location changed is not good. So use LOAD DATA LOCAL INPATH when loading data into internal tables.

To load data into an external table from a file already located in HDFS, use the LOCATION clause in the CREATE TABLE query to point at the source file's directory, for example:

create external table hpd(WbanNum string,
YearMonthDay string , 
Time string, 
hourprecip string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
stored as textfile
location 'hdfs://localhost:9000/input/hpd/';

This sample location points to the data already present in HDFS at that path, so there is no need to use the LOAD DATA INPATH command here.
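
As a quick sanity check (a hedged example; the column names come from the create statement above), you can query the table immediately:

select WbanNum, hourprecip from hpd limit 5;
-- rows are read in place from /input/hpd/; the source file is not moved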

It's good practice to store source files in their own dedicated directories, so that there is no ambiguity when external tables are created and the data sits in a properly managed directory structure.

Thanks a lot for helping me understand this concept guys! Cheers!

Anoop Mamgain