  1. I added a CSV file to HDFS using an R script.

  2. I update this CSV by replacing it with a new CSV or appending data to it.

  3. I created a table over this CSV in Hive, using Hue.

  4. I altered it to be an external table.

Now, when the data changes in the HDFS location, will the Hive table be updated automatically?

bartektartanus
systemdebt

1 Answer


That's the thing with external (and also managed) tables in Hive: they're not really tables. You can think of them as links to an HDFS location. So whenever you query an external table, Hive reads all the data from the location you specified when you created the table, which means changes to the files there show up in the next query.

From the Hive documentation:

An EXTERNAL table points to any HDFS location for its storage, rather than being stored in a folder specified by the configuration property hive.metastore.warehouse.dir.
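
To make the "link to a location" idea concrete, here is a toy Python model. It is not Hive and uses no Hive API: a hypothetical `query_table` function simply re-reads every CSV file in a directory each time it is called, and a temporary local directory stands in for the HDFS location. All names are assumptions for illustration.

```python
import csv
import tempfile
from pathlib import Path


def query_table(location: Path) -> list:
    """Toy stand-in for SELECT *: scan every CSV file under `location` right now."""
    rows = []
    for path in sorted(location.glob("*.csv")):
        with path.open(newline="") as f:
            rows.extend(csv.reader(f))
    return rows


# A temporary directory plays the role of the table's HDFS location (assumption).
location = Path(tempfile.mkdtemp())
(location / "part1.csv").write_text("1,alice\n")
first = query_table(location)

# Append to the existing file and add a new one, as in the question.
with (location / "part1.csv").open("a") as f:
    f.write("2,bob\n")
(location / "part2.csv").write_text("3,carol\n")
second = query_table(location)

print(len(first), len(second))  # prints: 1 3
```

Nothing is copied into the "table" in this model; each call sees whatever files are in the directory at that moment, which is the behaviour the answer describes.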

bartektartanus
  • Actually, even "managed" tables are links to HDFS directories, and whatever happens to the data files at HDFS level will show in the next Hive query *(which opens the door to a lot of funny things, e.g. backup/restore at HDFS level)*. The real difference is that a `drop` command on a "managed" table will nix the HDFS directory; on an EXTERNAL table it will leave the files & directory alone. – Samson Scharfrichter Jun 06 '16 at 16:03
  • Yes, managed tables are also not real tables :) I've edited my answer. – bartektartanus Jun 06 '16 at 17:36
  • Does Hive pull external/internal table data into memory when a SELECT is run against the table? If it doesn't keep the data in memory, I'm wondering how a query runs faster the second time than the first. – NK7983 Oct 16 '20 at 16:52
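
The `drop` difference that Samson Scharfrichter describes in the comments can be sketched with another toy Python model. Again, this is not Hive: `drop_table` is a hypothetical function, and temporary local directories stand in for the HDFS directories behind a managed and an external table.

```python
import shutil
import tempfile
from pathlib import Path


def drop_table(location: Path, external: bool) -> None:
    """Toy stand-in for DROP TABLE: only a managed table's directory is removed."""
    if not external:
        shutil.rmtree(location)
    # For an external table only the table definition would go; files stay put.


# Two temporary directories play the role of the tables' storage (assumption).
managed_dir = Path(tempfile.mkdtemp())
external_dir = Path(tempfile.mkdtemp())
for d in (managed_dir, external_dir):
    (d / "data.csv").write_text("1,alice\n")

drop_table(managed_dir, external=False)
drop_table(external_dir, external=True)

print(managed_dir.exists(), (external_dir / "data.csv").exists())  # prints: False True
```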