
When I run the DROP DATABASE command, Spark deletes the database directory and all its subdirectories on HDFS. How can I avoid this?
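For example, something along these lines (mydb stands in for the real database name):

    DROP DATABASE mydb CASCADE;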

Looper

1 Answer


Short answer:

Unless your database contains only external tables whose data lives outside the database's HDFS directory, there is no way to avoid the deletion without copying all of your data to another location in HDFS.

Long answer:

From the following website: https://www.oreilly.com/library/view/programming-hive/9781449326944/ch04.html

By default, Hive won’t permit you to drop a database if it contains tables. You can either drop the tables first or append the CASCADE keyword to the command, which will cause Hive to drop the tables in the database first.

Using the RESTRICT keyword instead of CASCADE is equivalent to the default behavior, where existing tables must be dropped before dropping the database.

When a database is dropped, its directory is also deleted.
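In SQL, those two forms look roughly like this (mydb is a placeholder name):

    -- fails if mydb still contains tables (the default / RESTRICT behavior)
    DROP DATABASE mydb RESTRICT;

    -- drops every table in mydb first, then the database and its directory
    DROP DATABASE mydb CASCADE;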

You can copy the data to another location before dropping the database. I know it's a pain - but that's how Hive operates.
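One rough way to do that copy entirely in SQL (rather than with hdfs dfs -cp or distcp) is to create a backup database at a different location and copy each table into it; all names and paths below are placeholders:

    -- backup database stored outside the directory that will be deleted
    CREATE DATABASE mydb_backup LOCATION '/data/backup/mydb_backup.db';

    -- copy each table's data into the backup database
    CREATE TABLE mydb_backup.some_table AS SELECT * FROM mydb.some_table;

    -- now the original database can be dropped
    DROP DATABASE mydb CASCADE;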

If you were trying to just drop a table without deleting the HDFS directory of the table, there's a solution for this described here: Can I change a table from internal to external in hive?

Dropping an external table preserves the HDFS location for the data.
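For reference, the conversion described in that linked answer boils down to a table property change (the table name here is a placeholder):

    -- mark the table as external so DROP TABLE leaves its HDFS files in place
    ALTER TABLE mydb.some_table SET TBLPROPERTIES ('EXTERNAL'='TRUE');

    -- removes the table from the metastore only; the data directory survives
    DROP TABLE mydb.some_table;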

Cascading the database drop to the tables after converting them to external will not fix this, because dropping the database removes the entire HDFS directory the database resides in. You would still need to copy the data to another location.

If you create a database from scratch in which every table is external and references a location outside the database's HDFS directory, dropping that database will preserve the data. But if your data currently lives inside the database's HDFS directory, you won't get this behavior; it's something you would have to set up from scratch.
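A minimal sketch of that setup, assuming you control where the table data lives (all paths and names are placeholders):

    -- database directory that will be deleted on DROP DATABASE
    CREATE DATABASE mydb LOCATION '/warehouse/mydb.db';

    -- table data kept outside the database directory
    CREATE EXTERNAL TABLE mydb.some_table (id INT, name STRING)
    LOCATION '/data/external/some_table';

    -- deletes /warehouse/mydb.db, but /data/external/some_table is preserved
    DROP DATABASE mydb CASCADE;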

Jonathan Myers