I'm writing a program that does a daily upload to S3 of all our Hive tables from a particular database. The database contains records going back many years, though, and is far too large for a full copy/distcp every day.
I want to scan the entire HDFS directory that contains the database and grab only the files whose last modified date is after a specified (input) date.
I would then distcp just those matching files to S3. (If it's easier to write the paths/names of the matching files to a separate file and run distcp against that list, that's fine too.)
Looking online, I found that I can sort the files by their last modified date using the -t flag, so I started out with something like this:

hdfs dfs -ls -R -t <path_to_db>

but that isn't enough on its own. It prints something like 500,000 files, and I still need to figure out how to drop the ones that were modified before the input date...
EDIT: I'm writing a Python script, sorry for not clarifying initially!
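To make the idea concrete, here's roughly the shape of what I have so far. The warehouse path, the cutoff date, and the output filename are just placeholders, and I'm assuming there are no spaces in the file paths:

```python
import subprocess
from datetime import datetime

CUTOFF = datetime(2023, 1, 1)              # placeholder; real value is the input date
DB_PATH = "/apps/hive/warehouse/my_db.db"  # placeholder warehouse path

# Single recursive listing; each file line looks like:
# <perms> <repl> <owner> <group> <size> <yyyy-MM-dd> <HH:mm> <path>
proc = subprocess.Popen(
    ["hdfs", "dfs", "-ls", "-R", DB_PATH],
    stdout=subprocess.PIPE,
    text=True,
)

with open("files_to_copy.txt", "w") as out:
    for line in proc.stdout:
        parts = line.split()
        # Skip directories and any lines that aren't file entries.
        if len(parts) < 8 or parts[0].startswith("d"):
            continue
        modified = datetime.strptime(f"{parts[5]} {parts[6]}", "%Y-%m-%d %H:%M")
        if modified > CUTOFF:
            out.write(parts[7] + "\n")

proc.wait()
```

The plan would then be to copy files_to_copy.txt up to HDFS and hand it to distcp via its -f (source file list) option, pointing at the S3 bucket, though I haven't verified that part yet.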
EDIT pt2: I should note that I need to traverse several thousand, possibly several hundred thousand, files. The basic script above works, but it takes an incredibly long time to run. I need a way to speed up the process.
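One idea I've been toying with, but haven't benchmarked, is to fan the listing out per table directory instead of issuing one enormous recursive listing in a single thread. A rough sketch of what I mean, again with the path and cutoff as placeholders:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

CUTOFF = datetime(2023, 1, 1)              # placeholder cutoff date
DB_PATH = "/apps/hive/warehouse/my_db.db"  # placeholder warehouse path

def list_table_dirs(db_path):
    """Return the top-level table directories under the database path."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", db_path],
        capture_output=True, text=True, check=True,
    )
    return [
        line.split()[-1]
        for line in result.stdout.splitlines()
        if line.startswith("d")  # keep directory entries only
    ]

def files_after_cutoff(table_dir):
    """Recursively list one table directory, keep files newer than CUTOFF."""
    result = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", table_dir],
        capture_output=True, text=True, check=True,
    )
    matches = []
    for line in result.stdout.splitlines():
        parts = line.split()
        if len(parts) < 8 or parts[0].startswith("d"):
            continue
        modified = datetime.strptime(f"{parts[5]} {parts[6]}", "%Y-%m-%d %H:%M")
        if modified > CUTOFF:
            matches.append(parts[7])
    return matches

# Threads are fine here because the real work happens in the hdfs subprocesses.
with ThreadPoolExecutor(max_workers=8) as pool:
    per_table = pool.map(files_after_cutoff, list_table_dirs(DB_PATH))

with open("files_to_copy.txt", "w") as out:
    for paths in per_table:
        for path in paths:
            out.write(path + "\n")
```

My assumption is that most of the time goes to spawning the hdfs CLI and parsing one huge listing serially, so splitting the work per table should help, but I haven't verified that. Is this a reasonable direction, or is there a better way to filter HDFS files by modification date at this scale?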