
I'm writing a program for a daily upload to S3 of all our Hive tables from a particular database. The database contains records from many years back, however, and is far too large for a full copy/distcp every day.

I want to search the entire HDFS directory that contains the database and grab only the files whose last modified date is after a specified (input) date.

I will then distcp all of these matching files to S3. (If I need to write just the paths/names of the matching files to a separate file and distcp from that list instead, that's fine too.)

Looking online, I've found that I can sort the files by their last modified date using the -t flag, so I started out with something like this: hdfs dfs -ls -R -t <path_to_db>. But this isn't enough: it prints something like 500,000 files, and I still need to figure out how to trim out the ones from before the input date...

EDIT: I'm writing a Python script, sorry for not clarifying initially!

EDIT pt2: I should note that I need to traverse several thousand, or even several hundred thousand, files. I've written a basic script in an attempt to solve my problem (simplified sketch below), but it takes an incredibly long time to run. I need a way to speed up the process...
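Here's roughly what the script does (the database path and cutoff date below are placeholders; it assumes the standard -ls output columns of permissions, replication, owner, group, size, date, time, path):

import subprocess
from datetime import datetime

def files_modified_after(db_path, cutoff):
    # One -ls -R call for the whole tree, then filter in Python.
    out = subprocess.run(
        ["hdfs", "dfs", "-ls", "-R", db_path],
        capture_output=True, text=True, check=True,
    ).stdout

    matches = []
    for line in out.splitlines():
        parts = line.split()
        if len(parts) < 8 or parts[0].startswith("d"):
            continue  # skip the "Found N items" header and directories
        # Columns 5 and 6 are the modification date and time.
        modified = datetime.strptime(parts[5] + " " + parts[6], "%Y-%m-%d %H:%M")
        if modified > cutoff:
            matches.append(parts[7])  # assumes no spaces in paths
    return matches

# Write the matching paths to a file for distcp -f.
paths = files_modified_after("/user/hive/warehouse/mydb.db", datetime(2021, 1, 1))
with open("modified_files.txt", "w") as f:
    f.write("\n".join(paths))

The idea is then to put that list somewhere distcp can read it and run hadoop distcp -f <list-uri> <s3-target>.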

tprebenda

2 Answers


I'm not sure if you use Java, but here is an example of what it can do. I made some small modifications to use the last modified time.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// For date conversion from epoch milliseconds to human readable.
import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Date;

public class FileStatusChecker {
    public static void main(String[] args) throws Exception {
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            String hdfsFilePath = "hdfs://My-NN-HA/Demos/SparkDemos/inputFile.txt";
            // Pass in your own HDFS path here. For a recursive listing,
            // FileSystem#listFiles(path, true) returns a RemoteIterator instead.
            FileStatus[] status = fs.listStatus(new Path(hdfsFilePath));

            for (int i = 0; i < status.length; i++) {
                long lastModifiedTimeLong = status[i].getModificationTime();
                Date lastModifiedTimeDate = new Date(lastModifiedTimeLong);
                DateFormat df = new SimpleDateFormat("EEE, d MMM yyyy HH:mm:ss Z");
                System.out.println("The file '" + status[i].getPath()
                        + "' was last modified at: " + df.format(lastModifiedTimeDate));
            }
        } catch (Exception e) {
            System.out.println("File not found");
            e.printStackTrace();
        }
    }
}

It would enable you to create a list of files and do "things" with them, e.g. keep only the ones whose modification time is after your cutoff.

Matt Andruff

You can use WebHDFS to pull the exact same information: https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html

This might be more friendly to use with Python.

Examples:

Status of a File/Directory: submit an HTTP GET request.

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=GETFILESTATUS"

The client receives a response with a FileStatus JSON object:

HTTP/1.1 200 OK
Content-Type: application/json
Transfer-Encoding: chunked

{
  "FileStatus":
  {
    "accessTime"      : 0,
    "blockSize"       : 0,
    "group"           : "supergroup",
    "length"          : 0,             //in bytes, zero for directories
    "modificationTime": 1320173277227,
    "owner"           : "webuser",
    "pathSuffix"      : "",
    "permission"      : "777",
    "replication"     : 0,
    "type"            : "DIRECTORY"    //enum {FILE, DIRECTORY}
  }
}
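For instance, a minimal version of that call from Python might look like this (using the requests library; the host, port, and path are placeholders):

import requests

# Placeholders: substitute your NameNode's host/port and your HDFS path.
# (9870 is the default NameNode web port in Hadoop 3; older clusters use 50070.)
url = "http://namenode:9870/webhdfs/v1/user/hive/warehouse/mydb.db"
status = requests.get(url, params={"op": "GETFILESTATUS"}).json()["FileStatus"]
print(status["modificationTime"])  # epoch milliseconds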

List a Directory: submit an HTTP GET request.

curl -i "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"

The client receives a response with a FileStatuses JSON object:

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: 427

{
  "FileStatuses":
  {
    "FileStatus":
    [
      {
        "accessTime"      : 1320171722771,
        "blockSize"       : 33554432,
        "group"           : "supergroup",
        "length"          : 24930,
        "modificationTime": 1320171722771,
        "owner"           : "webuser",
        "pathSuffix"      : "a.patch",
        "permission"      : "644",
        "replication"     : 1,
        "type"            : "FILE"
      },
      {
        "accessTime"      : 0,
        "blockSize"       : 0,
        "group"           : "supergroup",
        "length"          : 0,
        "modificationTime": 1320895981256,
        "owner"           : "szetszwo",
        "pathSuffix"      : "bar",
        "permission"      : "711",
        "replication"     : 0,
        "type"            : "DIRECTORY"
      },
      ...
    ]
  }
}
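Putting it together, a sketch of a recursive walk over LISTSTATUS in Python that keeps only files modified after a cutoff (again, the host, port, and paths are placeholders; modificationTime is epoch milliseconds):

import requests
from datetime import datetime, timezone

BASE = "http://namenode:9870/webhdfs/v1"  # placeholder NameNode host/port

def newer_than(session, path, cutoff_ms):
    # One LISTSTATUS request per directory; recurse into subdirectories.
    resp = session.get(BASE + path, params={"op": "LISTSTATUS"})
    resp.raise_for_status()
    for st in resp.json()["FileStatuses"]["FileStatus"]:
        child = path.rstrip("/") + "/" + st["pathSuffix"]
        if st["type"] == "DIRECTORY":
            yield from newer_than(session, child, cutoff_ms)
        elif st["modificationTime"] > cutoff_ms:  # epoch milliseconds
            yield child

cutoff = int(datetime(2021, 1, 1, tzinfo=timezone.utc).timestamp() * 1000)
with requests.Session() as s, open("modified_files.txt", "w") as f:
    for p in newer_than(s, "/user/hive/warehouse/mydb.db", cutoff):
        f.write(p + "\n")

Since LISTSTATUS returns a whole directory per round trip, this is one HTTP request per directory rather than one per file, which matters at the scale mentioned in the question.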

Matt Andruff