I want to filter some files by date (I can't use find, because the files are in HDFS). The solution I found is to use awk.

This is an example of the data that I want to process:

drwxrwx--x+  - hive     hive                  0 2019-01-01 20:02 /dat1
drwxrwx--x+  - hive     hive                  0 2019-01-02 16:38 /dat2
drwxrwx--x+  - hive     hive                  0 2019-01-03 16:59 /dat3

If I use this command:

$ ls -l |awk '$6 > "2019-01-02"'
drwxrwx--x+  - hive     hive                  0 2019-01-03 16:59 /dat3

it works without problems. But I want to write a script that filters on the date from 2 days ago, so I tried to put this expression inside the awk command:

$ date +%Y-%m-%d --date='-2 day'
2019-01-02

My attempt looks something like this, but it isn't working:

ls -l |awk '$6 >" date +%Y-%m-%d --date=\'-2 day\'"'   
>

It's like something is missing, but I don't know what it is.

kvantour
Skiel
  • There must be ways with `hadoop` or other commands to do this correctly. I found some references [here](https://issues.apache.org/jira/browse/HADOOP-8989) as well as [here](https://stackoverflow.com/questions/32896393/is-there-the-equivalent-for-a-find-command-in-hadoop) and [here](https://stackoverflow.com/a/39514961/8344060). But I cannot test them. – kvantour Jan 04 '19 at 11:53

2 Answers


First of all, never try to parse the output of ls.

If you want the files or directories in /path/to/dir/ that are at most n days old (n = 2 in this example), use find:

$ find /path/to/dir -type f -mtime -2 -print
$ find /path/to/dir -type d -mtime -2 -print

The first one is for files, the second for directories.
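To see what -mtime -2 selects, here is a minimal sketch on throwaway files. It assumes GNU touch and GNU find on a local filesystem; the scratch directory and the file names are made up for the demonstration.

```shell
#!/usr/bin/env bash
# Sketch only: GNU touch/find assumed; names below are hypothetical.
demo_dir="$(mktemp -d)"                      # scratch directory for the demo

touch "$demo_dir/recent.txt"                 # modified just now
touch -d '5 days ago' "$demo_dir/old.txt"    # backdated modification time

# -mtime -2 matches entries modified less than 2*24 hours ago,
# so only recent.txt should be reported.
recent_files="$(find "$demo_dir" -type f -mtime -2 -print)"
echo "$recent_files"

rm -rf "$demo_dir"                           # clean up the scratch directory
```

Only the path ending in recent.txt should be listed; the backdated file falls outside the 2-day window.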

If you still want to parse ls with awk, you might try something like this:

$ ls -l | awk -v d="$(date -d '2 days ago' '+%F')" '$6 > d'

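The problem you are having is the quoting. Inside single quotes the shell does not treat \' as an escaped quote, so the first \' ends the awk program and the remaining quotes are unbalanced; that is why the shell shows the > continuation prompt. Also, the date command is never executed inside the awk string, so awk would only compare against its literal text. Compute the date in the shell and pass it to awk with -v instead.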
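As a rough sketch of the same idea in script form, run against the sample listing from the question: the helper name filter_newer_than is hypothetical, and a fixed cutoff is used here instead of date so the result is reproducible.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep listing lines whose 6th field (a YYYY-MM-DD date)
# is later than the cutoff passed as the first argument.
filter_newer_than() {
    # -v copies the shell value into the awk variable "cutoff"; ISO dates
    # sort correctly as plain strings, so a string comparison is enough.
    awk -v cutoff="$1" '$6 > cutoff'
}

# In a real script the cutoff would come from date (GNU syntax):
#   cutoff="$(date -d '2 days ago' '+%F')"
# A fixed value is used here so the sample data gives a stable result.
cutoff='2019-01-02'

# Sample listing copied from the question, fed in on standard input.
filter_newer_than "$cutoff" <<'EOF'
drwxrwx--x+  - hive     hive                  0 2019-01-01 20:02 /dat1
drwxrwx--x+  - hive     hive                  0 2019-01-02 16:38 /dat2
drwxrwx--x+  - hive     hive                  0 2019-01-03 16:59 /dat3
EOF
```

With this sample data only the /dat3 line is printed, which matches the output of the hard-coded comparison in the question. Because the helper reads standard input, the output of whatever listing command is available could be piped into it the same way.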

kvantour

Parsing the output of ls to work with file modification times is generally not recommended. But if you stick to the yyyymmdd format, the workaround below will help. I use this hack for my daily chores, since it lets awk compare the dates as plain numbers.

$ ls -l --time-style '+%Y%m%d' delete_5lines.txt jobinfo.txt stan.in sample.dat report.txt
-rw-r--r-- 1 user1234 unixgrp    34 20181231 delete_5lines.txt
-rw-r--r-- 1 user1234 unixgrp   226 20190101 jobinfo.txt
-rw-r--r-- 1 user1234 unixgrp  7120 20190104 report.txt
-rw-r--r-- 1 user1234 unixgrp 70555 20190104 sample.dat
-rw-r--r-- 1 user1234 unixgrp    58 20190103 stan.in

Get files after Jan-3rd

$ ls -l --time-style '+%Y%m%d' delete_5lines.txt jobinfo.txt stan.in sample.dat report.txt |  awk ' $6>20190103' 
-rw-r--r-- 1 user1234 unixgrp  7120 20190104 report.txt
-rw-r--r-- 1 user1234 unixgrp 70555 20190104 sample.dat

Get files on or after Jan-3rd

$ ls -l --time-style '+%Y%m%d' delete_5lines.txt jobinfo.txt stan.in sample.dat report.txt |  awk ' $6>=20190103' 
-rw-r--r-- 1 user1234 unixgrp  7120 20190104 report.txt
-rw-r--r-- 1 user1234 unixgrp 70555 20190104 sample.dat
-rw-r--r-- 1 user1234 unixgrp    58 20190103 stan.in

Exactly Jan-3rd

$ ls -l --time-style '+%Y%m%d' delete_5lines.txt jobinfo.txt stan.in sample.dat report.txt |  awk ' $6==20190103' 
-rw-r--r-- 1 user1234 unixgrp    58 20190103 stan.in
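The fixed date in these comparisons could also be computed, which is what the question asks for. A minimal sketch, assuming GNU date and GNU ls; the helper name on_or_after is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: keep lines whose 6th field (a yyyymmdd number) is on
# or after the cutoff given as the first argument.
on_or_after() {
    # Adding 0 forces awk to compare both sides as numbers.
    awk -v cutoff="$1" '($6 + 0) >= (cutoff + 0)'
}

# Cutoff two days back, in the same yyyymmdd form (GNU date assumed).
cutoff="$(date -d '2 days ago' '+%Y%m%d')"

# GNU ls assumed for --time-style; this lists the current directory.
ls -l --time-style '+%Y%m%d' | on_or_after "$cutoff"
```

The "total" line that ls -l prints first has no 6th field, so it evaluates to 0 and is dropped by the comparison.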

You can alias it like

$ alias lsdt=" ls -l --time-style '+%Y%m%d' "

and use it like

$ lsdt jobinfo.txt stan.in sample.dat report.txt

Note: Again, avoid this in scripts; keep it for interactive day-to-day chores.

stack0114106