I have 1 year of data in my HDFS location and I want to copy the data for the last 6 months into another HDFS location. Is it possible to copy only 6 months of data directly with an hdfs command, or do we need to write a shell script to do it?

I have tried hdfs commands for this, but they didn't work.

I tried the below shell script. It works fine up to creating the TempFile, but then it throws an error:

$ sh scriptnew.sh
scriptnew.sh: line 8: syntax error: unexpected end of file

and the script is not executed further.

Below is the shell script I used.

#!/bin/bash
hdfs dfs -ls /hive/warehouse/data.db/all_history/ |awk 'BEGIN{ SIXMON=60*60*24*180; "date +%s" | getline NOW } { cmd="date -d'\''"$6" "$7"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-SIXMON; if(WHEN > DIFF){print $8}}' >> TempFile.txt
cat TempFile.txt |while read line
do
    echo "$line"
    hdfs dfs -cp -p "$line" /user/can_anns/all_history_copy/;
done

What might be the error and how to resolve this ?

Antony

3 Answers


To copy the last 6 months of files from one HDFS location to another, we can use the below script.

The script should be run from your local Linux machine.

#!/bin/bash
hdfs dfs -ls /hive/warehouse/data.db/all_history/ |awk 'BEGIN{ SIXMON=60*60*24*180; "date +%s" | getline NOW } { cmd="date -d'\''"$6" "$7"'\'' +%s"; cmd | getline WHEN; DIFF=NOW-SIXMON; if(WHEN > DIFF){print $8}}' >> TempFile.txt
cat TempFile.txt |while read line
do
   echo "$line"
   hdfs dfs -cp -p "$line" /user/can_anns/all_history_copy/;
done

Line 2: the awk command selects the files whose modification time falls within the last 180 days (60*60*24*180 seconds) and appends their paths to TempFile.txt. We then iterate through this temp file and copy each listed file.
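The 180-day cutoff computed in the awk BEGIN block can be sketched on its own; a minimal illustration, assuming GNU date (the two sample timestamps are hypothetical):

```shell
# The cutoff the awk script computes: NOW - 60*60*24*180 seconds.
cutoff=$(date -d '180 days ago' +%s)

# Two hypothetical file timestamps: one inside the window, one outside.
recent=$(date -d '30 days ago' +%s)
old=$(date -d '300 days ago' +%s)

# A file is copied only when its timestamp is newer than the cutoff.
[ "$recent" -gt "$cutoff" ] && echo "recent file: copy"
[ "$old" -gt "$cutoff" ] || echo "old file: skip"
```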

If you write the script on Windows and copy it to a Linux machine, it sometimes fails with a syntax error because of carriage-return characters. To avoid this, after copying the script to a local path on the Linux machine, strip the carriage returns with `sed -i 's/\r//' FileName.sh`, then run the script: `sh FileName.sh`
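The carriage-return failure can be reproduced and fixed in a few lines; a minimal sketch using a temporary file (GNU sed assumed for `-i`):

```shell
# Simulate a script copied from Windows: every line ends in \r\n.
f=$(mktemp)
printf '#!/bin/bash\r\necho hello\r\n' > "$f"

# cat -A makes each carriage return visible as ^M before the $ line end.
cat -A "$f"

# Strip the carriage returns in place, then the script runs cleanly.
sed -i 's/\r//' "$f"
sh "$f"        # now prints: hello
rm -f "$f"
```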

Antony

I think you can do it with a shell script like the one below, in three runs. It's just a modified version of your script. I tried it and it works for me.

In each run, modify the grep condition to the required month (2019-03, 2019-02, 2019-01).

Script:

hdfs dfs -ls /hive/warehouse/data.db/all_history/|grep "2019-03"|awk '{print $8}' >> Files.txt
cat Files.txt |while read line
do
    echo "$line"
    hdfs dfs -cp "$line" /user/can_anns/all_history_copy/;
done
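If you would rather cover all three months in a single run instead of three, the grep can be widened to a character-class pattern. A sketch against a hypothetical `hdfs dfs -ls` listing (the paths are made up; the date is assumed to appear in column 6 as YYYY-MM-DD):

```shell
# Hypothetical `hdfs dfs -ls` output: perms, repl, owner, group, size, date, time, path.
printf '%s\n' \
  '-rw-r--r-- 3 can_anns hdfs 1024 2019-01-15 10:00 /hive/warehouse/data.db/all_history/part-0001' \
  '-rw-r--r-- 3 can_anns hdfs 1024 2019-03-02 10:00 /hive/warehouse/data.db/all_history/part-0002' \
  '-rw-r--r-- 3 can_anns hdfs 1024 2018-12-30 10:00 /hive/warehouse/data.db/all_history/part-0003' |
  grep -E ' 2019-0[1-3]-' | awk '{print $8}'
# Prints part-0001 and part-0002 only. In the real pipeline, replace the printf
# with `hdfs dfs -ls /hive/warehouse/data.db/all_history/` and feed the result
# into the same copy loop as above.
```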

Hope that helps!

Gomz
  • The above code is not working if I give "2019-03", but it works for a particular date like "2019-03-01". Is there any script available to run this in a single go? – Antony Jul 17 '19 at 07:02
  • Can you try just the first line and see what result it returns. It seems to be working for me. `hdfs dfs -ls -R /user/hive/warehouse/|grep "2019-07"|awk '{print $8}'` returns `/user/hive/warehouse/customers /user/hive/warehouse/customers/customers /user/hive/warehouse/sample_07` – Gomz Jul 17 '19 at 11:26
  • Printing works, but copying fails with a syntax error on the -cp command. – Antony Jul 18 '19 at 03:50
  • What error do you see? Can you paste the complete syntax error here please? – Gomz Jul 18 '19 at 05:05
  • @Antony- try ( for filename in `hadoop fs -ls /hive/warehouse/data.db/all_history/copy_55 | grep '.*2019-[10-12].*' | awk '{print $8}'`; do hdfs dfs -cp $filename /user/can_anns/all_history_copy/; done ) post an error in case you get any – vikrant rana Jul 19 '19 at 00:53
  • @Gomathinayagam : I have updated the question with the progress and latest error. please help! – Antony Jul 19 '19 at 05:00
  • @vikrant rana : I have updated the question with the progress and latest error. – Antony Jul 19 '19 at 05:01
  • @Antony - The script works like a charm to me. I am not sure if you have extra lines in the shell script. As you can see, the error occurs from line #8. But there is no 8th line in the script you have updated in the question. Please check if you have anything added at the end of file. You can retype the script once in a new file and try. Hopefully it should work. – Gomz Jul 19 '19 at 05:20
  • I tried with a new file also, but still the same error. I don't have any other character near *done* either. – Antony Jul 19 '19 at 05:34
  • Can you try doing a `wc -l scriptnew.sh` and see the number of lines in it? It should not be 8 if I am right. – Gomz Jul 19 '19 at 05:37
  • @Gomz I tried **sed -i 's/\r//' script.sh** after copying my file from Windows to the Linux local directory and this is working fine now. – Antony Jul 19 '19 at 09:42
  • @Antony Cool! So the carriage return in the file was the problem! Thanks for updating! I did not know that you had created the file on a Windows machine. This is expected when script files are copied from Windows to Linux in a non-Linux-supported format. – Gomz Jul 19 '19 at 09:49

I assume the dataset has a date column. You could create an external Hive table over that dataset and extract just the required data.

If there is a huge number of records for a given date, a shell script works very slowly.
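A sketch of that approach, building the HiveQL from the shell; the table and column names (`data.all_history`, `event_date`) are assumptions about the dataset, and the query is only echoed here since a Hive client may not be available:

```shell
# Start of the 6-month window (GNU date).
start_date=$(date -d '180 days ago' +%Y-%m-%d)

# HiveQL that writes only the recent rows to the target HDFS directory.
# Table name data.all_history and column event_date are assumed.
query="INSERT OVERWRITE DIRECTORY '/user/can_anns/all_history_copy'
SELECT * FROM data.all_history WHERE event_date >= '${start_date}';"

echo "$query"
# Run it with a Hive client when one is available, e.g.: hive -e "$query"
```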

Karthik