
Is there a way to copy a list of files from S3 to HDFS, rather than a complete folder, using s3distcp? This is for cases where srcPattern cannot work.

I have multiple files in an S3 folder, all with different names. I want to copy only specific files to an HDFS directory. I did not find any way to specify multiple source file paths to s3distcp.

The workaround I am currently using is to list all the file names in srcPattern:

hadoop jar s3distcp.jar \
    --src s3n://bucket/src_folder/ \
    --dest hdfs:///test/output/ \
    --srcPattern '.*somefile.*|.*anotherone.*'

Can this work when the number of files is very large, say around 10,000?
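For reference, the alternation does not have to be typed by hand; it can be generated from a list of names. The following is only a minimal sketch: build_src_pattern is a hypothetical helper name, and it assumes srcPattern is matched against the whole path, which is why the result starts with `.*/`.

```shell
# Build a srcPattern alternation from file names read on stdin (one per line).
# Regex metacharacters in each name are escaped so the names match literally.
# Assumption: the pattern has to match the full path, hence the leading '.*/'.
build_src_pattern() {
  printf '.*/(%s)\n' "$(sed 's/[][\\.^$*+?(){}|]/\\&/g' | paste -sd'|' -)"
}

# Example with two hypothetical file names:
printf 'abc.txt\nx+y.log\n' | build_src_pattern
```

The result could then be passed as the value of --srcPattern. With thousands of names the pattern becomes very long and may run into the operating system's limit on argument length, so this probably does not scale to 10,000 files.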

its me

2 Answers

4

hadoop distcp should solve your problem. You can use distcp to copy data from S3 to HDFS.

It also supports wildcards, and you can provide multiple source paths in the command.

http://hadoop.apache.org/docs/r1.2.1/distcp.html

See the usage section at this URL.

Example: suppose you have the following files in an S3 bucket (test-bucket) inside the test1 folder:

abc.txt
abd.txt
defg.txt

And inside the test2 folder you have:

hijk.txt
hjikl.txt
xyz.txt

And your HDFS path is hdfs://localhost.localdomain:9000/user/test/

Then the distcp command for a particular pattern is as follows:

hadoop distcp s3n://test-bucket/test1/ab*.txt \
    s3n://test-bucket/test2/hi*.txt hdfs://localhost.localdomain:9000/user/test/
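When the files do not share a wildcard pattern, distcp can also read its sources from a list via the -f <urilist_uri> option described in the same usage section. A rough sketch of preparing such a list follows; make_uri_list and the srclist.txt location are hypothetical names used only for illustration.

```shell
# Turn bare file names (stdin, one per line) into fully qualified source URIs.
# $1 is the URI prefix; a trailing slash on it is tolerated.
make_uri_list() {
  while IFS= read -r name; do
    printf '%s/%s\n' "${1%/}" "$name"
  done
}

# Example using the bucket and folder from above:
printf 'abc.txt\nabd.txt\n' | make_uri_list s3n://test-bucket/test1/

# The saved list would then be uploaded somewhere distcp can read it, e.g.
# (hypothetical location, not run here):
#   hadoop fs -put srclist.txt hdfs://localhost.localdomain:9000/user/srclist.txt
#   hadoop distcp -f hdfs://localhost.localdomain:9000/user/srclist.txt \
#       hdfs://localhost.localdomain:9000/user/test/
```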
  • The problem is that I need to use the --compressionCodec option of s3distcp. This option is not available in distcp, which is why I can't use distcp. – its me Oct 25 '14 at 12:21
3

Yes, you can. Create a manifest file listing all the files you need and use the --copyFromManifest option, as described in the S3DistCp documentation.

Eitan Illuz
  • You mean I should write all file names (S3 paths) to the manifest file? – its me Dec 12 '14 at 12:50
  • Yes. If you want an example of a manifest file just run s3distcp with the --outputManifest option and it will generate a manifest file of all files it copied. – Eitan Illuz Dec 14 '14 at 12:27
  • I've tried this by generating a list of the 50k files I want (in the manifest format), but in this case it's unclear what to use in the required "--src" argument. – conradlee Aug 04 '16 at 18:17
  • S3DistCp will only check that it's a valid path (one that exists on HDFS or S3), but it will ignore it if you provide the manifest. – Eitan Illuz Aug 06 '16 at 06:49
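Putting the answer and comments together, the invocation might look roughly like the sketch below. It only prints the command instead of running it. The manifest path is hypothetical, the --src and --dest values are the ones from the question, and passing the manifest through --previousManifest alongside --copyFromManifest is my reading of the S3DistCp documentation, so check it against your version.

```shell
# Print (do not run) an s3distcp command that copies only the files listed
# in a manifest. $1 is the manifest path (hypothetical in the example below).
# Assumption: the manifest is supplied via --previousManifest, and
# --copyFromManifest makes s3distcp treat it as the list of files to copy.
s3distcp_from_manifest_cmd() {
  printf '%s ' hadoop jar s3distcp.jar \
    --src s3n://bucket/src_folder/ \
    --dest hdfs:///test/output/ \
    "--previousManifest=$1" \
    --copyFromManifest
  printf '\n'
}

s3distcp_from_manifest_cmd s3n://bucket/manifests/files.gz
```

As suggested in the comments, running s3distcp once with --outputManifest on a small folder is the easiest way to see the exact manifest format before generating a large one.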