
My environment uses Spark, Pig and Hive.

I am having trouble writing code in Scala (or any other language compatible with my environment) that copies a file from the local file system to HDFS.

Does anyone have any advice on how I should proceed?

Shakile

3 Answers


The other answers didn't work for me, so I am writing another one here.

Try the following Scala code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

// Loads the Hadoop settings found on the classpath (core-site.xml, hdfs-site.xml)
val hadoopConf = new Configuration()
val hdfs = FileSystem.get(hadoopConf)

// srcFilePath is the local file path and destFilePath the HDFS target path (both Strings)
val srcPath = new Path(srcFilePath)
val destPath = new Path(destFilePath)

hdfs.copyFromLocalFile(srcPath, destPath)

You should also check that the HADOOP_CONF_DIR variable is set in the conf/spark-env.sh file, so that Spark can find the Hadoop configuration settings.
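
If HADOOP_CONF_DIR is not set (or not picked up), one workaround is to point the Configuration at the Hadoop site files explicitly. This is only a sketch, and the /etc/hadoop/conf paths are examples; use wherever your cluster keeps its client configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// Example locations only; adjust to your installation
hadoopConf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
hadoopConf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))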

The dependencies for the build.sbt file:

libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0"
libraryDependencies += "org.apache.commons" % "commons-io" % "1.3.2"
libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
matheusdc

You could write a Scala job using the Hadoop FileSystem API, and use IOUtils from Apache Commons to copy the data from the InputStream to the OutputStream:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

import org.apache.commons.io.IOUtils

val hadoopconf = new Configuration()

// FileSystem handles for HDFS and for the local disk
val hdfs = FileSystem.get(hadoopconf)
val localFs = FileSystem.getLocal(hadoopconf)

// Create output stream to HDFS file
val outFileStream = hdfs.create(new Path("hdfs://<namenode>:<port>/<filename>"))

// Create input stream from local file
val inStream = localFs.open(new Path("file://<input_file>"))

// Copy the bytes from the local input stream to the HDFS output stream
IOUtils.copy(inStream, outFileStream)

// Close both streams
inStream.close()
outFileStream.close()
shanmuga
  • Thanks a lot! Dependencies: `libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0"` `libraryDependencies += "org.apache.commons" % "commons-io" % "1.3.2"` `libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"` – beloblotskiy Oct 01 '15 at 19:38

Here is something that works for S3 (adapted from the answers above):

import java.net.URI

import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes an active SparkSession named `spark` (e.g. in spark-shell or a Spark job)
def cpToS3(localPath: String, s3Path: String) = {
  // Resolve the FileSystem from the S3 URI so the S3 implementation is used
  val fs = FileSystem.get(
             new URI(s3Path),
             spark.sparkContext.hadoopConfiguration)

  val srcPath = new Path(localPath)
  val destPath = new Path(s3Path)

  fs.copyFromLocalFile(srcPath, destPath)
}
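
For example (the local path and bucket name below are hypothetical, and this assumes the S3 connector, e.g. s3a, is already configured on the cluster):

// Hypothetical paths; replace with your own file and bucket
cpToS3("/tmp/report.csv", "s3a://my-bucket/reports/report.csv")
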
John Zhu