
My environment uses Spark, Pig and Hive.

I am having trouble writing code in Scala (or any other language compatible with my environment) that copies a file from the local file system to HDFS.

Does anyone have any advice on how I should proceed?

Shakile

3 Answers


The other answers didn't work for me, so I am writing another one here.

Try the following Scala code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

// Loads the Hadoop settings found on the classpath (core-site.xml, hdfs-site.xml)
val hadoopConf = new Configuration()
val hdfs = FileSystem.get(hadoopConf)

// srcFilePath is the local file path and destFilePath the HDFS target path (both Strings)
val srcPath = new Path(srcFilePath)
val destPath = new Path(destFilePath)

hdfs.copyFromLocalFile(srcPath, destPath)

You should also check that the HADOOP_CONF_DIR variable is set in the conf/spark-env.sh file, so that Spark can find the Hadoop configuration settings.
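
If HADOOP_CONF_DIR is not set (or not picked up), one workaround is to point the Configuration at the Hadoop site files explicitly. This is only a sketch, and the /etc/hadoop/conf paths are examples; use wherever your cluster keeps its client configuration:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// Example locations only; adjust to your installation
hadoopConf.addResource(new Path("/etc/hadoop/conf/core-site.xml"))
hadoopConf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"))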

The dependencies for the build.sbt file:

libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0"
libraryDependencies += "org.apache.commons" % "commons-io" % "1.3.2"
libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"
matheusdc

You could write a Scala job using the Hadoop FileSystem API, and use IOUtils from Apache Commons to copy the data from the InputStream to the OutputStream:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path

import org.apache.commons.io.IOUtils

val hadoopconf = new Configuration()

// FileSystem handles for HDFS and for the local disk
val hdfs = FileSystem.get(hadoopconf)
val localFs = FileSystem.getLocal(hadoopconf)

// Create output stream to HDFS file
val outFileStream = hdfs.create(new Path("hdfs://<namenode>:<port>/<filename>"))

// Create input stream from local file
val inStream = localFs.open(new Path("file://<input_file>"))

// Copy the bytes from the local input stream to the HDFS output stream
IOUtils.copy(inStream, outFileStream)

// Close both streams
inStream.close()
outFileStream.close()
shanmuga
  • Thanks a lot! Dependencies: `libraryDependencies += "org.apache.hadoop" % "hadoop-common" % "2.6.0"` `libraryDependencies += "org.apache.commons" % "commons-io" % "1.3.2"` `libraryDependencies += "org.apache.hadoop" % "hadoop-hdfs" % "2.6.0"` – beloblotskiy Oct 01 '15 at 19:38

Here is something that works for S3 (adapted from the answers above):

import java.net.URI

import org.apache.hadoop.fs.{FileSystem, Path}

// Assumes an active SparkSession named `spark` (e.g. in spark-shell or a Spark job)
def cpToS3(localPath: String, s3Path: String) = {
  // Resolve the FileSystem from the S3 URI so the S3 implementation is used
  val fs = FileSystem.get(
             new URI(s3Path),
             spark.sparkContext.hadoopConfiguration)

  val srcPath = new Path(localPath)
  val destPath = new Path(s3Path)

  fs.copyFromLocalFile(srcPath, destPath)
}
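
For example (the local path and bucket name below are hypothetical, and this assumes the S3 connector, e.g. s3a, is already configured on the cluster):

// Hypothetical paths; replace with your own file and bucket
cpToS3("/tmp/report.csv", "s3a://my-bucket/reports/report.csv")
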
John Zhu