-1

I am trying to write a file to HDFS using spark-submit. When writing, I want the output split into several files, like the result of a MapReduce job, rather than a single file (e.g. part-0000, part-0001).

Here is my sample code. What options should I set?

import java.io.PrintWriter
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val conf = new Configuration()
val fs = FileSystem.get(conf)
// Opens a single file on HDFS and writes everything into it
val output = fs.create(new Path("/user/foo/test.txt"))
val writer = new PrintWriter(output)
writer.write("Hello World1\n")
writer.write("Hello World2\n")
...
writer.write("Hello World3\n")
writer.close()
myskbj
  • 31
  • 4

1 Answer

0

You can control the number of output files in Spark using repartition and coalesce. In MapReduce you control the number of output files through the number of reducers; similarly, in Spark you specify the number of partitions with repartition or coalesce:

dataRDD.repartition(2).saveAsTextFile("/user/cloudera/sqoop_import/orders_test")

As shown above, this command will save the data in two files, since we specified the partition count as 2.

You can take a look at this answer; it will help you understand.

Strick
  • 1,512
  • 9
  • 15
  • The data is not in a DataFrame. I just want to collect the output log and save it to a file. – myskbj Oct 23 '19 at 06:27
  • Can you please be more specific about what you are trying to achieve? If you are not doing any processing, why use Spark at all? You can directly write a simple program using the FileSystem API and put the files. – Strick Oct 23 '19 at 06:30
  • I want to load data from a table in a DBMS (select * from) and save it to a file in HDFS after simple data processing. – myskbj Oct 23 '19 at 06:33
  • You are right, there is no reason to use Spark. Is there a way to save output into multiple files using the FileSystem API? – myskbj Oct 23 '19 at 06:36
  • I could find this on the internet; hope it can solve your purpose: https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html and read this also: https://javadeveloperzone.com/hadoop/hadoop-multiple-outputs-example/ (see also the sketch after these comments). – Strick Oct 23 '19 at 06:38