
I'm new to Spark (although I have Hadoop and MapReduce experience) and am trying to process a giant file with one JSON record per line. I'd like to do some transformation on each line and write an output file every n records (say, 1 million). So if there are 7.5 million records in the input file, 8 output files should be generated. How can I do this? You may provide your answer in either Java or Scala.

Using Spark v2.1.0.

Abhijit Sarkar
  • Even if it's possible, why? If you have HDFS, you're going to have partition files that are going to be split around the HDFS block size... That being said, your data can (and will) be split **in the middle of a line** – OneCricketeer Mar 02 '17 at 05:14
  • In other words, there are many Spark threads and processes reading your file. You can't just say: okay, process 1, you get 1 million rows, and process 2, you get the next ones... If you are doing that, you might as well not use Spark. – OneCricketeer Mar 02 '17 at 05:16
  • @cricket_007 These files will eventually be used to populate data in Couchbase, and we may not load all of them at the same time, so we want bite-sized chunks. I chose Spark so that I can scale if I need to, but if it can't meet my requirement, I'll have to find another tool that can. – Abhijit Sarkar Mar 02 '17 at 05:18
  • Regarding your 2nd comment, if it matters, I don't care how the file is read or how many threads are doing so. I want the output to be partitioned by record number. – Abhijit Sarkar Mar 02 '17 at 05:20
  • Why can't you use Couchbase's spark connector to feed it directly? – OneCricketeer Mar 02 '17 at 05:34
  • The 2 processes are separated in time. Data processing and data loading are not connected. – Abhijit Sarkar Mar 02 '17 at 05:35
  • Have you seen this yet? http://stackoverflow.com/a/40321324/2308683 – OneCricketeer Mar 02 '17 at 05:35

1 Answer


You could use something like:

val recordsPerFile = 1000000  // the question asks for roughly 1 million records per output file
val dataCount = data.count()
val numPartitions = math.ceil(dataCount.toDouble / recordsPerFile).toInt
val newData = data.coalesce(numPartitions)
newData.saveAsTextFile("output path")

I'm on my Windows gaming computer at the moment, so this code is untested and probably contains minor errors, but in general it should work.

ref: [Spark: Cut down no. of output files](http://stackoverflow.com/questions/25967961/spark-cut-down-no-of-output-files)

As a side note, while controlling your partition size isn't a bad idea, arbitrarily deciding that you want 1 million records per partition is probably not the way to go. In general, you tune partition sizes to optimize your cluster utilization.

EDIT: I should note that this won't guarantee you'll have a million records per partition, just that you should end up somewhere in that ballpark.
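If an exact cap of a million records per file is really needed on Spark 2.1, a rough sketch (untested; `recordsPerFile` and `ChunkPartitioner` are illustrative names, and `data` is assumed to be an `RDD[String]`) would be to key each record by a chunk id via `zipWithIndex` and give each chunk its own partition, so `saveAsTextFile` writes one part-file per chunk:

import org.apache.spark.Partitioner

val recordsPerFile = 1000000L
// zipWithIndex gives every record a stable global index; dividing by recordsPerFile yields a chunk id
val indexed = data.zipWithIndex().map { case (line, idx) => (idx / recordsPerFile, line) }
val numChunks = (indexed.keys.max() + 1).toInt  // triggers a job, much like count()

// Route each chunk id to its own partition so each chunk becomes exactly one part-file
class ChunkPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = key.asInstanceOf[Long].toInt
}

indexed
  .partitionBy(new ChunkPartitioner(numChunks))
  .values
  .saveAsTextFile("output path")

Note that this costs an extra pass over the data plus a full shuffle from partitionBy, which is the price of exact file boundaries; for most workloads the coalesce approach above is the better trade-off.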

Robert Beatty
    Doesn't work. I don't think `coalesce` directly controls the number of output files as suggested by the thread you linked to. In fact, your answer should probably be a comment with a link to [Spark: Cut down no. of output files](http://stackoverflow.com/questions/25967961/spark-cut-down-no-of-output-files). It seems there are others trying to use `coalesce` in vain: [Spark dataFrame.colaesce...does not seem to work for me](http://stackoverflow.com/questions/31346647/spark-dataframe-colaesce1-or-dataframe-reapartition1-does-not-seem-to-work-f) – Abhijit Sarkar Mar 05 '17 at 03:13
  • I agree with the comment about this being a comment. Unfortunately, the way rep works on here, I can answer questions but not make comments. As to your main point, while I haven't tested the code above, I have used coalesce successfully in the past. But to your point, and as I commented above, it's not really for managing file counts. It's more about optimizing your cluster usage, though it should have the side effect of getting your file number/size closer to what you want. If having exactly a million records per file is a hard requirement of your project, you probably shouldn't be using Spark. – Robert Beatty Mar 07 '17 at 15:52
  • The underlying nature of RDDs just doesn't work that way, and any solution you find to make it work will almost assuredly be kludgy and a sub-optimal way of using Spark. – Robert Beatty Mar 07 '17 at 16:01