
I am trying to load about 40 large JSON files (150-200 GB each on average) into Spark using sparklyr. Some of the files would fit entirely in the cluster's RAM; some would be too big.

Unfortunately, the command:

    spark_read_json(sc, name = "json_data", path = "mypath/files_json*", memory = FALSE)

creates about 500k tasks and takes forever, no matter what my cluster allocation is (number of cores allocated, number of executors, RAM, etc.).

I kept playing with `config <- spark_config()`, `config$spark.executor.memory`, `config$spark.executor.cores`, `config$spark.default.parallelism`, and others, but the number of tasks does not change. I do have a large cluster available, though.
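For context, the tuning attempts look roughly like this (a minimal sketch; the resource values are arbitrary placeholders rather than recommendations, and `yarn` stands in for whatever cluster manager is actually in use):

    library(sparklyr)

    config <- spark_config()
    config$spark.executor.memory     <- "32G"  # placeholder value
    config$spark.executor.cores      <- 4      # placeholder value
    config$spark.default.parallelism <- 400    # placeholder value

    sc <- spark_connect(master = "yarn", config = config)

    # None of these settings changed the ~500k tasks produced by the read.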

I feel there is a serious optimization problem here. Any idea what I should change in the Spark options or elsewhere? I have desperately tried every option I could think of.

Thanks!!!

ℕʘʘḆḽḘ
  • How many JSON files do you have? Spark doesn't do well with reading small files. – eliasah Jun 29 '17 at 11:50
  • @eliasah About 40, `200GB` each on average. Some of them would fit in RAM, some would be too big. – ℕʘʘḆḽḘ Jun 29 '17 at 11:54
  • I'm not sure there is much to do with sparklyr here, nor with plain Spark for that matter. Those files are huge and they shouldn't be. – eliasah Jun 29 '17 at 12:01
  • @eliasah Dude, big data is big data. I was able to process these files in `PIG` easily. There has to be some rational explanation here; something is hugely under-optimized. – ℕʘʘḆḽḘ Jun 29 '17 at 12:02
  • I know what big data is, @Noobie. :) – eliasah Jun 29 '17 at 12:04
  • The rational explanation is that Spark is trying to partition those files so they fit into memory and so that it can parallelize the computation afterwards (to put it briefly). – eliasah Jun 29 '17 at 12:05
  • Add to that the fact that JSON is very slow to read because it takes up a lot of space in memory (everything is a string), and it also needs to be parsed (again, everything is a string). – eliasah Jun 29 '17 at 12:09
  • @eliasah Yeah, but the option `memory = FALSE` is supposed to only map the files, not load them into memory. Indeed, all I want to do is some filtering and some simple processing. – ℕʘʘḆḽḘ Jun 29 '17 at 12:11
  • My personal opinion is to convert those JSON files to Parquet format. Once it's done, it's done, and processing them will be much faster. – eliasah Jun 29 '17 at 12:12
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147928/discussion-between-eliasah-and-noobie). – eliasah Jun 29 '17 at 12:12
  • @eliasah I cannot use the chat, sorry. How would you convert them to Parquet from R? Why would that be faster? – ℕʘʘḆḽḘ Jun 29 '17 at 12:13
  • Here is another problem when you don't load into memory: Spark's power and speed come from accessing fast memory (RAM), so if the data stays on disk, access is at least 200x slower. – eliasah Jun 29 '17 at 12:13
  • If you are stuck with just using R, you can read the JSON and write it out as Parquet with a schema; otherwise it's better to process them with Scala/Python Spark. – eliasah Jun 29 '17 at 12:14 (see the sketch after these comments)
  • @eliasah Maybe you should post an answer based on that; that is interesting. However, Spark must also be good at processing larger-than-RAM datasets, otherwise it is not a big data technology. – ℕʘʘḆḽḘ Jun 29 '17 at 12:16
  • Have you compared Spark vs. Pig processing on that data? – eliasah Jun 29 '17 at 12:35
  • I'm not sure this is a valid answer format for SO, as it's more a discussion than a concrete answer. – eliasah Jun 29 '17 at 12:36
  • Yes, `PIG` was much faster, but on a different cluster. Now I have no choice other than to use Spark. – ℕʘʘḆḽḘ Jun 29 '17 at 12:36
  • What kind of cluster, if I may ask? – eliasah Jun 29 '17 at 12:38
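
Pulling eliasah's suggestion together, a one-off JSON-to-Parquet conversion from sparklyr might look like the sketch below. It is untested against the data in question: the paths, table names, and the two-column schema are made-up placeholders, and the `columns` argument (which lets Spark skip schema inference) assumes a reasonably recent sparklyr.

    library(sparklyr)
    library(dplyr)

    sc <- spark_connect(master = "yarn", config = spark_config())

    # Map the JSON files without caching them (memory = FALSE) and pass an
    # explicit schema so Spark does not have to scan the data to infer one.
    # The column names and types here are placeholders.
    json_tbl <- spark_read_json(
      sc, name = "json_raw",
      path = "mypath/files_json*",
      memory = FALSE,
      columns = c(id = "character", value = "double")
    )

    # One-off conversion: write everything out as Parquet.
    spark_write_parquet(json_tbl, path = "mypath/files_parquet", mode = "overwrite")

    # From now on, work off the Parquet copy, which is columnar and far
    # cheaper to parse than JSON.
    pq_tbl <- spark_read_parquet(sc, name = "data_pq",
                                 path = "mypath/files_parquet",
                                 memory = FALSE)

    # Filtering / simple processing is expressed with dplyr verbs and executed
    # by Spark, so the data never has to fit in R's memory.
    pq_tbl %>%
      filter(value > 0) %>%   # placeholder filter
      count()

Whether this changes the initial task count depends on how the input is split, but it does remove the JSON parsing cost from every query after the first.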

0 Answers