Spark SQL data storage life cycle

Question

I recently had an issue with one of my Spark jobs: it read a Hive table with several billion records and failed because of high disk utilization. After I added an AWS EBS volume, the job ran without any issues. Although that resolved the problem, I still have a few doubts. I did some research but couldn't find any clear answers, so my question is:

When Spark SQL reads a Hive table, where is the data initially stored for processing, and what is the entire life cycle of the data in terms of storage if I don't explicitly specify anything? And how does adding EBS volumes solve the issue?

score 1 · Answer 1 · answered Nov 03 '21 at 15:13

Initially the data is in the table location in HDFS/S3/etc. Spark spills data to local storage if it does not fit in memory.

From the Apache Spark FAQ:

Does my data need to fit in memory to use Spark?

No. Spark's operators spill data to disk if it does not fit in memory, allowing it to run well on any sized data. Likewise, cached datasets that do not fit in memory are either spilled to disk or recomputed on the fly when needed, as determined by the RDD's storage level.
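As a rough sketch of where that spilled data ends up: on the executors, Spark writes shuffle and spill files to its scratch directories, which are set with `spark.local.dir` (on YARN the node manager's own local directories override this setting). The helper names and the `/mnt`, `/mnt1` mount points below are assumptions for illustration only; they are not taken from the question.

```python
def local_dir_conf(mount_points):
    """Build the Spark setting that points scratch space at the given mounts.

    mount_points: hypothetical paths where extra volumes (e.g. EBS) are mounted.
    """
    if not mount_points:
        raise ValueError("at least one mount point is required")
    # One scratch directory per mount; Spark accepts a comma-separated list.
    dirs = [f"{m.rstrip('/')}/spark-tmp" for m in mount_points]
    return {"spark.local.dir": ",".join(dirs)}


def to_submit_args(conf):
    """Turn a config dict into '--conf key=value' arguments for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    conf = local_dir_conf(["/mnt", "/mnt1/"])
    print(" ".join(to_submit_args(conf)))
```

Adding a volume only helps if one of the scratch directories actually lives on it, which is worth checking on the worker nodes.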

answered Nov 03 '21 at 15:13

leftjoin

When you say memory and local storage, is it the driver's (memory and local storage) or the executors'? And how does adding an EBS volume solve the issue? Is it equivalent to local storage in AWS? – user7343922 Nov 03 '21 at 19:48
Edited the question to include AWS EBS volume https://aws.amazon.com/premiumsupport/knowledge-center/executorlostfailure-slave-lost-emr/ – user7343922 Nov 03 '21 at 19:55
The executors' local memory and storage. EBS volumes are storage attached to the nodes, though they behave like network-attached drives. You can probably fix the issue by adding instance storage as well. The difference is explained here: https://medium.com/awesome-cloud/aws-difference-between-ebs-and-instance-store-f030c4407387 – leftjoin Nov 04 '21 at 08:57

Reeves · Answer 2 · 2021-11-03T16:08:55.743

Whenever Spark reads data from Hive tables, it holds it in an RDD. One point I want to make clear here is that Hive is just a warehouse, a layer on top of HDFS. When Spark interacts with Hive, Hive gives Spark the HDFS location where the table's data lives.

Thus, when Spark reads a file from HDFS, it creates a single partition for a single input split. The input split is determined by the Hadoop InputFormat used to read the file. For example, if you use textFile(), the InputFormat is Hadoop's TextInputFormat, which returns a single partition for a single HDFS block (note: the boundary between partitions falls on a line break, not on the exact block boundary), unless the file is compressed or stored in a format like Avro/Parquet, where the splits are determined differently.

If you manually call rdd.repartition(x), Spark shuffles the data from the N partitions in the RDD into the x partitions you asked for; the data is distributed on a round-robin basis.

If you have a 10 GB uncompressed text file stored on HDFS with a 256 MB block size (note that the stock dfs.blocksize default in Hadoop 2 and later is 128 MB, which would double these counts), it is stored in 40 blocks, which means that the RDD you read from this file has 40 partitions. When you call repartition(1000), the RDD is only marked to be repartitioned; it is actually shuffled into 1000 partitions only when you execute an action on top of it (lazy execution).
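The block arithmetic above is easy to check. This is a minimal sketch that assumes roughly one partition per HDFS block for an uncompressed, splittable text file; the function name is made up for illustration. The 128 MB case is included for comparison because that is the stock `dfs.blocksize` default in Hadoop 2 and later.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def block_count(file_size_bytes, block_size_bytes):
    """Number of HDFS blocks (and, roughly, initial partitions) for a file."""
    if file_size_bytes < 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    # Ceiling division: a partial last block still counts as a block.
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    print(block_count(10 * GB, 256 * MB))  # 256 MB blocks
    print(block_count(10 * GB, 128 * MB))  # 128 MB blocks
```

The actual partition count can differ, since the reader's split settings also play a role.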

From here it is up to Spark how it processes the data. Because Spark evaluates lazily, it prepares a DAG for optimal processing before doing any work. One more point: Spark needs configuration for driver memory, number of cores, number of executors, etc., and if the configuration is inappropriate the job will fail.

Once the DAG is prepared, Spark starts processing the data. It divides your job into stages and the stages into tasks. Each task runs on a specific executor and may involve shuffling and partitioning. So in your case, when processing billions of records, your configuration may not have been adequate. One more point: when we say Spark loads the data into an RDD/DataFrame, that data is managed by Spark, and there are options to keep it in memory only, on disk only, in memory and on disk, etc. (ref: storage_spark).

Briefly,

Hive --> HDFS --> Spark --> RDD (where the data is stored depends, since evaluation is lazy).

You may refer to the following link: Spark RDD - is partition(s) always in RAM?

jgp · Accepted Answer · 2021-11-03T21:47:16.900

1

Spark will read the data; if it does not fit in memory, Spark spills it to disk.

A few things to note:

Data in memory is compressed; from what I read, you gain about 20% (e.g. a 100 MB file will take only 80 MB of memory).
Ingestion starts as soon as you call read(); it is not part of the DAG. You can limit how much you ingest in the SQL query itself. The read operation is done by the executors. This example should give you a hint: https://github.com/jgperrin/net.jgp.books.spark.ch08/blob/master/src/main/java/net/jgp/books/spark/ch08/lab300_advanced_queries/MySQLWithWhereClauseToDatasetApp.java
In the latest versions of Spark, you can push down the filter (for example, if you filter right after the ingestion, Spark will know and optimize the ingestion). I think this works only for CSV, Avro, and Parquet. For databases (including Hive), the previous example is what I'd recommend.
Storage MUST be visible/accessible from the executors, so if you have EBS volumes, make sure they are visible/accessible from the cluster nodes where the executors/workers are running, not just from the node where the driver is running.
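Limiting what you ingest in the SQL query itself could look like the sketch below. The table name, column names and helper functions are hypothetical, and `spark` is assumed to be an existing SparkSession created with Hive support enabled; this is only one way to do it, not the approach from the linked Java example.

```python
def build_query(table, columns, predicate=None):
    """Return a SELECT that names only the needed columns and an optional filter."""
    if not columns:
        raise ValueError("list the columns you need instead of SELECT *")
    sql = f"SELECT {', '.join(columns)} FROM {table}"
    if predicate:
        # The filter is part of the query, so less data has to be read.
        sql += f" WHERE {predicate}"
    return sql


def read_filtered(spark, table, columns, predicate=None):
    """Run the narrowed query on an existing SparkSession and return the result."""
    return spark.sql(build_query(table, columns, predicate))


if __name__ == "__main__":
    # Hypothetical table and columns, printed here instead of executed.
    print(build_query("sales.events", ["user_id", "amount"],
                      "event_date = '2021-11-01'"))
```

Selecting fewer columns and rows up front reduces the amount of data that can end up spilled to the executors' disks later.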

edited Nov 03 '21 at 21:47

answered Nov 03 '21 at 15:27

jgp

When you say memory and disk, is it the driver's (memory and local storage) or the executors'? And how does adding an EBS volume solve the issue? Is it equivalent to local storage in AWS? – user7343922 Nov 03 '21 at 19:49
Edited the question to include EBS volume https://aws.amazon.com/premiumsupport/knowledge-center/executorlostfailure-slave-lost-emr/ – user7343922 Nov 03 '21 at 19:55
The executors, unless your code contains collect() or similar functions. EBS is just disk available to your server/cluster. You may need more "local" storage. – jgp Nov 03 '21 at 21:45
Added an extra bullet point on EBS, hth. – jgp Nov 03 '21 at 21:47
