
I'm new to Spark and have no programming experience in Java. I'm using pyspark to process a very large time-series dataset with close to 4000 numeric (float) columns and billions of rows.

What I want to achieve with this dataset is the following:

The time-series data is at 10ms intervals. I want to group the data by 1s intervals and use mean as the aggregation function.

Here is the code that I'm using to read the partitioned parquet files.

df = (spark.read.option("mergeSchema", "true")
           .parquet("/data/"))

Here is the piece of code for groupby and aggregation that I wrote:

from pyspark.sql.functions import mean

col_list = [... list of numeric columns in the dataframe ...]

agg_funcs = [mean]   # I also want to add other aggregation functions here later.

# One aliased aggregation expression per (function, column) pair, e.g. mean_<column>.
exprs = [f(df[c]).alias(f.__name__ + '_' + c) for f in agg_funcs for c in col_list]

result = (df.groupBy(['Year', 'Month', 'Day', 'Hour', 'Minute', 'Second'])
            .agg(*exprs))
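
For illustration, adding more aggregation functions later would only require growing agg_funcs (stddev, min and max here are just example placeholders):

from pyspark.sql.functions import mean, stddev, min as min_, max as max_

# Any aggregate from pyspark.sql.functions can be appended to this list.
agg_funcs = [mean, stddev, min_, max_]

# Same comprehension as above: one aliased expression per (function, column) pair.
exprs = [f(df[c]).alias(f.__name__ + '_' + c) for f in agg_funcs for c in col_list]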

Now, I want to write the above result dataframe to a partitioned parquet:

(result.write.mode("overwrite")
       .partitionBy('Year', 'Month', 'Day', 'Hour', 'Minute', 'Second')
       .parquet('/out/'))

But I get a Java heap out-of-memory error.

I tried increasing spark.sql.shuffle.partitions so that each shuffle partition would be smaller, but that didn't help.

My spark cluster configuration:

2 workers + 1 master
Both the worker nodes have 256 GB RAM and 32 cores each.
Master node has 8 cores and 32 GB RAM.

The configuration I'm specifying for my spark job is:

{
    "driverMemory": "8G", 
    "driverCores": 4, 
    "executorMemory": "20G", 
    "executorCores": 4, 
    "numExecutors": 14, 
    "conf": {
        "spark.sql.shuffle.partitions": 2000000
    }
}

Following are some screenshots from Ambari regarding the configurations of the cluster:

[Screenshot: YARN memory]

[Screenshot: YARN CPU]

Can somebody please help me understand why there is a memory issue and how to fix it? Thanks.

varun

2 Answers


I believe this is happening because of data skew: one of your partitions is too large and runs out of memory.

Spark's groupBy() needs to load all of the values for a key into memory at once to perform the aggregation.

Increasing the number of partitions doesn't help if a large amount of data shares the same groupBy key. Check whether your data is skewed towards a few key values.

Check this article, which explains this in more detail.
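
For instance, a quick way to check for skew is to count the rows per groupBy key and look at the heaviest keys; a minimal sketch using the same key columns as in your code:

from pyspark.sql.functions import desc

key_cols = ['Year', 'Month', 'Day', 'Hour', 'Minute', 'Second']

# Rows per key: heavily skewed keys will show up at the top.
(df.groupBy(key_cols)
   .count()
   .orderBy(desc("count"))
   .show(20))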

  • As far as I know, there is no data skew, but I'll get back once I've actually checked. The data I'm dealing with is sensor data collected at 10ms intervals, so it is pretty much evenly spaced. – varun Oct 04 '19 at 09:04

Why don't you concatenate 'Year', 'Month', 'Day', 'Hour', 'Minute', 'Second' into a single key before doing the groupBy? After the groupBy you can recreate these columns. Also, try first without changing executor-cores, then change it to 15, and then to 7; 4 is too low, I think.
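
A rough sketch of that idea, reusing exprs from the question (the "-" separator and the ts_key column name are just placeholders):

from pyspark.sql.functions import concat_ws, split, col

key_cols = ['Year', 'Month', 'Day', 'Hour', 'Minute', 'Second']

# One concatenated string key instead of six separate groupBy columns.
df2 = df.withColumn("ts_key", concat_ws("-", *[col(c) for c in key_cols]))

result = df2.groupBy("ts_key").agg(*exprs)

# Recreate the original columns from the key afterwards.
# Note: they come back as strings, so cast them if needed.
parts = split(col("ts_key"), "-")
for i, c in enumerate(key_cols):
    result = result.withColumn(c, parts.getItem(i))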

firsni
  • From several blogs I read, the recommendation is to keep the executor cores below 5 to not hurt HDFS I/O throughput. Is that not correct? For example, this [cloudera blog](https://blog.cloudera.com/how-to-tune-your-apache-spark-jobs-part-2/) says **15 cores per executor can lead to bad HDFS I/O throughput**. – varun Oct 04 '19 at 09:02
  • I agree, but here I think you need more memory per executor. That's why I suggested reducing the number of executors. I'm not sure it will work. You can begin by concatenating the fields before changing the config. – firsni Oct 04 '19 at 09:20
  • I did a groupBy on 'epoch' instead of separate columns and also reduced the number of executors. This time, I don't get any out of memory error, but I'm getting a null pointer exception at `org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPage(VectorizedColumnReader.java:536)`. I also noticed that only a few tasks are doing all the reading/processing. – varun Oct 07 '19 at 04:00
  • You can repartition your data before doing the groupBy in order to spread the data across the cluster. Can you paste the code you use to read the parquet data? – firsni Oct 07 '19 at 07:18
  • I updated my question with the code I'm using to read the partitioned parquet files. I tried reading a larger dataset this time and more tasks are reading the data. Earlier, the size of the dataset I was testing on was only 30 MB compressed (4.5 GB uncompressed). Also, I think the null pointer exception is due to malformed parquet because I tried on a proper parquet file and there is no NPE. Weirdly, `pandas + fastparquet` is able to load the malformed parquet file whereas `pandas + pyarrow` complains that for a particular column the expected length is more than the actual length. – varun Oct 07 '19 at 11:15
  • You can try converting the pandas df into a Spark df. – firsni Oct 07 '19 at 11:44
  • The repartition is only after the parquet files have been read right? i.e, from the second stage onwards? The number of partitions in the first stage is governed by the way the parquet files are written or the `spark.sql.files.maxPartitionBytes` config right? How do I speed up the reads from HDFS in the first stage itself? Currently, from the way the number of input records is increasing in the spark UI, I think I'm only getting about 25 Mbps. – varun Oct 07 '19 at 11:45
  • You can change the default parallelism by setting the SparkContext's default parallelism. This will change the number of tasks. https://stackoverflow.com/questions/44222307/spark-rdd-default-number-of-partitions – firsni Oct 07 '19 at 11:55
  • I tried `spark.default.parallelism` and `spark.sql.shuffle.partitions`. These only change the number of tasks/partitions after the first stage (where the file scan is taking place). The only way to increase the number of tasks during stage 1 itself (file scan) that works for me is `spark.sql.files.maxPartitionBytes` and sadly even that did not increase the HDFS I/O throughput. – varun Oct 08 '19 at 05:13
  • Sorry for the lengthy comments, but how do I find out the minimum amount of memory needed per executor so as to avoid the OOM? Does it depend on the size of a partition? – varun Oct 08 '19 at 05:14
  • @varun did you solve the issue? I'm facing the same problem here, and similar dataframe. – magavo Jul 08 '21 at 02:11