I am compressing CSV file data (400 GB or 1.2 TB) with bzip2 and writing it to PostgreSQL from a Spark standalone cluster.
However, when Spark writes the data to PostgreSQL through the JDBC driver, some of the job's tasks stop making progress, and I cannot tell which tasks are stuck.
[Spark UI screenshot] In the picture, the task status is shown as RUNNING, but the task has been running for about 17 hours and never finishes.
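For reference, the write step looks roughly like the sketch below (simplified; the input path, table name, connection URL, and credentials are placeholders, not the real values):

    # Simplified sketch of the pipeline described above; paths, table name,
    # and connection details are placeholders.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-to-postgres").getOrCreate()

    # Spark reads bzip2-compressed CSV transparently; .bz2 is splittable,
    # but decompression is CPU-heavy, which slows down the read stage.
    df = spark.read.csv("/data/input/*.csv.bz2", header=True, inferSchema=False)

    # Write to PostgreSQL over JDBC. Each partition opens its own connection,
    # so the number of partitions controls write parallelism.
    (df.write
       .format("jdbc")
       .option("url", "jdbc:postgresql://db-host:5432/mydb")   # placeholder
       .option("dbtable", "public.target_table")               # placeholder
       .option("user", "postgres")                             # placeholder
       .option("password", "********")
       .option("driver", "org.postgresql.Driver")
       .option("batchsize", 10000)  # rows per JDBC batch insert
       .mode("append")
       .save())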
SUMMARY
Why do the tasks stop working?
Why does compressing (bzip2) the CSV data files take so long?
I expect no tasks to stall and the whole job to finish in 5 to 6 hours.
Additional information:

1. Spark setup: one master and three workers (each with 16 cores and 28 GB of memory), Spark 3.2.0 standalone.
2. Spark conf:

    data = {
        'kind': 'pyspark',
        'driverCores': 5,
        'driverMemory': '8G',
        'numExecutors': 8,
        'executorCores': 6,
        'executorMemory': '8G',
        'jars': ['local:///usr/local/spark/jars/postgresql-42.2.24.jar'],
        'conf': {
            'spark.driver.extraClassPath': 'local:///usr/local/spark/jars/postgresql-42.2.24.jar',
            'spark.executor.extraJavaOptions': '-Dfile.encoding=UTF-8 -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError="kill -9 %p"',
            'spark.driver.extraJavaOptions': '-Dfile.encoding=UTF-8 -XX:+UseG1GC -XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark -XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:OnOutOfMemoryError="kill -9 %p"',
            'spark.yarn.appMasterEnv.LANG': 'ko_KR.UTF-8',
            'spark.yarn.appMasterEnv.PYTHONIOENCODING': 'utf8',
            'spark.dynamicAllocation.enabled': 'false',
            'spark.yarn.driver.memoryOverhead': '1G',
            'spark.yarn.executor.memoryOverhead': '1G',
            'spark.default.parallelism': 80,
            'spark.sql.shuffle.partitions': 80,
            'spark.rdd.compress': True,
            'spark.shuffle.compress': True,
            'spark.shuffle.spill.compress': True,
            'spark.driver.memory': '8G',
            'spark.executor.memory': '8G',
            'spark.executor.cores': '5',
            'spark.executor.instances': '8',
            'spark.network.timeout': '800s',
            'spark.executor.heartbeatInterval': '60s',
            'spark.storage.level': 'MEMORY_AND_DISK_SER',
            'spark.yarn.scheduler.reporterThread.maxFailures': '5',
            'mapreduce.map.output.compress': True,
            'spark.memory.fraction': '0.80',
            'spark.memory.storageFraction': '0.30',
            'spark.serializer': 'org.apache.spark.serializer.KryoSerializer',
        },
    }
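For context, this dict appears to be the body of a Livy session request; if so, it would be submitted roughly as in the sketch below (the Livy host and port are placeholders):

    # Sketch of submitting the session config above through Livy's REST API.
    # The Livy endpoint is a placeholder, not the real host.
    import json
    import requests

    livy_url = "http://livy-host:8998"              # placeholder
    headers = {"Content-Type": "application/json"}

    # 'data' is the session configuration dict shown above.
    resp = requests.post(livy_url + "/sessions", data=json.dumps(data), headers=headers)
    print(resp.json())  # returns the new session's id and state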