
I am processing bzip2-compressed CSV data (400 GB, or 1.2 TB) and writing it to PostgreSQL on a Spark standalone cluster.

However, when Spark writes the data to PostgreSQL through the JDBC driver, the job's tasks stall.

I am not sure which task stopped, or why.

(Screenshot of the Spark UI.) The task status shows as RUNNING, but the task has been running for about 17 hours and never finishes.
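For reference, the pipeline is roughly the following (a minimal sketch, not the actual code; the input path, table name, and connection details are hypothetical placeholders):

```python
# Minimal sketch: read bzip2-compressed CSV, write to PostgreSQL over JDBC.
# Paths, table name, and credentials below are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-postgres").getOrCreate()

# Spark decompresses .bz2 input transparently on read.
df = spark.read.csv(
    "/data/input/*.csv.bz2",
    header=True,
    inferSchema=False,  # inferSchema=True would force a second full pass over the input
)

(df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://db-host:5432/mydb")
    .option("dbtable", "public.target_table")
    .option("user", "postgres")
    .option("password", "********")
    .option("driver", "org.postgresql.Driver")
    .option("batchsize", "10000")  # rows per batched INSERT
    .mode("append")
    .save())
```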

SUMMARY

  1. Why do the tasks stop working?

  2. Why does processing the bzip2-compressed CSV files take so long? (See the sketch after this list.)

I expected no tasks to stall and the whole job to finish within 5 to 6 hours.
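Regarding question 2: each JDBC write task opens one connection and inserts its own partition's rows, so the partition count at write time determines the write parallelism. A hedged sketch, continuing from the `df` above (the count of 80 mirrors `spark.default.parallelism` in the conf below and is illustrative only, not a recommendation):

```python
# Hypothetical connection settings, reusing the placeholders from the first sketch.
jdbc_options = {
    "url": "jdbc:postgresql://db-host:5432/mydb",
    "dbtable": "public.target_table",
    "user": "postgres",
    "password": "********",
    "driver": "org.postgresql.Driver",
    "batchsize": "10000",
}

# bzip2 input is splittable but CPU-heavy to decompress, so the partition
# layout after the read can be few or uneven; repartitioning evens out the
# work across write tasks.
df = df.repartition(80)  # 80 is illustrative, matching spark.default.parallelism
df.write.format("jdbc").options(**jdbc_options).mode("append").save()
```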

Additional details:

1. Spark setup: one master, three workers (each with 16 cores and 28 GB memory), Spark 3.2.0 standalone.

2. Spark conf (the session payload passed as `data`):

```python
data = {
    'kind': 'pyspark',
    'driverCores': 5,
    'driverMemory': '8G',
    'numExecutors': 8,
    'executorCores': 6,
    'executorMemory': '8G',
    'jars': ['local:///usr/local/spark/jars/postgresql-42.2.24.jar'],
    'conf': {
        'spark.driver.extraClassPath': 'local:///usr/local/spark/jars/postgresql-42.2.24.jar',
        'spark.executor.extraJavaOptions': "-Dfile.encoding=UTF-8 -XX:+UseG1GC "
            "-XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark "
            "-XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails "
            "-XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
        'spark.driver.extraJavaOptions': "-Dfile.encoding=UTF-8 -XX:+UseG1GC "
            "-XX:+UnlockDiagnosticVMOptions -XX:+G1SummarizeConcMark "
            "-XX:InitiatingHeapOccupancyPercent=35 -verbose:gc -XX:+PrintGCDetails "
            "-XX:+PrintGCDateStamps -XX:OnOutOfMemoryError='kill -9 %p'",
        'spark.yarn.appMasterEnv.LANG': 'ko_KR.UTF-8',
        'spark.yarn.appMasterEnv.PYTHONIOENCODING': 'utf8',
        'spark.dynamicAllocation.enabled': 'false',
        'spark.yarn.driver.memoryOverhead': '1G',
        'spark.yarn.executor.memoryOverhead': '1G',
        'spark.default.parallelism': 80,
        'spark.sql.shuffle.partitions': 80,
        'spark.rdd.compress': True,
        'spark.shuffle.compress': True,
        'spark.shuffle.spill.compress': True,
        'spark.driver.memory': '8G',
        'spark.executor.memory': '8G',
        'spark.executor.cores': '5',
        'spark.executor.instances': '8',
        'spark.network.timeout': '800s',
        'spark.executor.heartbeatInterval': '60s',
        'spark.storage.level': 'MEMORY_AND_DISK_SER',
        'spark.yarn.scheduler.reporterThread.maxFailures': '5',
        'mapreduce.map.output.compress': True,
        'spark.memory.fraction': '0.80',
        'spark.memory.storageFraction': '0.30',
        'spark.serializer': 'org.apache.spark.serializer.KryoSerializer',
    },
}
```
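This payload has the shape of an Apache Livy session request (`'kind': 'pyspark'`, `'numExecutors'`, `'jars'`, `'conf'`). Assuming the job is submitted that way, a minimal sketch of the submission step (the Livy host is a placeholder):

```python
# Sketch: creating a Livy session with the `data` payload above.
# "livy-host:8998" is a placeholder for the actual Livy endpoint.
import json
import requests

livy_url = "http://livy-host:8998"

resp = requests.post(
    f"{livy_url}/sessions",
    data=json.dumps(data),
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print("Created Livy session:", resp.json()["id"])
```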

  • Please clarify your specific problem or provide additional details to highlight exactly what you need. As it's currently written, it's hard to tell exactly what you're asking. – Community Nov 07 '22 at 07:47

0 Answers