0

I have to create the pipeline to transfer the data from BigQuery and save it as json file. But I got this error. The result from sql query is 30 million records. How to improve this code?

Error:

[error] (run-main-0) java.lang.OutOfMemoryError: Java heap space [error]

    object tmp {
      private val logger = LoggerFactory.getLogger(this.getClass)
      var date = "2023-05-22"

      def main(cmdlineArgs: Array[String]){
        val (sc, args) = ContextAndArgs(cmdlineArgs)
        val file_path = "src/main/scala/thunder/tmp.sql"
        val sql_content = Source.fromFile(file_path).mkString
        val queryConfig = QueryJobConfiguration.newBuilder(sql_content).build()
        val client = BigQueryOptions.getDefaultInstance().getService()
        val queryResult = client.query(queryConfig)
        var result = queryResult.iterateAll().iterator().asScala.map(_.asScala.map(_.getValue).toArray).toSeq
        val json_result = result.map { row =>
          val pin_username = row(0).toString
          val feature_name = row(1).toString
          implicit val formats = DefaultFormats
          write(Map(("pin_username"->pin_username),("feature_name" -> feature_name)))

        }
        sc.parallelize(json_result)
          .saveAsTextFile("output", ".json")

        sc.close().waitUntilFinish()
    }
}
James Z
  • 12,209
  • 10
  • 24
  • 44
P.pp
  • 9
  • 2

1 Answers1

0

The error java.lang.OutOfMemoryError: Java heap space is mainly caused by very high memory load on your workers since you are dealing with 30 million records of SQL query. One way to resolve this issue is to increase the workers' memory. It is best to set the size of your worker to n1-highmem-4 or above. This parameter can be used in choosing your machine type: --workerMachineType.

Another workaround is you can also check through heap dump to determine the out-of-memory (OOM) errors when JVM runs out of memory. Re-run the jobs with flags --dumpHeapOnOOM to save the heap dump locally and --saveHeapDumpsToGcsPath=gs://<path_to_a_gcs_bucket> to save it on a specific bucket which you have write permissions on. But using heat dump is only recommended when debugging since it has another cost on top of your current service.

Poala Astrid
  • 1,028
  • 2
  • 10