
I am using a Python 3.5 with Spark notebook in Watson Studio.

I am trying to export a Spark dataframe to Cloud Object Storage and it keeps failing.

The notebook does not give an error. I have managed to export smaller dataframes without issue.

When I check the object storage, only part of the dataframe is there.

I exported with the following code:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

from ingest.Connectors import Connectors

S3saveoptions = {
      Connectors.BluemixCloudObjectStorage.URL                      : paid_credentials['endpoint'],
      Connectors.BluemixCloudObjectStorage.IAM_URL                  : paid_credentials['iam_url'],
      Connectors.BluemixCloudObjectStorage.RESOURCE_INSTANCE_ID     : paid_credentials['resource_instance_id'],
      Connectors.BluemixCloudObjectStorage.API_KEY                  : paid_credentials['api_key'],
      Connectors.BluemixCloudObjectStorage.TARGET_BUCKET            : paid_bucket,
      Connectors.BluemixCloudObjectStorage.TARGET_FILE_NAME         : "name.csv",
      Connectors.BluemixCloudObjectStorage.TARGET_WRITE_MODE        : "write",
      Connectors.BluemixCloudObjectStorage.TARGET_FILE_FORMAT       : "csv",
      Connectors.BluemixCloudObjectStorage.TARGET_FIRST_LINE_HEADER : "true"}

name = df.write.format('com.ibm.spark.discover').options(**S3saveoptions).save()
  • It seems your executor dies; the most common reasons are that it ran out of memory or timed out. Can you provide trace logs? You can obtain them from the Spark UI for the cluster instance. One suggestion would be to repartition the data so it's evenly distributed: `name = df.repartition(300).write.format('com.ibm.spark.discover').options(**S3saveoptions).save()` – Manoj Singh Dec 05 '18 at 15:24
  • I'm not too familiar with the Spark APIs, but the most likely reason is that you're running out of memory. Watson Studio currently doesn't provide a notebook runtime with 40 GB of free RAM. So if your code is trying to collect all the data in the notebook kernel before writing it to COS, it's bound to fail. You can of course process 40 GB of data by distributing it across several Spark worker nodes. But then you'll have to let each worker write its own shard of the data directly to COS, instead of gathering everything on a single node. – Roland Weber Dec 10 '18 at 07:04
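Putting the two suggestions from the comments above together: repartitioning before the write spreads the data evenly across tasks so no single executor has to hold too much at once. The partition count of 300 below is an assumed starting point to tune, not a magic number:

name = df.repartition(300).write.format('com.ibm.spark.discover').options(**S3saveoptions).save()

Alternatively, each worker can write its own shard of the data straight to COS with Spark's built-in CSV writer, so nothing is gathered on a single node. The sketch below assumes the Stocator cos:// connector available in Watson Studio Spark kernels; the service label my_cos is arbitrary, and the exact fs.cos.* property names may differ by connector version:

# Register the COS credentials with the Hadoop/Stocator configuration.
# 'my_cos' is just a label; it only has to match the cos:// URI used below.
hconf = sc._jsc.hadoopConfiguration()
hconf.set("fs.cos.my_cos.endpoint", paid_credentials['endpoint'])
hconf.set("fs.cos.my_cos.iam.api.key", paid_credentials['api_key'])
hconf.set("fs.cos.my_cos.iam.service.id", paid_credentials['resource_instance_id'])

# Each task writes its own part file directly to the bucket, so the data
# is never collected on the driver.
(df.repartition(300)
   .write
   .option("header", "true")
   .mode("overwrite")
   .csv("cos://" + paid_bucket + ".my_cos/name_csv"))

Note that the second approach produces a directory of part-*.csv files rather than a single name.csv object; the parts would need to be concatenated afterwards if one file is required.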

0 Answers