
I am trying to work with PySpark on this Google public BigQuery table (table size: 268.42 GB, number of rows: 611,647,042). I set the region of the cluster to US (the same as the BigQuery table), but the code is extremely slow even when using several high-performance machines in the cluster. Any idea why? Should I create a copy of the public BigQuery table in my bucket instead? If yes, how?

from pyspark.sql import SparkSession

spark = SparkSession.builder \
  .master('yarn') \
  .appName('spark-bigquery-crypto') \
  .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
  .getOrCreate()

# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the spark-bigquery-connector.
bucket = "dataproc-staging-us-central1-397704471406-lrrymuq9"
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
eth_transactions = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:crypto_ethereum.transactions') \
  .load()
eth_transactions.createOrReplaceTempView('eth_transactions')

# Perform SQL query.
df = spark.sql('''SELECT * FROM eth_transactions WHERE DATE(block_timestamp) between "2019-01-01" and "2019-01-31"''')
– frebls

1 Answer


This is because Spark first needs to read the entire table from BigQuery before your query runs on the Spark executors.

If you run this query directly in the BigQuery UI, it will be much faster than using Spark.

If your project is simple enough, I would suggest running the SQL queries with a scheduler such as Composer. But if you want to stick with Spark, try to filter as much data as possible on the BigQuery side.
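
A minimal sketch of that last point, using the connector's `filter` option (the same push-down mechanism shown in the comments below) against the table and the `spark` session from the question:

# Push the date filter down to BigQuery instead of filtering after the load.
eth_jan_2019 = spark.read.format('bigquery') \
  .option('table', 'bigquery-public-data:crypto_ethereum.transactions') \
  .option('filter', "DATE(block_timestamp) between '2019-01-01' and '2019-01-31'") \
  .load()

eth_jan_2019.createOrReplaceTempView('eth_transactions')
df = spark.sql('SELECT * FROM eth_transactions')

With the filter applied at read time, only the matching month of rows should leave BigQuery rather than the full 268 GB table.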

– Will
  • The problem is that I will need to use complex network functions from the graphframes Spark package, which are not available in SQL, and apply them to the whole dataset. – frebls Jul 02 '21 at 16:29
  • Also, even the following takes several minutes to run: `# Load data from BigQuery. table = 'bigquery-public-data:crypto_ethereum.transactions' eth_transactions = spark.read \ .format('bigquery') \ .option('table', table) \ .option("filter", "DATE(block_timestamp) between '2020-01-01' and '2020-01-31'") \ .load() eth_transactions.createOrReplaceTempView('eth_transactions') # Perform SQL query. eth_data = spark.sql( '''SELECT * FROM eth_transactions limit 10''') eth_data.show()` – frebls Jul 02 '21 at 16:38
  • Then the best option you have is to create a BigQuery table which is the result of this query, and then process it with Spark (see the sketch after these comments). – Will Jul 02 '21 at 16:52
  • @frebls While trying to reproduce your issue, I could run the jobs successfully with the given code snippet on a Dataproc cluster having the default configuration. The jobs took approximately 40 seconds. I could not find any issue with the code. – Krish Jul 08 '21 at 14:41
  • @frebls It appears that this issue has to be investigated further, so if you have a [support plan](https://cloud.google.com/support/) please create a new GCP support case. Otherwise, you can open a new issue on the [issue tracker](https://cloud.google.com/support/docs/issue-trackers). As you will have to share your project ID, open it using [this link](https://cloud.google.com/support/docs/issue-trackers). – Krish Jul 08 '21 at 14:42
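
A minimal sketch of the "create a BigQuery table first" approach suggested above: materialise the filtered month into a table in your own project with the google-cloud-bigquery client, then point the Spark connector at that much smaller table. The project, dataset and table names (your-project, eth_subset, transactions_2019_01) are placeholders, and spark is the session from the question.

from google.cloud import bigquery

# Run the filter inside BigQuery and write the result to a destination table.
# The destination dataset (eth_subset) must already exist.
client = bigquery.Client()
destination = bigquery.TableReference.from_string(
    'your-project.eth_subset.transactions_2019_01')  # placeholder table
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition='WRITE_TRUNCATE',  # overwrite the table if it exists
)

sql = """
    SELECT *
    FROM `bigquery-public-data.crypto_ethereum.transactions`
    WHERE DATE(block_timestamp) BETWEEN '2019-01-01' AND '2019-01-31'
"""
client.query(sql, job_config=job_config).result()  # wait for the job to finish

# Read the (much smaller) materialised table with the spark-bigquery connector.
eth_jan_2019 = spark.read.format('bigquery') \
  .option('table', 'your-project.eth_subset.transactions_2019_01') \
  .load()

Because the filtering happens entirely inside BigQuery, Spark only reads the month of data it actually needs.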