I am trying to work with PySpark on this Google public BigQuery table (table size: 268.42 GB, number of rows: 611,647,042). I set the cluster's region to US (the same as the BigQuery table's), but the code is extremely slow even when using several high-performance machines in the cluster. Any idea why? Should I create a copy of the public BigQuery table in my own bucket instead? If yes, how?
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .master('yarn') \
    .appName('spark-bigquery-crypto') \
    .config('spark.jars', 'gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar') \
    .getOrCreate()

# Use the Cloud Storage bucket for temporary BigQuery export data used
# by the spark-bigquery-connector.
bucket = "dataproc-staging-us-central1-397704471406-lrrymuq9"
spark.conf.set('temporaryGcsBucket', bucket)

# Load data from BigQuery.
eth_transactions = spark.read.format('bigquery') \
    .option('table', 'bigquery-public-data:crypto_ethereum.transactions') \
    .load()
eth_transactions.createOrReplaceTempView('eth_transactions')

# Perform SQL query.
df = spark.sql('''SELECT * FROM eth_transactions WHERE DATE(block_timestamp) BETWEEN "2019-01-01" AND "2019-01-31"''')
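For reference, this is a variant I am considering based on my reading of the spark-bigquery-connector docs: passing the date condition through the connector's filter option so the restriction is applied on the BigQuery side instead of after loading the whole table. I have not tested it on this table, and the exact predicate syntax the connector accepts for pushdown is my assumption, so treat it as a sketch:

# Sketch (untested): push the date restriction into the read via the
# connector's 'filter' option; predicate syntax is my assumption.
eth_jan_2019 = spark.read.format('bigquery') \
    .option('table', 'bigquery-public-data:crypto_ethereum.transactions') \
    .option('filter', "DATE(block_timestamp) BETWEEN '2019-01-01' AND '2019-01-31'") \
    .load()
eth_jan_2019.createOrReplaceTempView('eth_transactions_jan_2019')

Would this kind of read-time filtering be the right direction, or is copying the table the better option?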