Not sure if there is a place or people to ask for one-on-one advice on Dataproc setup and tuning, but I figure here is as good a place as any to find some help.
Our team has primarily been using BigQuery for our analysis of location-driven data. We're carrying data going back to 2019, so we're holding a lot of it. We've added some clustering (we've always had date partitioning) to help keep cost down, but it's getting to the point where it's just not feasible. At the moment we have upwards of 200 TB of data, and daily raw data ranges from 3-8 TB (it gets reduced quite a bit after a few steps).
First we'd like to move our 200 TB of data to GCS and segment it at a more granular level. The schema for this data is:
uid -- STRING
timestamp_of_observation -- TIMESTAMP
lat -- FLOAT
lon -- FLOAT
datasource -- STRING
cbg (short for census_block_group) -- STRING
We would like to save the data to GCS using hive partitioning so that our bucket folder structure looks like year > month > day > cbg
Knowing that we are processing about 200 TB and 3 years of data, and that CBGs alone have about 200,000 possible values, is this feasible? We have a few other options using either census block tracts (84,414 subfolders) or counties (35,000); the more granularity for us, the better.
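For what it's worth, my understanding of how Spark lays this out (please correct me if I'm off) is that partitionBy writes key=value subdirectories, so a downstream read that filters on those columns should only touch the matching folders. A rough sketch of the read pattern I'm hoping to enable (the bucket path matches my write job below; the CBG id is made up for illustration):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("cbg_readback_sketch").getOrCreate()

# expected layout from the write job further down:
#   gs://initial_test/cbg_data/year=2021/month=6/day=15/cbg=<id>/part-*.parquet
one_slice = spark.read.parquet("gs://initial_test/cbg_data/") \
    .where("year = 2021 AND month = 6 AND day = 15") \
    .where(col("cbg") == "482011000001")  # made-up CBG id

# the physical plan should show PartitionFilters doing the directory pruning
one_slice.explain()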
In my first attempts I either get an OOM or stages that just run forever. My initial PySpark code looks like the following:
from pyspark.sql import SparkSession
from pyspark.sql.functions import year, month, dayofmonth, rand, col

# use the appropriate spark-bigquery connector jar for your Scala version
spark = SparkSession.builder \
    .appName('BigNumeric') \
    .getOrCreate()

# bump shuffle partitions well above the default of 200
spark.conf.set("spark.sql.shuffle.partitions", 365 * 100)

# read the source table through the spark-bigquery connector
df = spark.read \
    .format("bigquery") \
    .load("data-location-338617.Raw_Location_Data.data_cbg_enrich_proto")

# derive the partition columns, plus a random salt to spread skewed keys
df1 = df.withColumn("year", year(col("visit_timestamp"))) \
    .withColumn("month", month(col("visit_timestamp"))) \
    .withColumn("day", dayofmonth(col("visit_timestamp"))) \
    .withColumn("cbg", col("boundary_partition")) \
    .withColumn("salt", rand())

# shuffle into 36,500 partitions keyed on salt + date, then write out
# Hive-partitioned Parquet by year/month/day/cbg
df1.repartition(365 * 100, "salt", "year", "month", "day") \
    .drop("salt") \
    .write.mode("overwrite") \
    .format("parquet") \
    .partitionBy("year", "month", "day", "cbg") \
    .save("gs://initial_test/cbg_data/")
This code was given to me by a fellow engineer. He told me to add the salt to deal with skew and to increase the number of shuffle partitions.
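One variant I've been wondering about (no idea if it's actually better) is to key the repartition on the same columns I use in partitionBy, plus a bucketed salt, so each shuffle task only writes into a handful of leaf folders instead of potentially fanning out across all 200,000 CBGs. A sketch of what I mean, reusing df1 from above (the 16 salt buckets is an arbitrary number I made up):
from pyspark.sql.functions import rand, floor

# bucketed salt so heavy day/cbg combos still get split across a few tasks
salted = df1.withColumn("salt_bucket", floor(rand() * 16))

salted.repartition("year", "month", "day", "cbg", "salt_bucket") \
    .drop("salt", "salt_bucket") \
    .write.mode("overwrite") \
    .format("parquet") \
    .partitionBy("year", "month", "day", "cbg") \
    .save("gs://initial_test/cbg_data/")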
Any and all advice would be helpful. The goal here is to do one huge batch job to migrate our historical data to GCS, and then start saving our transformed daily raw data to GCS as opposed to BigQuery.
I would envision the number of leaf folders/files written to be roughly 1,080 days * 200,000 CBGs (about 216,000,000), which seems like a lot. Is there a better way to organize this? Our original purpose was to make this data MUCH cheaper to query downstream. Right now the date partition has been the best way to minimize cost; we have clustering on the CBG column, but it doesn't seem to drive cost down very much. My thought is that with the GCS Hive structure, CBG (or another spatial grouping) essentially becomes a true partition rather than just a cluster.
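For the back-of-the-envelope math, this is how I'm counting leaf folders for each spatial grouping I'm considering (the day count is approximate, and it assumes at least one file per day per spatial bucket):
# rough count of day/spatial leaf folders for ~3 years of data
days = 1_080  # roughly 3 years of daily partitions
spatial_buckets = {
    "cbg": 200_000,
    "tract": 84_414,
    "county": 35_000,
}
for name, buckets in spatial_buckets.items():
    print(f"{name}: {days * buckets:,} leaf folders (and at least that many files)")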
Lastly I"m not doing much to the cluster configuration, I've played around with number of worker nodes and machines but haven't truly gotten anything to work again any help is appreciated and thank you for looking!
This is the cluster setup CLI command:
gcloud dataproc clusters create cluster-f35f \
  --autoscaling-policy location_data \
  --enable-component-gateway \
  --bucket cbg-test-patino \
  --region us-central1 \
  --zone us-central1-f \
  --master-machine-type n1-standard-8 \
  --master-boot-disk-type pd-ssd \
  --master-boot-disk-size 500 \
  --num-workers 30 \
  --worker-machine-type n2-standard-16 \
  --worker-boot-disk-type pd-ssd \
  --worker-boot-disk-size 1000 \
  --image-version 2.0-debian10 \
  --optional-components JUPYTER \
  --project data-*********** \
  --initialization-actions gs://goog-dataproc-initialization-actions-us-central1/connectors/connectors.sh \
  --metadata bigquery-connector-version=1.2.0 \
  --metadata spark-bigquery-connector-version=0.21.0
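For reference, my rough math on what that cluster gives me before autoscaling kicks in (assuming n2-standard-16 is 16 vCPUs / 64 GB per worker, which I believe is right):
# back-of-the-envelope worker capacity for the cluster above
workers = 30
vcpus_per_worker = 16    # n2-standard-16
mem_gb_per_worker = 64   # n2-standard-16 (4 GB per vCPU)

print(f"{workers * vcpus_per_worker} worker vCPUs, "
      f"~{workers * mem_gb_per_worker} GB worker RAM before YARN/daemon overhead, "
      f"against a ~200 TB backfill")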