TL;DR
Data source: GCS
Target: BigQuery
Problem: a wildcard read picks up multiple files that share the same columns, but airport_fee is an integer in some files and a double in others
Error: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary
I am using the New York Taxi data for a project. Everything was going smoothly until I tried to load the data from GCS to BigQuery with Spark, after applying some light transformations (some type casting, column renaming, and a few filters).
The problem is a column called "airport_fee".
Apparently the column is of int type in some files and of double type in others.
Here is a sample:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

yellow_source = f"gs://{gcp_bucket}/yellow_trip_data/*"

spark = SparkSession \
    .builder \
    .master('yarn') \
    .config("spark.sql.files.ignoreCorruptFiles", "true") \
    .config("spark.sql.ansi.enabled", "true") \
    .appName('ny_taxi') \
    .getOrCreate()

# Cast airport_fee to double so the column has a single type before writing
df = spark.read.parquet(yellow_source) \
    .withColumn("airport_fee", F.col('airport_fee').cast('double'))

df.write \
    .mode("overwrite") \
    .option("overwriteSchema", "true") \
    .format("bigquery") \
    .option("temporaryGcsBucket", gcs_spark_bucket) \
    .option("dataset", staging_dataset) \
    .save("bqtb_stg_yellow")
I have hopelessly tried the above and many other variations (mergeSchema, overwriteSchema, ignoreCorruptFiles, an explicit schema via schema(schema)), even when they didn't make sense. All of them failed. The read-side variations looked roughly like the sketches below.
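For context, the read-side variations were along these lines (simplified sketches; the explicit schema is abridged to a couple of illustrative fields, not the full column list):

from pyspark.sql.types import StructType, StructField, DoubleType, TimestampType

# Variation 1: let Spark merge schemas across files on the wildcard read
df = spark.read.option("mergeSchema", "true").parquet(yellow_source)

# Variation 2: force an explicit schema up front so airport_fee is read as double
# (only two fields shown here for illustration; the real attempt listed all columns)
schema = StructType([
    StructField("tpep_pickup_datetime", TimestampType(), True),
    StructField("airport_fee", DoubleType(), True),
])
df = spark.read.schema(schema).parquet(yellow_source)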
I think I have only managed to make it work with one file at a time, but that would defeat the purpose of using Spark, wouldn't it? The files are sub-50 MB on average. (P.S. Wrapping the code in a for loop fails after the first couple of files anyway.)
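This is roughly what the one-file-at-a-time loop looked like (a sketch; the path pattern and month range are placeholders for the actual file list):

# Sketch of the per-file loop; the real code iterated over the actual blobs in the bucket
paths = [
    f"gs://{gcp_bucket}/yellow_trip_data/yellow_tripdata_2021-{month:02d}.parquet"
    for month in range(1, 13)
]

for path in paths:
    # Each file is read on its own, so the cast sees a single consistent source type
    df = spark.read.parquet(path) \
        .withColumn("airport_fee", F.col("airport_fee").cast("double"))
    df.write \
        .mode("append") \
        .format("bigquery") \
        .option("temporaryGcsBucket", gcs_spark_bucket) \
        .option("dataset", staging_dataset) \
        .save("bqtb_stg_yellow")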
I have ended up dropping the column altogether, as I have spent way too much time trying to resolve this, but I do not feel good about that.
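For reference, the workaround is nothing more than dropping the column before the write:

# Current workaround: drop the problematic column entirely and write the rest
df = spark.read.parquet(yellow_source).drop("airport_fee")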
Any tips would be appreciated.
Full error:
Caused by: java.lang.UnsupportedOperationException: org.apache.parquet.column.values.dictionary.PlainValuesDictionary$PlainDoubleDictionary